Title: | Random Projection Ensemble Classification |
---|---|
Description: | Implements the methodology of "Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959--1035". The random projection ensemble classifier is a general method for classification of high-dimensional data, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. The random projections are divided into non-overlapping blocks, and within each block the projection yielding the smallest estimate of the test error is selected. The random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. |
Authors: | Timothy I. Cannings and Richard J. Samworth |
Maintainer: | Timothy I. Cannings <[email protected]> |
License: | GPL-3 |
Version: | 0.5 |
Built: | 2025-02-15 04:09:08 UTC |
Source: | https://github.com/cran/RPEnsemble |
Implements the methodology of Cannings and Samworth (2017). The random projection ensemble classifier is a very general method for classification of high-dimensional data, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. The random projections are divided into non-overlapping blocks, and within each block the projection yielding the smallest estimate of the test error is selected. The random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment.
RPChoose chooses the projection from a block of size B2 that minimises an estimate of the test error (see Cannings and Samworth, 2017, Section 3), and classifies the training and test sets using the base classifier on the projected data. RPParallel makes many calls to RPChoose in parallel. RPalpha chooses the best empirical value of alpha (see Cannings and Samworth, 2017, Section 5.1). RPEnsembleClass combines the results of many base classifications to classify the test set.
The method can be used with any base classifier, any test error estimate and any distribution of the random projections. This package provides code for the following options:
Classifiers – linear discriminant analysis, quadratic discriminant analysis and the k-nearest neighbour classifier.
Error estimates – resubstitution and leave-one-out; the sample-splitting method described in Cannings and Samworth (2017, Section 7) is also provided (set estmethod = "samplesplit").
Projection distribution – Haar, Gaussian or axis-aligned projections.
The package also lets you add your own base classifier and estimation method by editing the code in the function Other.classifier. Moreover, one could edit the RPGenerate function to generate projections from different distributions.
Timothy I. Cannings and Richard J. Samworth
Maintainer: Timothy I. Cannings <[email protected]>
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
#generate data from Model 2
set.seed(101)
Train <- RPModel(2, 50, 100, 0.5)
Test <- RPModel(2, 100, 100, 0.5)

#Classify the training and test set for B1 = 10 independent projections, each
#one carefully chosen from a block of size B2 = 10, using the "knn" base
#classifier and the leave-one-out test error estimate
Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2,
  B1 = 10, B2 = 10, base = "knn", projmethod = "Haar", estmethod = "loo",
  splitsample = FALSE, k = seq(1, 25, by = 3), clustertype = "Default")

#estimate the class 1 prior probability
phat <- sum(Train$y == 1)/50

#choose the best empirical value of the voting threshold alpha
alphahat <- RPalpha(RP.out = Out, Y = Train$y, p1 = phat)

#combine the base classifications
Class <- RPEnsembleClass(RP.out = Out, n = 50, n.test = 100, p1 = phat,
  alpha = alphahat)

#calculate the error
mean(Class != Test$y)

#Code for sample splitting version of the above
#n.val <- 25
#s <- sample(1:50,25)
#OutSS <- RPParallel(XTrain = Train$x[-s,], YTrain = Train$y[-s],
#XVal = Train$x[s,], YVal = Train$y[s], XTest = Test$x, d = 2,
#B1 = 50, B2 = 10, base = "knn", projmethod = "Haar", estmethod = "samplesplit",
#k = seq(1,13, by = 2), clustertype = "Fork", cores = 1)
#alphahatSS <- RPalpha(RP.out = OutSS, Y = Train$y[s], p1 = phat)
#ClassSS <- RPEnsembleClass(RP.out = OutSS, n.val = 25, n.test = 100,
#p1 = phat, samplesplit = TRUE, alpha = alphahatSS)
#mean(ClassSS != Test$y)
User defined code to convert existing R code for classification to the correct format
Other.classifier(x, grouping, xTest, CV, ...)
x |
An |
grouping |
A vector of length |
xTest |
An |
CV |
If |
... |
Optional arguments e.g. tuning parameters |
User editable code for your choice of base classifier.
class |
a vector of classes of the training or test set |
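As an illustration of what a user-supplied base classifier might look like, the following is a minimal sketch of a nearest-centroid classifier that follows the Other.classifier(x, grouping, xTest, CV, ...) argument layout. The function name nearest_centroid, the interpretation of CV (TRUE returns leave-one-out predictions for the training set, FALSE returns predictions for xTest) and the list(class = ...) return shape are assumptions made for this example, not the package's own code.

```r
# Hypothetical nearest-centroid base classifier (illustration only).
# Assumptions: classes are coded 1 and 2; CV = TRUE gives leave-one-out
# predictions for the training set, CV = FALSE gives predictions for xTest.
nearest_centroid <- function(x, grouping, xTest = NULL, CV = FALSE, ...) {
  x <- as.matrix(x)
  predict_one <- function(train_x, train_y, new_x) {
    m1 <- colMeans(train_x[train_y == 1, , drop = FALSE])  # class 1 centroid
    m2 <- colMeans(train_x[train_y == 2, , drop = FALSE])  # class 2 centroid
    d1 <- rowSums(sweep(new_x, 2, m1)^2)  # squared distance to class 1
    d2 <- rowSums(sweep(new_x, 2, m2)^2)  # squared distance to class 2
    ifelse(d1 <= d2, 1, 2)
  }
  if (CV) {
    # Refit without observation i, then classify observation i
    cls <- vapply(seq_len(nrow(x)), function(i) {
      predict_one(x[-i, , drop = FALSE], grouping[-i], x[i, , drop = FALSE])
    }, numeric(1))
  } else {
    cls <- predict_one(x, grouping, as.matrix(xTest))
  }
  list(class = cls)
}

# Toy usage on two well-separated Gaussian clusters
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = -2), 20, 2),
           matrix(rnorm(40, mean = 2), 20, 2))
y <- rep(1:2, each = 20)
loo  <- nearest_centroid(x, y, CV = TRUE)$class
test <- nearest_centroid(x, y, xTest = rbind(c(-2, -2), c(2, 2)))$class
```

Whether the package expects exactly this return shape should be checked against the Other.classifier source before wiring such a function in.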
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
The 100 by 100 rotation matrix used in Model 2 in Cannings and Samworth (2017).
data(R)
A 100 by 100 rotation matrix
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
data(R)
head(R%*%t(R))
Chooses the best empirical value of the cutoff alpha, based on the leave-one-out, resubstitution or sample-split estimates of the class labels.
RPalpha(RP.out, Y, p1)
RP.out |
The result of a call to |
Y |
Vector of length |
p1 |
(Empirical) prior probability |
See Cannings and Samworth (2017, Section 5.1) for precise details.
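To make the idea of an empirical cutoff concrete, here is a minimal sketch that picks the threshold by grid search, minimising a prior-weighted empirical error of the thresholded votes. The helper choose_alpha, the grid, and the exact form of the criterion are assumptions for illustration; this is one plausible reading of the approach, not the RPalpha implementation.

```r
# Hypothetical helper (not the package's RPalpha).
# votes: fraction of class 1 votes for each training observation
# y:     true labels (1 or 2)
# p1:    estimate of the class 1 prior probability
choose_alpha <- function(votes, y, p1, grid = seq(0, 1, by = 0.01)) {
  risk <- vapply(grid, function(a) {
    err1 <- mean(votes[y == 1] < a)   # class 1 points assigned to class 2
    err2 <- mean(votes[y == 2] >= a)  # class 2 points assigned to class 1
    p1 * err1 + (1 - p1) * err2       # prior-weighted empirical error
  }, numeric(1))
  grid[which.min(risk)]               # first grid point attaining the minimum
}

# Toy usage: class 1 observations receive high vote fractions
votes <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
y     <- c(1, 1, 1, 2, 2, 2)
alpha_hat <- choose_alpha(votes, y, p1 = 0.5)
```

In this toy example any threshold strictly between 0.4 and 0.7 separates the two classes perfectly, so the selected value falls in that interval.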
alpha |
The value of |
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
Train <- RPModel(1, 50, 100, 0.5)
Test <- RPModel(1, 100, 100, 0.5)
Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2,
  B1 = 10, B2 = 10, base = "LDA", projmethod = "Haar", estmethod = "training",
  cores = 1)
alpha <- RPalpha(RP.out = Out, Y = Train$y, p1 = sum(Train$y == 1)/length(Train$y))
alpha
Chooses the best projection from a set of size B2 based on a test error estimate, then classifies the training and test sets using the chosen projection.
RPChoose(XTrain, YTrain, XTest, d, B2 = 10, base = "LDA", k = c(3,5), projmethod = "Haar", estmethod = "training", ...)
XTrain |
An |
YTrain |
A vector of length |
XTest |
An |
d |
The lower dimension of the image space of the projections |
B2 |
The block size |
base |
The base classifier one of |
k |
The options for |
projmethod |
Either |
estmethod |
Method for estimating the test errors to choose the projection: either training error |
... |
Optional further arguments if |
Randomly projects the data B2 times. Chooses the projection yielding the smallest estimate of the test error. Classifies the training set (via the same method as estmethod) and test set using the chosen projection.
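The selection step can be sketched in base R as follows. This is an illustrative stand-in, not the RPChoose source: the function names centroid_predict and choose_projection are hypothetical, the base classifier is a simple nearest-centroid rule, projections are Gaussian, and the error estimate is the resubstitution (training) error.

```r
# Hypothetical nearest-centroid rule used as the base classifier
centroid_predict <- function(train_x, train_y, new_x) {
  m1 <- colMeans(train_x[train_y == 1, , drop = FALSE])
  m2 <- colMeans(train_x[train_y == 2, , drop = FALSE])
  d1 <- rowSums(sweep(new_x, 2, m1)^2)
  d2 <- rowSums(sweep(new_x, 2, m2)^2)
  ifelse(d1 <= d2, 1, 2)
}

# Hypothetical sketch of choosing the best of B2 random projections
choose_projection <- function(XTrain, YTrain, XTest, d, B2 = 10) {
  p <- ncol(XTrain)
  best <- list(err = Inf)
  for (b in seq_len(B2)) {
    A <- matrix(rnorm(p * d), p, d)      # Gaussian projection, p by d
    ZTrain <- XTrain %*% A               # projected training data
    fitted <- centroid_predict(ZTrain, YTrain, ZTrain)
    err <- mean(fitted != YTrain)        # resubstitution error estimate
    if (err < best$err) best <- list(err = err, A = A, fitted = fitted)
  }
  # Classify the test set with the retained projection
  test_class <- centroid_predict(XTrain %*% best$A, YTrain, XTest %*% best$A)
  c(best$fitted, test_class)             # length n + n.test
}

# Toy usage: two classes with shifted means in 20 dimensions
set.seed(1)
XTrain <- rbind(matrix(rnorm(30 * 20, -1), 30, 20),
                matrix(rnorm(30 * 20, 1), 30, 20))
YTrain <- rep(1:2, each = 30)
XTest  <- rbind(matrix(rnorm(10 * 20, -1), 10, 20),
                matrix(rnorm(10 * 20, 1), 10, 20))
out <- choose_projection(XTrain, YTrain, XTest, d = 2, B2 = 5)
```

The returned vector mirrors the layout described for RPChoose: training-set classifications first, then test-set classifications.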
Returns a vector of length n + n.test: the first n entries are the estimated classes of the training set, the last n.test are the estimated classes of the test set.
The resubstitution method is unsuitable for the k-nearest neighbour classifier.
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
RPParallel, RPChooseSS, lda, qda, knn
set.seed(100)
Train <- RPModel(1, 50, 100, 0.5)
Test <- RPModel(1, 100, 100, 0.5)
Choose.out5 <- RPChoose(XTrain = Train$x, YTrain = Train$y, XTest = Test$x,
  d = 2, B2 = 5, base = "QDA", projmethod = "Haar", estmethod = "loo")
Choose.out10 <- RPChoose(XTrain = Train$x, YTrain = Train$y, XTest = Test$x,
  d = 2, B2 = 10, base = "QDA", projmethod = "Haar", estmethod = "loo")
sum(Choose.out5[1:50] != Train$y)
sum(Choose.out10[1:50] != Train$y)
sum(Choose.out5[51:150] != Test$y)
sum(Choose.out10[51:150] != Test$y)
RPChoose
Chooses the best projection based on an estimate of the test error of the classifier with training data (XTrain, YTrain); the estimation method counts the number of errors made on the validation set (XVal, YVal).
RPChooseSS(XTrain, YTrain, XVal, YVal, XTest, d, B2 = 100, base = "LDA", k = c(3, 5), projmethod = "Haar", ...)
XTrain |
An |
YTrain |
A vector of length |
XVal |
An |
YVal |
A vector of length |
XTest |
An |
d |
The lower dimension of the image space of the projections |
B2 |
The block size |
base |
The base classifier one of |
k |
The options for |
projmethod |
Either |
... |
Optional further arguments if |
Maps the data using B2 random projections. For each projection the validation set is classified using the training set, and the projection yielding the smallest number of errors over the validation set is retained. The validation set and test set are then classified using the chosen projection.
Returns a vector of length n.val + n.test: the first n.val entries are the estimated classes of the validation set, the last n.test are the estimated classes of the test set.
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
RPParallel, RPChoose, lda, qda, knn
set.seed(100)
Train <- RPModel(1, 50, 100, 0.5)
Validate <- RPModel(1, 50, 100, 0.5)
Test <- RPModel(1, 100, 100, 0.5)
Choose.out5 <- RPChooseSS(XTrain = Train$x, YTrain = Train$y, XVal = Validate$x,
  YVal = Validate$y, XTest = Test$x, d = 2, B2 = 5, base = "QDA",
  projmethod = "Haar")
Choose.out10 <- RPChooseSS(XTrain = Train$x, YTrain = Train$y, XVal = Validate$x,
  YVal = Validate$y, XTest = Test$x, d = 2, B2 = 10, base = "QDA",
  projmethod = "Haar")
sum(Choose.out5[1:50] != Validate$y)
sum(Choose.out10[1:50] != Validate$y)
sum(Choose.out5[51:150] != Test$y)
sum(Choose.out10[51:150] != Test$y)
Performs a biased majority vote over B1 base classifications to assign the test set.
RPEnsembleClass(RP.out, n , n.val, n.test, p1, samplesplit, alpha, ...)
RP.out |
The result of a call to |
n |
Training set sample size |
n.test |
Test set sample size |
n.val |
Validation set sample size |
p1 |
Prior probability estimate |
samplesplit |
|
alpha |
The voting threshold |
... |
Optional further arguments if |
An observation in the test set is assigned to class 1 if B1*alpha or more of the base classifications are class 1 (otherwise class 2).
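The voting rule above can be written in a few lines of base R. This is a minimal sketch: the function ensemble_vote and the matrix test_votes (one row per test observation, one column per base classification) are hypothetical names used only for this example.

```r
# Hypothetical sketch of the biased majority vote (not the package source).
# test_votes: n.test by B1 matrix of base classifications coded 1 or 2
# alpha:      voting threshold in [0, 1]
ensemble_vote <- function(test_votes, alpha) {
  frac_class1 <- rowMeans(test_votes == 1)  # share of class 1 votes per row
  # frac_class1 >= alpha is the same as (number of class 1 votes) >= B1 * alpha
  ifelse(frac_class1 >= alpha, 1, 2)
}

# Toy usage with B1 = 4 base classifications for 3 test observations
test_votes <- rbind(c(1, 1, 1, 2),
                    c(1, 2, 2, 2),
                    c(1, 1, 2, 2))
pred_half  <- ensemble_vote(test_votes, alpha = 0.5)  # 1 2 1
pred_sixty <- ensemble_vote(test_votes, alpha = 0.6)  # 1 2 2
```

Raising alpha makes class 1 harder to assign: the third observation, with half of its votes for class 1, flips from class 1 to class 2.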
A vector of length n.test containing the class predictions of the test set (either 1 or 2).
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
Train <- RPModel(1, 50, 100, 0.5)
Test <- RPModel(1, 100, 100, 0.5)
Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2,
  B1 = 50, B2 = 10, base = "LDA", projmethod = "Haar", estmethod = "training",
  clustertype = "Default")
#estimate the class 1 prior probability once and reuse it
phat <- sum(Train$y == 1)/length(Train$y)
Class <- RPEnsembleClass(RP.out = Out, n = length(Train$y), n.test = nrow(Test$x),
  p1 = phat, samplesplit = FALSE,
  alpha = RPalpha(Out, Y = Train$y, p1 = phat))
mean(Class != Test$y)
Generates B2 random p by d matrices according to Haar measure, Gaussian or axis-aligned projections
RPGenerate(p = 100, d = 10, method = "Haar", B2 = 10)
p |
The original data dimension |
d |
The lower dimension |
method |
Projection distribution, either |
B2 |
the number of projections |
Returns B2 p by d random matrices as a single p by d*B2 matrix
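For readers who want to generate projections from a modified distribution, the sketch below shows one standard way to draw matrices with orthonormal columns: take the QR decomposition of a Gaussian matrix and fix the column signs. The function generate_haar is a hypothetical name, and the construction is a textbook approach assumed here for illustration rather than copied from RPGenerate.

```r
# Hypothetical sketch (not the package's RPGenerate): B2 random p by d
# matrices with orthonormal columns, bound side by side.
generate_haar <- function(p, d, B2) {
  blocks <- lapply(seq_len(B2), function(b) {
    Z   <- matrix(rnorm(p * d), p, d)  # Gaussian matrix
    qrZ <- qr(Z)
    Q   <- qr.Q(qrZ)                   # p by d, orthonormal columns
    # Sign correction so the result does not depend on the QR sign convention
    Q %*% diag(sign(diag(qr.R(qrZ))), d)
  })
  do.call(cbind, blocks)               # single p by d*B2 matrix
}

# Toy usage: three 20 by 2 projections stored as one 20 by 6 matrix
set.seed(1)
P <- generate_haar(p = 20, d = 2, B2 = 3)
first_block <- P[, 1:2]
gram <- crossprod(first_block)         # should be close to the 2 by 2 identity
```

Each consecutive group of d columns is one projection, matching the p by d*B2 layout described above.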
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
R1 <- RPGenerate(p = 20, d = 2, "Haar", B2 = 3)
t(R1)%*%R1
R2 <- RPGenerate(p = 20, d = 2, "Gaussian", B2 = 3)
t(R2)%*%R2
R3 <- RPGenerate(p = 20, d = 2, "axis", B2 = 3)
colSums(R3)
rowSums(R3)
(x,y) from joint distribution
Generates data from the models described in Cannings and Samworth (2017)
RPModel(Model.No, n, p, Pi = 1/2)
Model.No |
Model Number |
n |
Sample size |
p |
Data dimension |
Pi |
Class one prior probability |
x |
An |
y |
A vector of length |
Models 1 and 2 require p = 100 or 1000.
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
Data <- RPModel(Model.No = 1, 100, 100, Pi = 1/2)
table(Data$y)
colMeans(Data$x[Data$y==1,])
colMeans(Data$x[Data$y==2,])
Makes B1 calls to RPChoose or RPChooseSS in parallel and returns the results as a matrix.
RPParallel(XTrain, YTrain, XVal, YVal, XTest, d, B1 = 500, B2 = 50, base = "LDA", projmethod = "Gaussian", estmethod = "training", k = c(3,5,9), clustertype = "Default", cores = 1, machines = NULL, seed = 1, ...)
XTrain |
An |
YTrain |
A vector of length |
XVal |
An |
YVal |
A vector of length |
XTest |
An |
d |
The lower dimension of the image space of the projections |
B1 |
The number of blocks |
B2 |
The size of each block |
base |
The base classifier one of |
k |
The options for |
projmethod |
|
estmethod |
Method for estimating the test errors to choose the projection: either training error |
clustertype |
The type of cluster: |
cores |
Required only if |
machines |
Required only if |
seed |
If not |
... |
Optional further arguments if |
Makes B1 calls to RPChoose or RPChooseSS in parallel.
If estmethod == "training" or "loo", then returns an n+n.test by B1 matrix, each column containing the result of a call to RPChoose. If estmethod == "samplesplit", then returns an n.val+n.test by B1 matrix, each column containing the result of a call to RPChooseSS.
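The layout of the returned matrix can be illustrated with a mock object of the same shape. In this sketch the matrix Out is filled with random labels purely to show the indexing; the variable names and the stacking order (training or validation rows first, then test rows) are assumptions based on the description above, and no call to RPParallel is made.

```r
# Mock output with the shape described above (illustration only):
# (n + n.test) rows, one column per block
n <- 4
n.test <- 3
B1 <- 5
set.seed(1)
Out <- matrix(sample(1:2, (n + n.test) * B1, replace = TRUE), n + n.test, B1)

# Assumed layout: first n rows are training classifications,
# last n.test rows are test classifications
train_part <- Out[seq_len(n), , drop = FALSE]
test_part  <- Out[n + seq_len(n.test), , drop = FALSE]

# Fraction of the B1 base classifiers voting for class 1, per observation
train_vote_frac <- rowMeans(train_part == 1)
test_vote_frac  <- rowMeans(test_part == 1)
```

The training-set vote fractions are the kind of quantity a threshold such as alpha would be tuned on, and the test-set fractions are what the final vote is applied to.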
Timothy I. Cannings and Richard J. Samworth
Cannings, T. I. and Samworth, R. J. (2017) Random-projection ensemble classification, J. Roy. Statist. Soc., Ser. B. (with discussion), 79, 959–1035
Train <- RPModel(1, 50, 100, 0.5)
Test <- RPModel(1, 100, 100, 0.5)
Out <- RPParallel(XTrain = Train$x, YTrain = Train$y, XTest = Test$x, d = 2,
  B1 = 10, B2 = 10, base = "LDA", projmethod = "Haar", estmethod = "training")
colMeans(Out)