time.1 <- Sys.time()
format(time.1, "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-102701"
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-102701"
load("plankton-train-wndchrm-features.Rdata", verbose=TRUE)
## Loading objects:
## feature.matrix
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-102705"
feature.matrix is a matrix of doubles for storage efficiency – much better than a data.frame.
class(feature.matrix)
## [1] "matrix"
typeof(feature.matrix)
## [1] "double"
Create z values:
\[z = \frac{x - \mu_x}{\sigma_x}\]
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-102705"
plankton <- feature.matrix[,1] # plankton class
scaledTrain <- scale(feature.matrix[, -1]) # exclude class variable in first column
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-102727"
dim(scaledTrain)
## [1] 30336 2894
save(plankton, scaledTrain, file="plankton-train-wndchrm-scaled.Rdata")
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-102752"
svdTrain <- svd(scaledTrain)
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-104021"
Let’s inspect sizes of the three SVD components, u, d, and v.
(u and v are unitary rotation matrices and d is the diagonal of a scaling matrix.)
Note: Original matrix can be constructed from SVD decomposition:
svdTrain$u %*% diag(svdTrain$d) %*% t(svdTrain$v)
dim(svdTrain$u)
## [1] 30336 2894
length(svdTrain$d)
## [1] 2894
dim(svdTrain$v)
## [1] 2894 2894
save(svdTrain, file="plankton-train-wndchrm-svd.Rdata")
Let’s compute PCs separately with prcomp – but PCA and SVD results are related.
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-104056"
pcaTrain <- prcomp(scaledTrain) # If not already scaled, use scale.=TRUE
format(Sys.time(), "%Y-%m-%d-%H%M%S")
## [1] "2015-02-16-105728"
save(pcaTrain, file="plankton-train-wndchrm-pca.Rdata")
Let’s look at first 75 features associated with largest 75 eigenvalues …
Neigen <- 75
plot(svdTrain$d[1:Neigen]^2 / sum(svdTrain$d^2),pch=19,
main=paste("Features associated with first", Neigen, "eigenvalues"),
xlab="Eigenvalue rank",
ylab="Fraction of variance explained by feature")
grid()
mtext(paste0(nrow(scaledTrain), " rows, ", ncol(scaledTrain), " variables"))
plot(cumsum(svdTrain$d[1:Neigen]^2 / sum(svdTrain$d^2)),pch=19,
main=paste("First", Neigen, "eigenvalues"),
xlab="Eigenvalue rank",
ylab="Cumulative fraction of variance explained")
grid()
The plot above shows that first 75 features explain about 65% of the variance.
Features associated with the first Neigen eigenvalues explain large share of variance.
boxplot(svdTrain$v[,1:Neigen],
xlab="Eigenvalue rank",
main="Boxplots of right singular vector weights")
mtext(paste("First", Neigen, "right singular vectors"))
par(mfrow=c(2,2))
smoothScatter(svdTrain$v[,1], main="Right singular vector v1")
smoothScatter(svdTrain$v[,2], main="Right singular vector v2")
smoothScatter(svdTrain$v[,3], main="Right singular vector v3")
par(mfrow=c(2,2))
smoothScatter(svdTrain$v[,1], svdTrain$v[,2])
smoothScatter(svdTrain$v[,1], svdTrain$v[,3])
smoothScatter(svdTrain$v[,2], svdTrain$v[,3])
Show PCA results are the same as right singular vectors when SVD is done with scaled data.
par(mfrow=c(2,2))
smoothScatter(pcaTrain$rotation[,1], pcaTrain$rotation[,2])
smoothScatter(pcaTrain$rotation[,1], pcaTrain$rotation[,3])
smoothScatter(pcaTrain$rotation[,2], pcaTrain$rotation[,3])
The first column of feature.matrix gives the index of the 121 plankton classes in the training data. This column is converted to a factor for displaying the plankton in different colors – but with so many classes, they don’t all have unique colors.
library(rgl)
library(pca3d)
pca3d(pcaTrain, group=as.factor(feature.matrix[,1]))
##
## Legend:
## ----------------------------------------
## group: color, shape
## ----------------------------------------
## 1: red, tetrahaedron
## 2: green3, cube
## 3: blue, sphere
## 4: cyan, tetrahaedron
## 5: magenta, cube
## 6: yellow, sphere
## 7: gray, tetrahaedron
## 8: black, cube
## 9: red, sphere
## 10: green3, tetrahaedron
## 11: blue, cube
## 12: cyan, sphere
## 13: magenta, tetrahaedron
## 14: yellow, cube
## 15: gray, sphere
## 16: black, tetrahaedron
## 17: red, cube
## 18: green3, sphere
## 19: blue, tetrahaedron
## 20: cyan, cube
## 21: magenta, sphere
## 22: yellow, tetrahaedron
## 23: gray, cube
## 24: black, sphere
## 25: red, tetrahaedron
## 26: green3, cube
## 27: blue, sphere
## 28: cyan, tetrahaedron
## 29: magenta, cube
## 30: yellow, sphere
## 31: gray, tetrahaedron
## 32: black, cube
## 33: red, sphere
## 34: green3, tetrahaedron
## 35: blue, cube
## 36: cyan, sphere
## 37: magenta, tetrahaedron
## 38: yellow, cube
## 39: gray, sphere
## 40: black, tetrahaedron
## 41: red, cube
## 42: green3, sphere
## 43: blue, tetrahaedron
## 44: cyan, cube
## 45: magenta, sphere
## 46: yellow, tetrahaedron
## 47: gray, cube
## 48: black, sphere
## 49: red, tetrahaedron
## 50: green3, cube
## 51: blue, sphere
## 52: cyan, tetrahaedron
## 53: magenta, cube
## 54: yellow, sphere
## 55: gray, tetrahaedron
## 56: black, cube
## 57: red, sphere
## 58: green3, tetrahaedron
## 59: blue, cube
## 60: cyan, sphere
## 61: magenta, tetrahaedron
## 62: yellow, cube
## 63: gray, sphere
## 64: black, tetrahaedron
## 65: red, cube
## 66: green3, sphere
## 67: blue, tetrahaedron
## 68: cyan, cube
## 69: magenta, sphere
## 70: yellow, tetrahaedron
## 71: gray, cube
## 72: black, sphere
## 73: red, tetrahaedron
## 74: green3, cube
## 75: blue, sphere
## 76: cyan, tetrahaedron
## 77: magenta, cube
## 78: yellow, sphere
## 79: gray, tetrahaedron
## 80: black, cube
## 81: red, sphere
## 82: green3, tetrahaedron
## 83: blue, cube
## 84: cyan, sphere
## 85: magenta, tetrahaedron
## 86: yellow, cube
## 87: gray, sphere
## 88: black, tetrahaedron
## 89: red, cube
## 90: green3, sphere
## 91: blue, tetrahaedron
## 92: cyan, cube
## 93: magenta, sphere
## 94: yellow, tetrahaedron
## 95: gray, cube
## 96: black, sphere
## 97: red, tetrahaedron
## 98: green3, cube
## 99: blue, sphere
## 100: cyan, tetrahaedron
## 101: magenta, cube
## 102: yellow, sphere
## 103: gray, tetrahaedron
## 104: black, cube
## 105: red, sphere
## 106: green3, tetrahaedron
## 107: blue, cube
## 108: cyan, sphere
## 109: magenta, tetrahaedron
## 110: yellow, cube
## 111: gray, sphere
## 112: black, tetrahaedron
## 113: red, cube
## 114: green3, sphere
## 115: blue, tetrahaedron
## 116: cyan, cube
## 117: magenta, sphere
## 118: yellow, tetrahaedron
## 119: gray, cube
## 120: black, sphere
## 121: red, tetrahaedron
##
## Creating new device
There’s a fairly long delay (a minute or two) before the interactive display appears.
Shown below are three snapshots from interactive views of the first three principal components:
time.2 <- Sys.time()
cat(sprintf("%.1f", as.numeric(difftime(time.2, time.1, units="secs"))), " secs\n")
## 1928.6 secs
efg @EarlGlynn
2015-02-16 1059