Recently I wrote a couple of posts about eigenvectors and eigenvalues. I thought it would be cool to go from that slightly more theoretical material to something super useful in which eigenvectors and eigenvalues are an integral part. So today, I’m going to talk about Principal Component Analysis (PCA). This algorithm is used everywhere, notably in neuroscience and computer graphics.

The idea is that you have a dataset with a ton of features and you want to reduce it to its core components. With high dimensionality, not only is the curse of dimensionality a problem, but you also just can’t visualize the data, which prevents a lot of basic insights. So we want to reduce the dimensionality without losing vital information. This is where PCA comes in: we go from a large number of features to a small number of components, which still contain a sufficient proportion of the information.

Before we talk about the algorithm itself, there are two math concepts you must be familiar with in order to proceed. The first is eigenvalues and eigenvectors. You can read about them here. The second is the covariance matrix.

Covariance Matrix

Covariance is a measure of how two variables change together. The covariance between two variables, X and Y, is given by the following formula:

$$cov(X, Y) = \frac{\sum\limits_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

Now, if we want to look at all the possible covariances in a dataset, we can compute the covariance matrix, which has this form:

$$C = \left( \begin{array}{ccc} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{array} \right)$$

Notice that this matrix is symmetric \((A = A^T)\), and its diagonal consists of variances, because \(cov(x, x)\) is the same thing as the variance of x. If you understand the covariance matrix and eigenvalues/eigenvectors, you’re ready to learn about PCA.
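The covariance formula maps directly to a few lines of NumPy. Here is a minimal sketch (the variable names are my own, not from the original post) showing the centered-sum computation and checking it against NumPy's built-in `np.cov`:

```python
import numpy as np

# three variables (x, y, z), five observations each
X = np.array([[2.1, 2.4, 3.6],
              [0.5, 0.7, 1.2],
              [1.8, 2.9, 3.1],
              [3.0, 3.3, 4.8],
              [1.1, 1.5, 2.0]])

n = X.shape[0]
centered = X - X.mean(axis=0)

# the (n - 1) denominator matches the sample-covariance formula above
C = centered.T @ centered / (n - 1)

# np.cov treats rows as variables by default, hence rowvar=False
assert np.allclose(C, np.cov(X, rowvar=False))

print(C)            # 3x3 and symmetric
print(np.diag(C))   # variances of x, y, z on the diagonal
```

Note that `rowvar=False` is needed because `np.cov` assumes each *row* is a variable by default, whereas datasets usually put variables in columns.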
Principal Component Analysis

Here is the central idea of PCA – the components are the eigenvectors of the covariance matrix. Moreover, the amount of variance each component captures is given by the magnitude of its corresponding eigenvalue. Read that multiple times. So PCA can be expressed in 4 pretty simple steps:

1. Normalize the data. We’re dealing with covariance, so it’s a good idea to have features on the same scale.
2. Calculate the covariance matrix.
3. Find the eigenvectors and eigenvalues of the covariance matrix.
4. Translate the data to…
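The four steps above can be sketched in a few lines of NumPy. This is my own minimal illustration, not the original post's code; the excerpt cuts off at step 4, which I've assumed here to be the standard projection of the data onto the top eigenvectors:

```python
import numpy as np

def pca(X, n_components=2):
    """A minimal sketch of the four steps described above."""
    # 1. normalize the data (center, and scale to unit variance)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. calculate the covariance matrix
    C = np.cov(X, rowvar=False)
    # 3. find the eigenvectors/eigenvalues (eigh, since C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # sort by descending eigenvalue: largest-variance component first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. project the data onto the top components
    components = eigvecs[:, :n_components]
    return X @ components, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
scores, eigvals = pca(X, n_components=2)
print(scores.shape)             # (100, 2)
print(eigvals / eigvals.sum())  # proportion of variance per component
```

The eigenvalue ratios at the end show exactly the claim above: each component's share of the total variance is its eigenvalue divided by the sum of all eigenvalues.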

Read More

## Linear Algebra in Julia

Most people (including myself) are drawn to Julia by its lofty goals. Speed of C, statistical packages of R, and ease of Python? It sounds too good to be true. However, I haven’t seen anyone who has looked into it say the developers behind the language aren’t on track to accomplish these goals. Having only been around since 2012, Julia’s greatest disadvantage is a lack of community support. If you have an obscure Julia question and you google it, you probably won’t find the answer, whereas with Python or R or Java you would. This also means less package support. The packages for linear algebra, plotting, and other staples are there, but if you want to do computer vision or NLP, you’d be among the few. However, it is definitely worth looking into. I’m not quite a pro, but I’m getting to the point where if I code something in Python, I can easily transfer it to Julia. Then sometimes I test the speed for fun. Julia always wins.

Recently, I wrote an article about linear algebra, with accompanying code in Python. Below is basically the same article, with the code in Julia. If you need help with the basic syntax, I also wrote a basic syntax guide, kind of a compressed version of the documentation. I included the concept descriptions, but they’re no different from my original article. The point of this is just to show how easy it is to do linear algebra in Julia. Here is the iPython notebook on my github. (You can write Julia code in iPython…it’s awesome.)

The Basics

matrix – a rectangular array of values
vector – a one-dimensional matrix
identity matrix I – an n x n diagonal matrix with ones on the diagonal from the top left to the bottom right, i.e. [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]

When a matrix A is multiplied by its inverse A^-1, the result is the identity matrix I. Only square matrices have inverses. Example below. Note – the inverse of a matrix is not the transpose.

Matrices are notated m x n, or rows x columns. A 2×3 matrix has 2 rows and 3 columns. Read that multiple times. You can only add matrices of the same dimensions. You can only multiply two matrices if the first is m x n and the second is n x p – the n-dimension has to match. Now the…
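The post's own examples are in Julia, which this excerpt cuts off before reaching; as a stand-in, here is a quick NumPy sketch (my own, purely illustrative) of the rules just stated – dimension matching, the inverse producing I, and the inverse not being the transpose:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # 2x3: 2 rows, 3 columns
B = np.ones((3, 4))               # 3x4

# (2x3)(3x4) -> 2x4: the inner n-dimension has to match
print((A @ B).shape)              # (2, 4)

# only square matrices have inverses, and M @ inv(M) = I
M = np.array([[4., 7.],
              [2., 6.]])
I = M @ np.linalg.inv(M)
print(np.allclose(I, np.eye(2)))  # True

# the inverse is not the transpose
print(np.allclose(np.linalg.inv(M), M.T))  # False
```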

Read More

## Random Forest in scikit-learn

Random Forest is used for classification, regression, and feature selection. It is an ensemble technique, meaning it combines the output of many weaker learners in order to get a stronger result. The weaker learner in this case is a decision tree. Decision trees work by splitting and re-splitting the data by features. If a decision tree is split along good features, it can give a decent predictive output.

Random Forest works by averaging decision tree output, but it’s a bit more complicated than that. It also ranks an individual tree’s output by comparing it to the known output from the training data. This allows it to rank features: some of the decision trees will perform better, and so the features within those trees will be deemed more important. Classification and regression are the actual output of the model. A good RF (meaning one that generalizes well) will have higher accuracy from each individual tree, and higher diversity among its trees. One downfall of random forest is that it can fail with higher-dimensional data, because the trees will often be split by less relevant features. If you’re still intrigued by random forest, I encourage you to research more on your own! It gets a lot more mathematical.

Now, let’s implement one in Python. We will be using the famous Iris dataset, collected in the 1930s by Edgar Anderson. In this example, we are going to train a random forest classification algorithm to predict the class in the test data. I’ve split the data so each class is represented correctly. Note – you must have scikit-learn, pandas, NumPy, and SciPy installed for this tutorial. You can install them all easily using pip (‘pip install scipy’). You can also download Anaconda. The data for this tutorial is on my github, as well as the iPython notebook.

In [2]:

```python
# First let's import the dataset, using Pandas.
import pandas as pd

# make sure you're in the right directory if using iPython!
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()
```

Out[2]:

|   | class          | petal_length | petal_width | sepal_length | sepal_width |
|---|----------------|--------------|-------------|--------------|-------------|
| 0 | Iris-virginica | 5.5          | 1.8         | 6.4          | 3.1         |
| 1 | Iris-virginica | 5.9          | 2.3         | 6.8          | 3.2         |
| 2 | Iris-virginica | 5.4          | 2.3         | 6.2          | 3.4         |
| 3 | Iris-virginica | 4.8          | 1.8         | 6.0          | 3.0         |
| 4 | Iris-virginica | 5.1          | 2.3         | 6.9          | 3.1         |

In [3]:

```python
# if you don't have the packages installed for this,
# you will get an error.
from sklearn.ensemble import RandomForestClassifier

# the data have to be in a numpy array in order for
# the random forest algorithm to accept it!
cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
colsRes = ['class']
trainArr = train[cols].values               # training array
trainRes = train[colsRes].values.ravel()    # training results

## Training!
rf = RandomForestClassifier(n_estimators=100)  # initialize
rf.fit(trainArr, trainRes)  # fit the data to the algorithm
# note - ravel() flattens the labels into a 1-D array; without it,
# scikit-learn warns about being passed a column vector.
```

Out[3]:

```python
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
```

In [5]:

```python
## Testing!
# put the test data in the same format!
testArr = test[cols].values

results = rf.predict(testArr)

# something I like to do is to add it back to the dataframe,
# so I can compare side-by-side
test['predictions'] = results
test.head()
```

Out[5]:

|   | class          | petal_length | petal_width | sepal_length | sepal_width | predictions    |
|---|----------------|--------------|-------------|--------------|-------------|----------------|
| 0 | Iris-virginica | 6.6          | 2.1         | 7.6          | 3.0         | Iris-virginica |
| 1 | Iris-virginica | 6.3          | 1.8         | 7.3          | 2.9         | Iris-virginica |
| 2 | Iris-virginica | 5.5          | 2.1         | 6.8          | 3.0         | Iris-virginica |
| 3 | Iris-virginica | 5.1          | 2.4         | 5.8          | 2.8         | Iris-virginica |
| 4 | Iris-virginica | 5.3          | 2.3         | 6.4          | 3.2         | Iris-virginica |

With this dataset, the random forest algorithm predicted class perfectly. That is unlikely…
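The post mentioned earlier that random forests can rank features; in scikit-learn that ranking is exposed as the `feature_importances_` attribute after fitting. The tutorial above uses the author's own train/test CSVs, which aren't reproduced here, so this sketch instead uses the copy of the Iris data bundled with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(iris.data, iris.target)

# importances are normalized to sum to 1; for Iris, the petal
# measurements typically dominate
for name, imp in sorted(zip(iris.feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```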

Read More

## Basic Data Exploration in R

When you’re cleaning up data, you usually end up using 5-8 functions a ton of times, and then a few more once or twice. Here are those 5-8 functions I find myself using again and again. A quick overview:

names() – returns the column names of a dataset
str() – gives an overview of a dataset’s structure
data.table package – includes functions for creating new columns, among other things
%in% operator – checks if a value is in a vector

Below are some examples.

```r
names(rock)  # returns the column names
[1] "area"  "peri"  "shape" "perm"

str(rock)  # gives the format of the dataframe
'data.frame':   48 obs. of  4 variables:
 $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
 $ peri : num  2792 3893 3931 3869 3949 ...
 $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
 $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...

# import the data.table package
install.packages("data.table")  # don't forget these 3 steps!
library(data.table)
dtRock <- data.table(rock)

dtRock[1:5]  # returns the first 5 rows
   area    peri     shape perm
1: 4990 2791.90 0.0903296  6.3
2: 7002 3892.60 0.1486220  6.3
3: 7558 3930.66 0.1833120  6.3
4: 7352 3869.32 0.1170630  6.3
5: 7943 3948.54 0.1224170 17.1

# and my favorite way to create a new column
dtRock[, areaMP := area / 1000]  # area is measured in pixels,
                                 # so areaMP is in megapixels
dtRock[1, ]  # the first row, all columns
   area   peri     shape perm areaMP
1: 4990 2791.9 0.0903296  6.3   4.99

dtRock[, 'areaMP']  # returns the entire 'areaMP' column

# The %in% operator is one of the most useful functions in R, I think.
a <- c(1, 2, 3, 4)
4 %in% a  # is the value 4 in the vector a?
[1] TRUE
```

There are many other functions and packages, such as the ‘dplyr’ package by the amazing Hadley Wickham, but I am just showing the ones I use most frequently.
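As an aside that is not part of the original R examples: since the other posts on this blog lean on Python, here is my own rough mapping of the same exploration idioms to pandas, for readers coming from that side:

```python
import pandas as pd

# a tiny stand-in for the 'rock' dataframe above
df = pd.DataFrame({"area": [4990, 7002, 7558],
                   "peri": [2791.9, 3892.6, 3930.66]})

print(list(df.columns))            # names(rock)
df.info()                          # roughly str(rock)
df["areaMP"] = df["area"] / 1000   # dtRock[, areaMP := area / 1000]
print(df.head())
print(4990 in df["area"].values)   # 4 %in% a
```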

Read More