I just wrapped up Google Summer of Code for the last time (at least as a student). Most of the time it was playing catch-up: learning while doing and hacking till it works. This surely takes more time than formal learning, but it is definitely more fun.
Most of the work involves MLlib and is minor considering the development going on at breakneck speed, but it feels good to have made a few changes. This blog post is meant to be a short write-up of the work accomplished during the summer and NOT a technical description of it.
- Python API for streaming algorithms
Online algorithms like k-means and linear and logistic regression (with gradient descent) cater to streaming data: the model state (e.g. weights or centroids) is stored and updated as new data arrives. The API is very similar to the partial_fit method in scikit-learn, except that the input data in sklearn is array-like while in MLlib it comes in as DStreams. These are the Pull Requests.
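As a toy illustration of the idea (the names here are made up for the sketch, not the actual MLlib API): the model keeps its weights between batches and takes a gradient step on each example of every incoming batch, much like repeated calls to partial_fit.

```python
# Minimal sketch of a streaming regression model: weights persist across
# batches, and each batch nudges them via gradient descent.

class StreamingLinearModel:
    def __init__(self, num_features, step=0.1):
        self.weights = [0.0] * num_features
        self.step = step

    def _predict_one(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, batch):
        """Fit the stored weights on one incoming batch of (x, y) pairs."""
        for x, y in batch:
            error = self._predict_one(x) - y
            for j, xj in enumerate(x):
                self.weights[j] -= self.step * error * xj

    def predict(self, x):
        return self._predict_one(x)

model = StreamingLinearModel(num_features=1, step=0.1)
# "Stream" two batches drawn from y = 2x; the weights carry over between them.
for batch in [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0), ([1.0], 2.0)]]:
    model.update(batch)
```

In MLlib the `update` step runs inside a `foreachRDD`-style callback on the DStream rather than a plain loop.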
- Model save / load
Various models in MLlib (both the Scala backend and the Python API) lacked utilities for saving and loading. On save, the metadata is stored in JSON format while the data is stored in Parquet format, which can be read back as a DataFrame on load. Most of the hard work had already been done by Joseph Bradley in the initial PR, and mine was just follow-up work.
https://github.com/apache/spark/pull/7617 (Python save /load for GMM’s)
https://github.com/apache/spark/pull/7587 (Python save / load for LDA)
https://github.com/apache/spark/pull/4986 (Scala save / load for GMM’s)
https://github.com/apache/spark/pull/6948 (Scala save / load for LDA)
https://github.com/apache/spark/pull/5291 (Save / load for Word2Vec )
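A minimal sketch of the two-part layout these PRs follow: metadata (class name, format version) in JSON alongside the model data. MLlib writes the data part as Parquet via a DataFrame; plain JSON stands in for Parquet here just to show the structure, and all names are illustrative.

```python
import json
import os
import tempfile

def save(path, weights, intercept):
    # Metadata and data live in separate subdirectories, as in MLlib.
    os.makedirs(os.path.join(path, "metadata"), exist_ok=True)
    os.makedirs(os.path.join(path, "data"), exist_ok=True)
    metadata = {"class": "ExampleModel", "version": "1.0"}
    with open(os.path.join(path, "metadata", "part-00000"), "w") as f:
        json.dump(metadata, f)
    with open(os.path.join(path, "data", "part-00000"), "w") as f:
        json.dump({"weights": weights, "intercept": intercept}, f)

def load(path):
    # Check the metadata before trusting the data part.
    with open(os.path.join(path, "metadata", "part-00000")) as f:
        metadata = json.load(f)
    assert metadata["class"] == "ExampleModel"
    with open(os.path.join(path, "data", "part-00000")) as f:
        data = json.load(f)
    return data["weights"], data["intercept"]

path = os.path.join(tempfile.mkdtemp(), "model")
save(path, [0.5, -1.2], 0.1)
weights, intercept = load(path)
```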
- Add API for missing models in Python
I added the Python API for KernelDensity, ElementwiseProduct, and the Kolmogorov–Smirnov test (which measures how confident one can be that the given data is drawn from a given distribution).
https://github.com/apache/spark/pull/6346 (Kernel Density)
https://github.com/apache/spark/pull/6387 (ElementWise Product)
https://github.com/apache/spark/pull/7430 (Kolmogorov-Smirnov test)
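For the Kolmogorov–Smirnov test, the statistic itself is simple enough to sketch in a few lines: the largest gap between the empirical CDF of the sample and the theoretical CDF (here, the uniform CDF on [0, 1] as an example).

```python
# One-sample Kolmogorov-Smirnov statistic: max distance between the
# empirical CDF of the sorted sample and a theoretical CDF.

def ks_statistic(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
close = ks_statistic([0.1, 0.3, 0.5, 0.7, 0.9], uniform_cdf)   # near-uniform
far = ks_statistic([0.81, 0.85, 0.9, 0.95, 0.99], uniform_cdf) # bunched up
```

A small statistic means the sample is consistent with the distribution; a large one means it very likely is not.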
- Distributed linear algebra
I helped review the addition of distributed linear algebra (i.e. RowMatrix, CoordinateMatrix, and IndexedRowMatrix) and added wrappers around PCA and SVD (which have not been merged yet).
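A toy illustration of the trick behind RowMatrix's computeGramianMatrix (the building block for its SVD and PCA): A^T A equals the sum of the outer products of the rows, so each partition can accumulate a partial sum locally and the partials are simply added together. Plain lists stand in for RDDs here.

```python
# Compute the Gramian A^T A as a sum of per-row outer products.

def outer(row):
    return [[a * b for b in row] for a in row]

def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def gramian(rows):
    n = len(rows[0])
    result = [[0.0] * n for _ in range(n)]
    for row in rows:  # in Spark this is an aggregate over the row RDD
        result = add(result, outer(row))
    return result

A = [[1.0, 2.0],
     [3.0, 4.0]]
G = gramian(A)  # equals A^T A
```

Since the Gramian is only n x n (for n columns), it fits on the driver even when the matrix has billions of rows, which is what makes distributed SVD/PCA tractable.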
- Linear Algebra
I helped a bit with optimizing operations involving SparseMatrices, with porting MatrixUDT (which makes sure that matrices can be used directly in PySpark DataFrames), and with pretty printing.
https://github.com/apache/spark/pull/5946 (Optimize dot products and squared_distances)
https://github.com/apache/spark/pull/6904 (Addition of numNonZeros and numActives)
https://github.com/apache/spark/pull/6579 (Making version checking robust)
https://github.com/apache/spark/pull/6354 (Porting MatrixUDT to PySpark)
https://github.com/apache/spark/pull/6342 (Pretty Printing of Matrices)
https://github.com/apache/spark/pull/7854 (Optimize SparseVector initializations)
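As a sketch of the kind of optimization the dot-product PR is about: a sparse vector stored as parallel sorted (indices, values) arrays lets two sparse vectors be dotted by merging their index lists, touching only the nonzero entries instead of the full dimension.

```python
# Dot product of two sparse vectors stored as sorted index/value arrays.

def sparse_dot(ind1, val1, ind2, val2):
    i = j = 0
    total = 0.0
    while i < len(ind1) and j < len(ind2):
        if ind1[i] == ind2[j]:
            total += val1[i] * val2[j]
            i += 1
            j += 1
        elif ind1[i] < ind2[j]:
            i += 1
        else:
            j += 1
    return total

# Two vectors of size 6 with three nonzeros each; only indices 0 and 4 overlap.
d = sparse_dot([0, 2, 4], [1.0, 2.0, 3.0], [0, 3, 4], [4.0, 5.0, 6.0])
```

The merge runs in time proportional to the number of nonzeros, which matters a lot for the high-dimensional, mostly-zero vectors common in MLlib workloads.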
- Implement LogisticRegression summary
Summaries are designed to give the user quick access to how well the model has performed on the training and test data, with as little rewriting of code as possible. Here too the hard work had been done by Feynman Liang, and my goal was to follow a similar API for LogisticRegression.
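An illustrative sketch of the pattern (not the actual Spark API): the fitted model carries a summary object whose properties compute common metrics over the stored predictions, so the user never re-implements them by hand.

```python
# A toy binary-classification summary exposing metrics as properties.

class BinarySummary:
    def __init__(self, labels, predictions):
        self.labels = labels
        self.predictions = predictions

    @property
    def accuracy(self):
        hits = sum(1 for y, p in zip(self.labels, self.predictions) if y == p)
        return hits / len(self.labels)

    @property
    def true_positive_rate(self):
        positives = [p for y, p in zip(self.labels, self.predictions) if y == 1]
        return sum(1 for p in positives if p == 1) / len(positives)

summary = BinarySummary(labels=[1, 0, 1, 1], predictions=[1, 0, 0, 1])
```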
- Add missing methods to models
While the basic Python models for the Pipeline API had already been written, wrappers had to be added so that the model attributes could be accessed. These were fairly straightforward to write.
https://github.com/apache/spark/pull/7263 (Word2Vec Python API)
https://github.com/apache/spark/pull/7095 (Word2Vec ML)
- Spark Infra changes
I wrote a couple of bash scripts (my first couple) to automate the conversion of since javadocs to Since annotations, and to add pylint checks alongside the pep8 checks that are already present.
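The core of the annotation conversion can be shown as a one-line rewrite. The real script had to handle placement and indentation as well; this sketch only shows the text substitution.

```shell
# Turn a scaladoc "@since" tag into an "@Since" annotation.
echo ' * @since 1.4.0' | sed -E 's/^ \* @since ([0-9.]+)$/@Since("\1")/'
```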
https://github.com/apache/spark/pull/7241 (Adding Pylint Checks)
- Miscellaneous bug fixes
I fixed a few bugs while studying the Scala code myself. Some of them are:
https://github.com/apache/spark/pull/6383 (wrong normalization of distributions in KernelDensity)
https://github.com/apache/spark/pull/6720 (remove unnecessary construct in StreamingAlgorithms)
https://github.com/apache/spark/pull/6497 (wrong decayFactor set in StreamingKMeans)
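To see why the decayFactor in StreamingKMeans matters, here is a sketch of its forgetful update rule: old centroids are down-weighted by the decay factor `a` before the new batch is merged in, so a wrong `a` skews every subsequent centroid.

```python
# Forgetful centroid update: with a = 1.0 all history is kept,
# with a = 0.0 only the latest batch counts.

def update_centroid(centroid, weight, batch_mean, batch_count, a):
    new_weight = weight * a + batch_count
    merged = [(c * weight * a + m * batch_count) / new_weight
              for c, m in zip(centroid, batch_mean)]
    return merged, new_weight

# One cluster at 0.0 with weight 10 sees a batch of 10 points with mean 1.0.
keep_all, _ = update_centroid([0.0], 10, [1.0], 10, a=1.0)  # moves halfway
forget, _ = update_centroid([0.0], 10, [1.0], 10, a=0.0)    # jumps to batch
```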
- Add Java compatibility
I worked to fix a few methods that could not be called from Java because either the method arguments or the return type could not be represented in Java (e.g. those involving RDDs).
- User guides
I updated the user guides for using Evaluators with CrossValidator and ParamGrid (similar to GridSearchCV in scikit-learn) and for the LogisticRegressionSummary, but these have not been merged yet.
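A toy version of that pattern, with made-up parameter names and a stand-in evaluator: expand every combination of parameter values, score each with the evaluator, and keep the best.

```python
from itertools import product

def param_grid(grid):
    # Expand a dict of {param: [values]} into all combinations.
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

def select_best(grid, evaluate):
    # In Spark, evaluate() would be a cross-validated Evaluator metric.
    return max(param_grid(grid), key=evaluate)

grid = {"regParam": [0.01, 0.1], "maxIter": [10, 100]}
# A stand-in evaluator that prefers small regParam and many iterations.
best = select_best(grid, lambda p: p["maxIter"] - 100 * p["regParam"])
```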
I would like to thank Xiangrui Meng, Joseph Bradley, Davies Liu, and Feynman Liang for all the time spent on reviews and for guiding me on the tickets. I still think I am an okayish programmer with a very high-level idea of various ML algorithms, but hopefully I'm on the road to changing that. 🙂