Google Summer of Code – Wrap-up

I just wrapped up Google Summer of Code for the last time (at least as a student). Most of the time it was playing catch-up: learning while doing and hacking till it works. This surely takes more time than formal learning, but it is definitely more fun.

Most of the work involves MLlib and is minor considering the breakneck speed of ongoing development, but it feels good to have made a few changes. This blog post is meant to be a short write-up of the work accomplished during the summer and NOT a technical description of it.

  • Python API for streaming algorithms

Online algorithms like k-means, linear regression, and logistic regression (with gradient descent) cater to streaming data: the model state (e.g. weights or centroids) is stored and updated as new data arrives. The API is very similar to the partial_fit method of scikit-learn, except that the input in sklearn is array-like while in MLlib it arrives as DStreams. These are the pull requests; a minimal usage sketch follows them.

https://github.com/apache/spark/pull/6499

https://github.com/apache/spark/pull/6744

https://github.com/apache/spark/pull/6849
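
To give a flavour of the API, here is a minimal sketch of streaming linear regression in PySpark. The queueStream input and the toy data are stand-ins for a real streaming source (sockets, Kafka, etc.):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="StreamingLRSketch")
ssc = StreamingContext(sc, 1)  # 1-second batches

# Toy DStreams built from in-memory queues; a real job would read
# from a socket, Kafka, etc.
trainStream = ssc.queueStream([sc.parallelize(
    [LabeledPoint(2.0, Vectors.dense([1.0])),
     LabeledPoint(4.0, Vectors.dense([2.0]))])])
testStream = ssc.queueStream([sc.parallelize(
    [(6.0, Vectors.dense([3.0]))])])  # (true label, features) pairs

model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights(Vectors.dense([0.0]))

model.trainOn(trainStream)                  # weights update on every batch
model.predictOnValues(testStream).pprint()  # predictions stream out per batch

ssc.start()
ssc.awaitTerminationOrTimeout(10)
ssc.stop(stopSparkContext=True)
```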

  • Model save / load

Various models in MLlib (both the Scala backend and the Python API) lacked utilities for saving and loading. On save, the metadata is stored as JSON while the model data is stored in Parquet format, which can be read back as a DataFrame on load. Most of the hard work had already been done by Joseph Bradley in the initial PR, and mine was just follow-up work. A short sketch of the resulting API follows the links.

https://github.com/apache/spark/pull/7617 (Python save / load for GMMs)

https://github.com/apache/spark/pull/7587 (Python save / load for LDA)

https://github.com/apache/spark/pull/4986 (Scala save / load for GMMs)

https://github.com/apache/spark/pull/6948 (Scala save / load for LDA)

https://github.com/apache/spark/pull/5291 (Save / load for Word2Vec)
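
As a rough sketch of how this looks from Python, using a Gaussian mixture model, toy data, and a hypothetical /tmp path:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import GaussianMixture, GaussianMixtureModel
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="SaveLoadSketch")
data = sc.parallelize([Vectors.dense([x]) for x in [0.1, 0.2, 9.0, 9.5]])

gmm = GaussianMixture.train(data, k=2)

# save() writes JSON metadata plus the model parameters as Parquet;
# load() reads both back. The path is just an example.
gmm.save(sc, "/tmp/gmm_model")
sameModel = GaussianMixtureModel.load(sc, "/tmp/gmm_model")
print(sameModel.weights)
```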

  • Add API for missing models in Python

I added Python APIs for kernel density estimation, the element-wise product, and the Kolmogorov–Smirnov test (which quantifies how confident one can be that a sample was drawn from a given distribution). A usage sketch follows the links.

https://github.com/apache/spark/pull/6346 (Kernel Density)

https://github.com/apache/spark/pull/6387 (ElementWise Product)

https://github.com/apache/spark/pull/7430 (Kolmogorov-Smirnov test)
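
A minimal sketch of the two statistics APIs, with made-up sample data:

```python
from pyspark import SparkContext
from pyspark.mllib.stat import KernelDensity, Statistics

sc = SparkContext(appName="StatSketch")
sample = sc.parallelize([1.0, 1.5, 2.0, 2.5, 3.0])

# Kernel density: estimate the PDF of the sample at the given points,
# using a Gaussian kernel with the given bandwidth.
kd = KernelDensity()
kd.setSample(sample)
kd.setBandwidth(0.5)
print(kd.estimate([2.0, 2.5]))

# Kolmogorov-Smirnov test of the sample against N(0, 1).
result = Statistics.kolmogorovSmirnovTest(sample, "norm", 0.0, 1.0)
print(result.statistic, result.pValue)
```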

  • Distributed linear algebra

I helped review the addition of distributed linear algebra (i.e. RowMatrix, CoordinateMatrix, and IndexedRowMatrix) and added wrappers around PCA and SVD (which have not been merged yet). A sketch of the distributed matrix types follows the link.

https://github.com/apache/spark/pull/7963
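
Since the PCA and SVD wrappers were still unmerged at the time of writing, here is a sketch of just the distributed matrix types themselves, as exposed in pyspark.mllib.linalg.distributed, with toy data:

```python
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import (
    RowMatrix, IndexedRow, IndexedRowMatrix, CoordinateMatrix, MatrixEntry)

sc = SparkContext(appName="DistributedMatrixSketch")

# A RowMatrix is an RDD of local vectors, one per row.
mat = RowMatrix(sc.parallelize([[1.0, 2.0], [3.0, 4.0]]))
print(mat.numRows(), mat.numCols())

# An IndexedRowMatrix attaches a long index to each row.
imat = IndexedRowMatrix(sc.parallelize(
    [IndexedRow(0, [1.0, 2.0]), IndexedRow(1, [3.0, 4.0])]))

# A CoordinateMatrix stores (row, col, value) entries, suited to
# very sparse data.
cmat = CoordinateMatrix(sc.parallelize(
    [MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 4.0)]))

# The representations convert into one another.
print(imat.toRowMatrix().numRows(), cmat.toIndexedRowMatrix().numCols())
```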

  • Linear Algebra

I helped a bit with optimizing operations involving sparse matrices, with porting MatrixUDT (which makes sure that matrices can be used directly in PySpark DataFrames), and with pretty printing. A small sketch follows the links.

https://github.com/apache/spark/pull/5946 (Optimize dot products and squared_distances)

https://github.com/apache/spark/pull/6904 (Addition of numNonZeros and numActives)

https://github.com/apache/spark/pull/6579 (Making version checking robust)

https://github.com/apache/spark/pull/6354 (Porting MatrixUDT to PySpark)

https://github.com/apache/spark/pull/6342 (Pretty Printing of Matrices)

https://github.com/apache/spark/pull/7854 (Optimize SparseVector initializations)
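
A small sketch of pretty printing and of matrices living inside a DataFrame column (which is what MatrixUDT enables); the data is made up:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Matrices

sc = SparkContext(appName="MatrixUDTSketch")
sqlContext = SQLContext(sc)

# Dense values are stored column-major; the sparse matrix is in CSC form
# (here a 2x2 diagonal with entries 5 and 6).
dense = Matrices.dense(2, 2, [1.0, 3.0, 2.0, 4.0])
sparse = Matrices.sparse(2, 2, [0, 1, 2], [0, 1], [5.0, 6.0])

# Pretty printing of matrices.
print(dense)
print(sparse)

# MatrixUDT lets matrices be used directly as a DataFrame column.
df = sqlContext.createDataFrame([(1, dense), (2, sparse)], ["id", "matrix"])
df.show()
```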

  • Implement LogisticRegression summary

Summaries are designed to give the user quick access to how well the model performed on the training and test data, with as little extra code as possible. Here too the hard work had been done by Feynman Liang, and my goal was to follow a similar API for LogisticRegression. A sketch follows the link.

https://github.com/apache/spark/pull/7538
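
The summary landed on the Scala side first; the sketch below uses the PySpark surface that appeared in later releases (Spark 2.x assumed), with toy data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("LRSummarySketch").getOrCreate()
training = spark.createDataFrame([
    (0.0, Vectors.dense([2.0, 1.0])),
    (1.0, Vectors.dense([0.0, 1.1])),
    (0.0, Vectors.dense([2.0, 1.2])),
    (1.0, Vectors.dense([0.0, 1.3]))], ["label", "features"])

model = LogisticRegression(maxIter=10).fit(training)

# The training summary is attached to the fitted model, so no extra
# evaluation code is needed.
summary = model.summary
print(summary.areaUnderROC)      # quality on the training data
print(summary.objectiveHistory)  # loss per iteration
summary.roc.show()               # ROC curve as a DataFrame
```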

  • Add missing methods to models

While the basic Python models for the Pipeline API had already been written, wrappers had to be added so that the model attributes could be accessed. These are fairly straightforward to write; a Word2Vec example follows the links.

https://github.com/apache/spark/pull/7930 (TreeModels)

https://github.com/apache/spark/pull/7263 (Word2Vec Python API)

https://github.com/apache/spark/pull/7095 (Word2Vec ML)

https://github.com/apache/spark/pull/7086 (StandardScaler)
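
As an example of such attribute wrappers, here is a sketch using the ml Word2Vec model on a toy corpus (Spark 2.x session API assumed):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("Word2VecSketch").getOrCreate()
doc = spark.createDataFrame([
    ("a b c".split(" "),),
    ("a b b c a".split(" "),)], ["text"])

model = Word2Vec(vectorSize=3, minCount=0,
                 inputCol="text", outputCol="vectors").fit(doc)

# The wrappers call into the underlying Scala model to expose
# its attributes on the Python side.
model.getVectors().show()          # word embeddings as a DataFrame
model.findSynonyms("a", 2).show()  # nearest words by cosine similarity
```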

  • Spark Infra changes

I wrote a couple of bash scripts (my first couple) to automate the conversion of since Javadoc tags to Since annotations, and added pylint checks to accompany the pep8 checks that were already present.

https://github.com/apache/spark/pull/7241 (Adding Pylint Checks)

https://github.com/apache/spark/pull/8352

  • Miscellaneous bug fixes

I fixed a few bugs while studying the Scala code myself. Some of them are:

https://github.com/apache/spark/pull/6383 (wrong normalization of distributions in KernelDensity)

https://github.com/apache/spark/pull/6720 (remove unnecessary construct in StreamingAlgorithms)

https://github.com/apache/spark/pull/6497 (wrong decayFactor set in StreamingKMeans)

  • Add Java compatibility

I worked on fixing a few methods that could not be called from Java because either the method arguments or the return type were not usable from Java (e.g. those involving RDDs).

https://github.com/apache/spark/pull/8126

  • User guides

I updated the user guides for using Evaluators with CrossValidator and ParamGridBuilder (similar to grid search in sklearn) and for the LogisticRegression summary, but these have not been merged yet. A sketch of the cross-validation workflow is below.
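
A rough sketch of that workflow, with made-up data (Spark 2.x session API assumed):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("CVSketch").getOrCreate()
dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

lr = LogisticRegression()
# ParamGridBuilder plays the role of sklearn's parameter grid.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
evaluator = BinaryClassificationEvaluator()

# The evaluator picks the best model across the grid via k-fold CV.
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cvModel = cv.fit(dataset)
print(evaluator.evaluate(cvModel.transform(dataset)))
```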

I would like to thank Xiangrui Meng, Joseph Bradley, Davies Liu, and Feynman Liang for all the time spent on reviews and for guiding me through the tickets. I still think I am an okayish programmer with a very high-level idea of various ML algorithms, but hopefully I'm on the road to changing that. 🙂


6 comments

  1. Joseph Bradley

    Hi Manoj, I just wanted to thank you for all of your work this summer. You are a bit too humble in your post. You’ve been a huge help as one of the most active MLlib contributors, and at the rate you learned, I’m sure you’ll be able to do great things. Speaking of doing great things, I hope you can continue to contribute in your spare time!

    1. Thanks a lot for your kind words. I will try to continue contributing to the project in my free time and keep learning. 🙂

  2. Congratulations Manoj!

    1. Thanks a lot, Fabian 🙂

  3. Manoj, congrats on finishing GSoC! Small pull requests added up over the summer to big contributions to Spark MLlib. I really enjoyed working with you, and I greatly appreciate the extra time you spent on code review and documentation. Looking forward to more pull requests and review comments from you:)

    1. Thanks. 🙂 Looking forward to working with you in the future as well! 🙂
