I was postponing this last post until the last of my pull requests got merged. Now that it has been merged, I no longer have any reason to procrastinate. This is the work I have done over the summer, with a short description of each project.

(Just in case you were wondering why the “another” in the title, https://manojbits.wordpress.com/2013/09/27/the-end-of-a-journey/ )

1. **Improved memory management in the coordinate descent code.**

**Status**: merged

**Pull Request**: https://github.com/scikit-learn/scikit-learn/pull/3102

Changing the backend from multiprocessing to threading by releasing the GIL, and replacing the Python function calls with pure cblas calls. A huge improvement, 3x – 4x in terms of memory, was seen without compromising much on speed.

2. **Randomised coordinate descent**

**Status**: merged

**Pull Request**: https://github.com/scikit-learn/scikit-learn/pull/3335

Updating one feature at a time, chosen randomly with replacement, instead of cycling through all features in order can make the descent converge more quickly.
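To illustrate the idea, here is a plain NumPy sketch of coordinate descent for the Lasso with an optional random coordinate choice. This is not the Cython implementation in the PR; `lasso_cd` and its arguments are made up for this post.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=1000, random=False, seed=0):
    """Coordinate descent for min_w 0.5*||y - Xw||^2 + alpha*||w||_1."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    rng = np.random.RandomState(seed)
    col_sq = (X ** 2).sum(axis=0)          # per-feature squared norms
    for it in range(n_iter):
        # cyclic sweep vs. sampling a coordinate with replacement
        j = rng.randint(n_features) if random else it % n_features
        # partial residual correlation, excluding feature j's own contribution
        rho = X[:, j] @ (y - X @ w) + col_sq[j] * w[j]
        # soft-thresholding update for the l1 penalty
        w[j] = np.sign(rho) * max(abs(rho) - alpha, 0) / col_sq[j]
    return w
```

Both variants drive the objective down; the randomised one simply picks its next coordinate by sampling instead of sweeping.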

3. **Logistic Regression CV**

**Status**: merged

**Pull Request**: https://github.com/scikit-learn/scikit-learn/pull/2862

Fitting a cross-validation path across a grid of Cs, with new solvers based on newton_cg and lbfgs. For high-dimensional data, the warm start makes these solvers converge faster.

4. **Multinomial Logistic Regression**

**Status**: merged

**Pull Request**: https://github.com/scikit-learn/scikit-learn/pull/3490

Minimising the cross-entropy loss instead of doing an OvA (one-vs-all) fit across all classes. This results in better probability estimates for the predicted classes.
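The multinomial objective can be written down in a few lines of NumPy. This is a hedged sketch for the post; `softmax_xent` is my own name, not the function in the PR.

```python
import numpy as np

def softmax_xent(W, X, Y):
    """Multinomial cross-entropy loss.

    W: weights of shape (n_features, n_classes)
    X: data of shape (n_samples, n_features)
    Y: one-hot labels of shape (n_samples, n_classes)
    """
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -(Y * log_probs).sum() / X.shape[0]
```

With all-zero weights every class gets probability 1/K, so the loss is log(K); any informative fit pushes it below that.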

5. **Strong Rules for coordinate descent**

**Status**: Work in Progress

**Pull Request**: https://github.com/scikit-learn/scikit-learn/pull/3579

Rules which help skip over non-active features. I am working on this and it should be open for review in a few days.

Apart from these, I have worked on a good number of minor bug fixes and enhancements, including exposing the n_iter parameter across all estimators, fixing the incomplete download of the newsgroups dataset, and making the max_iter parameter configurable in liblinear.

I would like to thank my mentor Alex, who is the best mentor one can possibly have (I’m not just saying this in the hope that he will pass me :P), as well as Jaidev, Olivier, Vlad, Arnaud, Andreas, Joel, Lars, and the entire scikit-learn community for helping me complete an important project to a satisfying extent. (It is amazing how much people manage to contribute in spite of having other full-time jobs.) I will be contributing to scikit-learn full-time until at least December as part of my internship.

EDIT: And of course Gael (how did I forget), the awesome project manager who is always full of enthusiasm and encouragement.

As they say, one journey ends for another to begin. The show must go on.

Anyhow, on a more positive note: recently one of my biggest pull requests got merged ( https://github.com/scikit-learn/scikit-learn/pull/2862 ), and we shall have a quick look at the background, what it can do, and what it cannot.

1. **What is Logistic Regression**?

A Logistic Regression is a classification model that uses the logistic sigmoid function. The basic idea is to fit a weight vector w such that the linear score w.x, passed through the logistic function sigma(t) = 1 / (1 + exp(-t)), matches the labels. A quick look at the graph (taken from Wikipedia) shows that when the label is one, we need our estimator to push w.x towards infinity, and vice versa.

Now if we want to fit labels in [-1, 1], the sigmoid function becomes P(y | x) = 1 / (1 + exp(-y * w.x)). The logistic loss function is then L(w) = sum_i log(1 + exp(-y_i * w.x_i)). Intuitively this seems correct: when y is 1, our estimator must push w.x towards infinity to suffer zero loss, and similarly, when y is -1, it must push w.x towards minus infinity. Our basic focus is to optimise this loss.
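This logistic loss for ±1 labels is a one-liner in NumPy; a sketch for this post, not the PR's code (`logaddexp` keeps the exponential numerically stable).

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w.x_i)) for labels y_i in {-1, +1}."""
    z = y * (X @ w)
    # np.logaddexp(0, -z) == log(1 + exp(-z)), computed stably for large |z|
    return np.sum(np.logaddexp(0, -z))
```

Sanity check: at w = 0 every sample contributes log(2), so the loss is n_samples * log(2).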

2. **How can this be done?**

This can be done either with block coordinate descent methods, as lightning does, or with the solvers that scipy provides, such as newton-cg and lbfgs. For the newton-cg solver we need the Hessian, i.e. the matrix of second derivatives of the loss, and for the lbfgs solver we need the gradient vector. If you are too lazy to do the math (like me?), both can be derived from the loss above.
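As a sketch of how a solver sees this, the loss and its gradient can be handed to scipy's lbfgs wrapper. The names here are mine, not the PR's, and there is no regularisation term in this toy version.

```python
import numpy as np
from scipy.optimize import minimize

def loss_and_grad(w, X, y):
    """Logistic loss and gradient for labels y in {-1, +1}."""
    z = y * (X @ w)
    loss = np.sum(np.logaddexp(0, -z))
    # d/dw sum_i log(1 + exp(-z_i)) = -X.T @ (y / (1 + exp(z)))
    grad = -X.T @ (y / (1.0 + np.exp(z)))
    return loss, grad

# Handing both loss and gradient to scipy's L-BFGS solver, e.g.:
# res = minimize(loss_and_grad, np.zeros(X.shape[1]), args=(X, y),
#                jac=True, method='L-BFGS-B')
```

A finite-difference check against the analytic gradient is a cheap way to catch sign errors before wiring this into a solver.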

3. **Doesn’t scikit-learn have a Logistic Regression already**?

Oh well, it does, but it depends on an external library called liblinear. There are two major problems with this.

a] Warm start: one cannot warm start with liblinear, since it does not accept an initial coefficient parameter, unless we patch the shipped liblinear code.

b] Penalization of the intercept. Penalization is done so that the estimator does not overfit the data; however, the intercept is independent of the data (it can be considered analogous to a column of ones), so it does not make much sense to penalize it.

4. **Things that I learnt**

Apart from adding a warm start (there seems to be a sufficient gain on large datasets) and not penalizing the intercept:

a] refit parameter – generally, after cross-validating, we take the average of the scores obtained across all folds, and the final fit is done with the hyperparameter (in this case C) that corresponds to the best score. However, Gael suggested that one could take the best hyperparameter of each fold (in terms of score) and average those coefficients and hyperparameters. This avoids the final refit.
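A toy version of that averaging; the array shapes and the helper name are my own, not the PR's.

```python
import numpy as np

def average_best_per_fold(coefs, scores, Cs):
    """coefs: (n_folds, n_Cs, n_features); scores: (n_folds, n_Cs).

    Pick the best C per fold, then average the corresponding
    coefficients and C values instead of refitting."""
    best = scores.argmax(axis=1)                    # best C index per fold
    w = np.mean([coefs[k, b] for k, b in enumerate(best)], axis=0)
    C = np.mean(Cs[best])
    return w, C
```

The refit route would instead average the scores over folds, pick one C, and fit once more on the full data.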

b] Parallel OvA – for each label, we perform an OvA, that is, we convert the label in question into 1 and all the other labels into -1. There is a Parallel loop across all labels and folds, which is supposed to make it faster.
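The label conversion itself is a one-liner (a sketch; `binarize_ova` is a made-up name):

```python
import numpy as np

def binarize_ova(y, label):
    """One-vs-all target: +1 for `label`, -1 for every other class."""
    return np.where(y == label, 1, -1)

y = np.array([0, 1, 2, 1])
binarize_ova(y, 1)   # array([-1,  1, -1,  1])
```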

c] Class weight support: the easiest way to do it is to convert class weights into per-sample weights and multiply each sample's loss by its weight. But we faced a small problem when the following three conditions came together: a class-weight dict, the liblinear solver, and a multiclass problem, since liblinear does not support sample weights.
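The conversion from class weights to sample weights is straightforward (a sketch with a made-up helper name):

```python
import numpy as np

def expand_class_weight(class_weight, y):
    """Turn a class-weight dict into one weight per sample."""
    return np.array([class_weight[label] for label in y])

expand_class_weight({0: 1.0, 1: 5.0}, np.array([0, 1, 1]))
# array([1., 5., 5.])
```

Each sample's loss term is then multiplied by its weight, which is exactly why a solver that only accepts class-level weights (like liblinear) cannot express this in the multiclass case.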

5. **Problems that I faced.**

a] The fit_intercept=True case turned out to be considerably slower than the fit_intercept=False case. Gael's hunch was that this is because the intercept varies on a different scale than the data. We tried different things, such as preconditioning the intercept, i.e. dividing the initial coefficient by the square root of the diagonal of the Hessian, but it did not work, and it took one and a half days of time.

b] Using liblinear as an optimiser or a solver for the OvA case.

i] Using liblinear as a solver means supplying the multi-label problem directly to liblinear.train. This would affect the parallelism, and we are not sure whether liblinear internally works the same way as we think it does. So after a hectic day of refactoring code, we finally decided (sigh) that using liblinear as an optimiser is better (i.e. we convert the labels to 1 and -1). For more details, have a look at Gael's comment: https://github.com/scikit-learn/scikit-learn/pull/2862#issuecomment-49450285

Phew, this was a long post, and I'm not sure I typed everything as I wanted to. This is what I plan to accomplish in the coming month:

1. Finish work on Larsmans PR

2. Look at glmnet for further improvements in the cd_fast code.

3. ElasticNet regularisation from Lightning.

1. **Fixing precompute for ElasticNetCV**

The function argument precompute=”auto” was being ignored in ElasticNetCV, as mentioned in my previous post. Setting precompute to "auto" uses the Gram variant of the input matrix, which according to the documentation is **np.dot(X.T, X)**. This theoretically helps the descent algorithm converge faster. (However, at the time of writing, I do not know exactly how.) In practice, though (and after testing with the line profiler), it seems to be a bit slower, since computing the Gram matrix takes quite a bit of time. So, on ogrisel's advice, I split the work across three Pull Requests. All three are essentially easy fixes.

1. https://github.com/scikit-learn/scikit-learn/pull/3247 – This ensures that the Gram variant is used if precompute is set to True, or to "auto" when n_samples > n_features.

2. https://github.com/scikit-learn/scikit-learn/pull/3248 – Removes precompute from the Multi Task models, since it is unused.

3. https://github.com/scikit-learn/scikit-learn/pull/3249 – This is a WIP that changes the default precompute from "auto" to False.
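The payoff of precomputing the Gram matrix is that each coordinate update can avoid touching the full data matrix: the residual correlation it needs can be read off np.dot(X.T, X) and np.dot(X.T, y). A quick numerical check (the variable names are mine):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)          # tall matrix: n_samples > n_features
y = rng.randn(1000)

gram = X.T @ X                  # (5, 5), computed once up front
Xy = X.T @ y                    # (5,)

# The quantity a coordinate update needs, x_j.T @ (y - X @ w),
# can be formed from the precomputed (n_features, n_features) pieces:
w = rng.randn(5)
j = 2
direct = X[:, j] @ (y - X @ w)
via_gram = Xy[j] - gram[j] @ w
assert np.allclose(direct, via_gram)
```

Whether this wins in practice depends on how the one-off cost of forming the Gram matrix compares with the savings per sweep, which is exactly the trade-off the benchmarks above ran into.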

2. **Threading backend for Linear Models.**

I have successfully changed the backend from multiprocessing to threading, after releasing the GIL in all four variants. After a final round of review it can be merged:

a] Simple coordinate descent

b] Sparse coordinate descent

c] Gram variant

d] MultiTask variant

There is a huge memory gain, and speed is almost the same (if not slightly better) with this Pull Request: https://github.com/scikit-learn/scikit-learn/pull/3102

3. **Logistic Regression CV**

Reading someone else’s magically vectorised NumPy code isn’t an easy task and I somehow crawled my way through it (which explains the more productive first part).

I fixed a bug in the code that computes the Hessian when fit_intercept is True. I've also fixed sparse matrix support, added multiple tests, and confirmed that the newton-cg and lbfgs solvers give exactly the same result. The liblinear result differs slightly due to the penalisation of the intercept.

However, benchmarking gives ambiguous results. On standard datasets such as the newsgroups and digits data, the liblinear solver is almost always the fastest. However, on datasets generated with make_classification, lbfgs seems to be the faster solver.

Right now, my job is just to wait for comments from Alex and Olivier and make the necessary changes. I shall come up with a more detailed description on Log Reg CV next week.

1. **Got the memory profiler working**.

https://github.com/fabianp/memory_profiler is a wonderful tool built by Fabian that gives a line-by-line report of the memory being used. You can install it like any other Python package by doing

```shell
sudo python setup.py install
```

a] You can use it by simply importing it at the top of the file,

```python
from memory_profiler import profile
```

and adding **@profile** above the function that you want to profile.
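A minimal sketch of such a decorated function; the function name is made up, and the try/except fallback is mine so the file still runs when memory_profiler is not installed.

```python
# Fall back to a no-op decorator when memory_profiler is unavailable.
try:
    from memory_profiler import profile
except ImportError:
    def profile(func):
        return func

@profile
def allocate():
    data = list(range(1_000_000))   # a visible per-line allocation
    return len(data)
```

When memory_profiler is installed, calling `allocate()` prints a line-by-line memory report; either way the function itself behaves normally.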

As mentioned in the last post, this was giving ambiguous results, since it wasn't taking the child processes into account. There is a workaround: using the **mprof -c** command directly, which produces plots showing how much memory is being used. (Thanks to Olivier for helping me out with this.)

We can see that threading has a considerable advantage with respect to memory: with threading, the memory of the input data is shared by all worker threads, while with multiprocessing each process needs its own copy.

2. While I was trying to release the GIL for the Gram case, I found that the optimisation algorithm **cd_fast.enet_coordinate_descent_gram** wasn't being used at all! I confirmed with Alexandre that it was indeed a refactoring bug, so I've sent a Pull Request to fix it: https://github.com/scikit-learn/scikit-learn/pull/3220

The Gram matrix, which is actually **np.dot(X.T, X)** (used by the coordinate_descent_gram path), is computed in either of these two cases:

1. When precompute is set to be true.

2. When precompute is set to "auto" and n_samples is greater than n_features (the default)

However, contrary to intuition, it actually slows things down (maybe because of the time taken to compute the Gram matrix), and hence I've changed the default to False in the PR.

3. **Started with the LogisticRegression CV PR**

I started reading the source code of the Logistic Regression CV Pull Request, and I could understand some of it. I pushed a commit to parallelize a computation involving the regularization parameter, but I found out that it actually slows things down (:-X).

By the way, thanks to the people at Rackspace Cloud for providing an account to help me with my benchmark sessions and a perfect opportunity to learn vim(!).

Things to be done by next week:

1. Change the LogisticRegressionCV PR from a WIP to MRG.

2. Get the releasing GIL PR merged, with the help of other mentors.

P.S: For those who follow cricket, whenever I'm really slow at code, I take inspiration from M.S. Dhoni (the Indian captain), who starts off really slow but then finishes off in style every time (well, almost).

**Q: What is joblib.Parallel and why is it used?**

For computationally intensive numerical work, it is always beneficial to split the work among separate CPU cores, and the joblib API makes it really easy to do so.

```python
from math import sqrt
from sklearn.externals.joblib import Parallel, delayed

# Across one core
list_ = [sqrt(i ** 2) for i in range(10)]

# Across 2 cores: Parallel(n_jobs=x)(delayed(func)(arg) for arg in iterable)
list_ = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
```

In the fit method of ElasticNetCV, we can see the following lines.

```python
jobs = (delayed(_path_residuals)(X, y, train, test, self.path, path_params,
                                 alphas=this_alphas, l1_ratio=this_l1_ratio,
                                 X_order='F', dtype=np.float64)
        for this_l1_ratio, this_alphas in zip(l1_ratios, alphas)
        for train, test in folds)
mse_paths = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(jobs)
```

Let us say the number of folds is 10, the number of alphas is 100, and the number of l1_ratios is 3; then enet_coordinate_descent has to be run at least 3000 times, so it helps a lot to split the work across different cores. However, in the sqrt example above using for i in range(10), you can clearly see (if you use the IPython magic %timeit) that the loop with Parallel runs much slower, due to the overhead.

**Q: What is backend = “multiprocessing” or “threading” in Parallel?**

**Multiprocessing:**

Multiprocessing is the default backend of joblib. When it is used, there are multiple processes, including the parent process, across different CPU cores. One must remember that the input data is duplicated in memory when it is smaller than 1e6 bytes; when it is larger than 1e6 bytes, the memory of the data given to Parallel is shared by all the child processes.

**Threading**

Threading is the more efficient of the two, both speed-wise and memory-wise (since memory is shared between all the threads). Then why isn't it being used? Because of the Global Interpreter Lock (GIL). The GIL prevents different threads from executing Python code simultaneously across different cores, which makes threading less efficient for CPU-bound work. A thread is said to acquire the GIL when it runs, and to release it when it gives one of its fellow threads a chance. For a more detailed overview of the GIL, have a look at this PyCon talk: http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2010-understanding-the-python-gil-82-3273690

**Q: How to release the GIL to efficiently use threading?**

If your code is in Cython, you can release the GIL by using a "with nogil" block and replacing the Python / NumPy calls with raw cblas calls. For instance:

```cython
# NumPy
np.dot(x, y)

# CBLAS, assuming x and y have shape (n,)
cblas_ddot(n, <double*>&x[0], 1, <double*>&y[0], 1)
```

Just to help you visualize,

In my branch, releasing the GIL helps the threads run concurrently across all four CPUs.

In master, you can see that not all the cores are at full utilisation.

**Q: All this sounds great, but what have you done this week?**

Most of this week was spent replacing the NumPy calls with cblas calls and running benchmarks, which I've dumped here: https://github.com/MechCoder/Sklearn_benchmarks . From my benchmarks it seems that threading has a slight speed advantage over multiprocessing.

However, the memory benchmarks using memory_profiler have been ambiguous. Thankfully, Olivier has sent a PR to resolve the issue, and next week I will be back with better updates. Cheers.

(Oh, and I almost forgot: I helped fix this bug, superquick: https://github.com/scikit-learn/scikit-learn/pull/3178 )

Thanks to Alex, Gael, Vlad and the other mentors for helping me put together a decent proposal from almost nowhere. As part of the community bonding period, I am working on these two Pull Requests, hoping I can get them merged by May 19th:

1. https://github.com/scikit-learn/scikit-learn/pull/3087

2. https://github.com/scikit-learn/scikit-learn/pull/3102

Cheers to a more challenging and exciting summer with NumPy + Cython.
