Releasing the GIL and coordinate descent

Hi, the first week of (another) Google Summer of Code has ended, and I just realised that it is not going to be a joyride. Most of this week was spent benchmarking and trying to complete (?) this PR, so I did not write much code. However, I learnt quite a bit in my attempt to complete it: https://github.com/scikit-learn/scikit-learn/pull/3102

Q: What is joblib.Parallel and why is it used?
For computationally intensive numerical work, it is almost always beneficial to split the work among separate CPU cores, and the joblib.Parallel API makes it really easy to do so.

from math import sqrt

# On a single core.
list_ = [sqrt(i**2) for i in range(10)]

# Across 2 cores.
from sklearn.externals.joblib import Parallel, delayed

# Parallel(n_jobs=x)(delayed(func)(arg) for arg in iterable)
list_ = Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(10))

In the fit method of ElasticNetCV, we can see the following lines.

jobs = (delayed(_path_residuals)(X, y, train, test, self.path,
                                 path_params, alphas=this_alphas,
                                 l1_ratio=this_l1_ratio, X_order='F',
                                 dtype=np.float64)
        for this_l1_ratio, this_alphas in zip(l1_ratios, alphas)
        for train, test in folds)
mse_paths = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(jobs)

Let us say the number of folds is 10, the number of alphas is 100, and the number of l1_ratios is 3; then enet_coordinate_descent has to be run at least 3000 times, and it helps a lot to split that work across different cores. However, in the toy example above using for i in range(10), you can clearly see (using the IPython magic %timeit) that the loop with Parallel runs much slower, because the parallelisation overhead dwarfs the tiny amount of work per task.
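To make the overhead concrete, here is the kind of quick comparison I mean (a toy benchmark; the exact numbers will vary by machine):

from math import sqrt
from sklearn.externals.joblib import Parallel, delayed

# In IPython:
# %timeit [sqrt(i**2) for i in range(10)]
# -> microseconds: the work itself is trivial.
# %timeit Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(10))
# -> milliseconds or more: spawning workers and shipping the tiny
#    tasks to them costs far more than the computation saves.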

Q: What is backend="multiprocessing" or backend="threading" in Parallel?

Multiprocessing:
Multiprocessing is the default backend of joblib. When it is used, there are multiple processes, including the parent process, running across different CPU cores. One must remember that the input data gets duplicated in memory for each child process when it is smaller than about 1e6 bytes; above that threshold, joblib memmaps the data so that it is shared by all the child processes.
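As a rough sketch of what that means in practice (a hypothetical example; max_nbytes is the joblib parameter that controls the threshold, assuming a joblib version recent enough to expose it):

import numpy as np
from sklearn.externals.joblib import Parallel, delayed

big = np.ones(10**7)   # ~8e7 bytes: memmapped and shared by the children
small = np.ones(10)    # below the threshold: copied into each child

# Each child process reads the same memmapped array instead of
# receiving its own pickled copy.
sums = Parallel(n_jobs=2)(delayed(np.sum)(big) for _ in range(4))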

Threading:
Threading would be the more efficient of the two, both speed-wise (no process startup or communication overhead) and memory-wise (memory is shared between all the threads). Then why isn't this being used? Well, it is because of the Global Interpreter Lock (GIL). The GIL prevents different threads from executing Python bytecode simultaneously across different cores, which makes threading much less effective for CPU-bound work. A thread is said to acquire the GIL when it runs, and to release it when it gives one of its fellow threads the chance to run. For a more detailed overview of the GIL, please have a look at this PyCon talk: http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2010-understanding-the-python-gil-82-3273690
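Switching backends is just an argument to Parallel (a minimal sketch, assuming a joblib version recent enough to accept the backend keyword):

from math import sqrt
from sklearn.externals.joblib import Parallel, delayed

# Same jobs, two different execution strategies.
out_mp = Parallel(n_jobs=2, backend="multiprocessing")(
    delayed(sqrt)(i**2) for i in range(10))
out_th = Parallel(n_jobs=2, backend="threading")(
    delayed(sqrt)(i**2) for i in range(10))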

Q: How to release the GIL to efficiently use threading?
If your code is in Cython, you can release the GIL by using a "with nogil" block and replacing the Python / NumPy calls with raw cblas calls that do not touch any Python objects. For instance:

# NumPy (needs the GIL)
np.dot(x, y)

# CBLAS equivalent, assuming x and y are contiguous 1-D double arrays
# of length n; a raw C call like this can run inside a nogil block:
cdef double result
with nogil:
    result = cblas_ddot(n, <double*>&x[0], 1, <double*>&y[0], 1)
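To see why holding the GIL hurts, here is a toy pure-Python illustration (not from the PR): a CPU-bound function gains nothing from threads, because only one thread can execute Python bytecode at a time.

import threading
import time

def busy(n=10**7):
    s = 0
    for i in range(n):
        s += i

start = time.time()
busy()
busy()
print("sequential:", time.time() - start)

start = time.time()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("2 threads: ", time.time() - start)  # roughly the same, or worse

Once the hot loop releases the GIL, as in the Cython snippet above, the threads really do run in parallel.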

Just to help you visualize: in my branch, releasing the GIL lets the threads run concurrently across all four CPUs.

[Screenshot from 2014-05-20 21:59:54: CPU monitor with all four cores fully utilised]

In master, you can see that not all the cores are working at full capacity.

[Screenshot from 2014-05-21 00:50:22: CPU monitor with cores below full utilisation]

Q: All this sounds great, but what have you done this week?
Most of this week was spent replacing the NumPy calls with cblas calls and running benchmarks, which I've dumped here: https://github.com/MechCoder/Sklearn_benchmarks . From my benchmarks it seems that threading has a slight speed advantage over multiprocessing.

[Benchmark plot: 77eb9eb4-e0b0-11e3-8b75-560fcc381251]

However, the memory benchmarks using memory_profiler have been ambiguous. Thankfully, Olivier has sent a PR to resolve the issue, and next week I will be back with better updates. Cheers.
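For the curious, this is the kind of measurement involved (a hypothetical sketch, assuming memory_profiler is installed; the real benchmarks are in the repository linked above):

import numpy as np
from memory_profiler import memory_usage
from sklearn.linear_model import ElasticNetCV

X = np.random.randn(200, 50)
y = np.random.randn(200)

def fit():
    ElasticNetCV(n_jobs=2).fit(X, y)

# memory_usage samples the process's memory while fit() runs.
usage = memory_usage(fit)
print("peak memory (MiB):", max(usage))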

(Oh, and I almost forgot: I helped in fixing this bug, superquick 😀 https://github.com/scikit-learn/scikit-learn/pull/3178 )
