Hi, the first week of (another) Google Summer of Code has ended, and I have just realised that it is not going to be a joyride. Most of this week was spent benchmarking and trying to complete this PR, so I did not write much code. However, I learnt quite a bit in the attempt: https://github.com/scikit-learn/scikit-learn/pull/3102 .
Q: What is joblib.Parallel and why is it used?
For computationally intensive numerical work, it is often beneficial to split the work across separate CPU cores, and joblib's Parallel API makes it really easy to do so.
# Across one core
list_ = [sqrt(i**2) for i in range(10)]

# Across 2 cores.
from sklearn.externals.joblib import Parallel, delayed

# Pattern: Parallel(n_jobs=x)(delayed(func)(arg) for arg in iterable)
list_ = Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(10))
In the fit method of ElasticNetCV, we can see the following lines.
jobs = (delayed(_path_residuals)(X, y, train, test, self.path,
                                 path_params, alphas=this_alphas,
                                 l1_ratio=this_l1_ratio, X_order='F',
                                 dtype=np.float64)
        for this_l1_ratio, this_alphas in zip(l1_ratios, alphas)
        for train, test in folds)
mse_paths = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(jobs)
Say the number of folds is 10, the number of alphas is 100, and the number of l1_ratios is 3; then enet_coordinate_descent has to be run at least 3000 times, and splitting that work across different cores helps a lot. However, in the toy example above using for i in range(10), you can clearly see (with the IPython magic %timeit) that the Parallel loop runs much slower, because of the dispatch overhead.
Q: What is backend = “multiprocessing” or “threading” in Parallel?
Multiprocessing is the default backend of joblib. When it is used, there are multiple processes, including the parent process, spread across different CPU cores. One must remember that the input data is duplicated in each child process when it is smaller than about 1e6 bytes; above that threshold, joblib memory-maps the data so that it is shared by all the child processes.
Threading is the more efficient of the two, both speed-wise and memory-wise (since memory is shared between all the threads). Then why isn’t it the default? Because of the Global Interpreter Lock (GIL). The GIL prevents different threads from executing Python bytecode simultaneously across different cores, which makes threading far less useful for CPU-bound code. A thread is said to acquire the GIL when it runs, and to release it when it gives one of its fellow threads a chance to run. For a more detailed overview of the GIL, have a look at this PyCon talk: http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2010-understanding-the-python-gil-82-3273690
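Here is a small stdlib experiment (my own toy, not from the PR) showing the point: two CPU-bound threads still produce a correct result, but since the GIL lets only one of them execute Python bytecode at a time, there is no speedup to be had from pure-Python loops:

```python
import threading

def sum_squares(lo, hi, out, idx):
    # Pure-Python CPU-bound loop: it holds the GIL while running.
    total = 0
    for i in range(lo, hi):
        total += i * i
    out[idx] = total

out = [0, 0]
threads = [threading.Thread(target=sum_squares, args=(0, 50_000, out, 0)),
           threading.Thread(target=sum_squares, args=(50_000, 100_000, out, 1))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The threads interleave on the GIL rather than run in parallel,
# but the combined answer still matches the serial computation.
print(sum(out) == sum(i * i for i in range(100_000)))
```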
Q: How to release the GIL to efficiently use threading?
If your code is in Cython, you can release the GIL by using a “with nogil” block and replacing the Python / NumPy calls with raw CBLAS calls. For instance:
# NumPy
np.dot(x, y)

# CBLAS, assuming x and y are contiguous double arrays of shape (n,)
cblas_ddot(n, <double*>&x[0], 1, <double*>&y[0], 1)
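The same BLAS routine is also reachable from plain Python through SciPy's low-level wrappers. This sketch (assuming SciPy is installed; it is not part of the Cython code above) just checks that ddot agrees with np.dot on 1-D double arrays:

```python
import numpy as np
from scipy.linalg.blas import ddot

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# np.dot and the raw BLAS ddot compute the same inner product.
print(np.dot(x, y))  # 32.0
print(ddot(x, y))    # 32.0
```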
In master, you can see that not all the cores are used to their full capacity.
Q: All this sounds great, but what have you done this week?
Most of this week was spent replacing the NumPy calls with CBLAS calls and running benchmarks, which I’ve dumped here: https://github.com/MechCoder/Sklearn_benchmarks . From my benchmarks, it seems that threading has a slight speed advantage over multiprocessing.
However, the memory benchmarks using memory_profiler have been ambiguous. Thankfully, Olivier has sent a PR to resolve the issue, and next week I will be back with better updates. Cheers.
(Oh, and I almost forgot: I helped in fixing this bug, super quick 😀 https://github.com/scikit-learn/scikit-learn/pull/3178 )