Hi, It has been a long time since I had posted something on my blog. I had the opportunity to participate in the scikit-learn sprint recently, with the majority of the core-developers. The experience was awesome, but most of the time I had no idea what people were talking about, and I realised I have to learn a lot. I read somewhere that if you need to keep improving in life, you need to make sure the worst person in the job, and if that is meant to be true, I’m well on the right track.
Anyhow on a more positive note, recently one of my biggest pull requests got merged, ( https://github.com/scikit-learn/scikit-learn/pull/2862 ) and we shall have a quick look at the background, what it can do and what it cannot.
1. What is Logistic Regression?
A Logistic Regression is a regression model that uses the logistic sigmoid function to predict classification. The basic idea is to predict the feature vector sucht that it fits the Logistic_log function, . A quick look at the graph (taken from wikipedia), when is one, we need our estimator to predict to be infinity and vice versa.
Now if we want to fit labels [-1, 1] the sigmoid function becomes . The logistic loss function is given by, . Intuitively this seems correct because when y is 1, we need our estimator to predict to be infinity, to suffer zero loss. Similarly when y is -1, we need out estimator to predict to be -1. Our basic focus is to optimize for loss.
2. How can this be done?
This can be done either using block coordinate descent methods, like lightning does, or use the inbuilt solvers that scipy provides like newton_cg and lbfgs. For the newton-cg solver, we need the hessian, or more simply the double derivative matrix of the loss and for the lbfgs solver we need the gradient vector. If you are too lazy to do the math (like me?), you can obtain the values from here, Hessian
3. Doesn’t scikit-learn have a Logistic Regression already?
Oh well it does, but it is dependent on an external library called Liblinear. There are two major problems with this.
a] Warm start, (one cannot warm start, with liblinear since it does not have a coefficient parameter), unless we patch the shipped liblinear code.
b] Penalization of intercept. Penalization is done so that the estimator does not overfit the data, however the intercept is independent of the data (which can be considered analogous to a column of ones), and so it does not make much sense to penalize it.
4. Things that I learnt
Apart from adding a warm start, (there seems to be a sufficient gain in large datasets), and not penalizing the intercept,
a] refit paramter – generally after cross-validating, we take the average of the scores obtained across all folds, and the final fit is done according to the hyperparameter (in this case C) that corresponds to the perfect score. However Gael suggested that one could take the best hyperparameter across every fold (in terms of score) and average these coefficients and hyperparameters. This would prevent the final refit.
b] Parallel OvA – For each label, we perform a OvA, that is to convert into 1 for the label in question, and into -1’s for the other labels. There is a Parallel loop across al loops and folds, and this is to supposed to make it faster.
c] Class weight support: The easiest way to do it is to convert to per sample weight and multiply it to the loss for each sample. But we had faced a small problem with the following three conditions together, class weight dict, solver liblinear and a multiclass problem, since liblinear does not support sample weights.
5. Problems that I faced.
a] The fit intercept = True case is found out to be considerably slower than the fit_intercept=False case. Gaels hunch was that it was because the intercept varies differently as compared to the data, We tried different things, such as preconditioning the intercept, i.e dividing the initial coefficient with the square root of the diagonal of the Hessian, but it did not work and it took one and a half days of time.
b] Using liblinear as an optimiser or a solver for the OvA case.
i] If we use liblinear as a solver, it means supplying the multi-label problem directly to liblinear.train. This would affect the parallelism and we are not sure if liblinear internally works the same way as we think we do. So after a hectic day of refactoring code, we finally decided (sigh) using liblinear as an optimiser is better (i.e we convert the labels to 1 and -1). For more details about Gaels comment, you can have a look at this https://github.com/scikit-learn/scikit-learn/pull/2862#issuecomment-49450285
Phew, this was a long post and I’m not sure if I typed everything as I wanted to. This is what I plan to accomplish in the coming month
1. Finish work on Larsmans PR
2. Look at glmnet for further improvements in the cd_fast code.
3. ElasticNet regularisation from Lightning.