Hi, before I start writing anything, I would say sorry if this article (and the forthcoming articles hopefully) seems plagiarised. It mainly is from different sources, that I’ve written in my own words, mostly for my own reference, I would really love to see someone else benefit from this too, so here we go.
Lets start with some basic definitions (which makes people think you know a lot, but which is not really true).
1. L2 norm
Lets say I have a vector , then the L2 norm is just a really cool way of saying the Euclidean distance of the vector from the origin. For the given example, the L2 norm would be .
2. Training data
A set of n examples that you train your data upon. Usually a superscript is used to denote the th example, and a subscript is used to determine the th feature. For instance,
would refer to the th feature of the th example of your training data.
3. Hypothesis function
Why do you train data? Simple. To get a hypothesis function, that you can use to predict output. Each machine learning algorithm has its own hypothesis function, that needs to be predicted, which is the principle aim of training the data.
Setting these formal definitions aside, lets look at linear regression a bit in detail. As I mentioned in my previous post, regression is when your output is continuous (Continuous means it can take any value within a given range). Linear is because, umm .. well there is a linear fit to your
Hypothesis Function for linear regression: The hypothesis function for linear regression is given by or (where }, or more compactly . (From the hypothesis function, one can understand why its called linear)
Objective: Our goal is to predict in this hypothesis function. Now how do we do this? Every ML algo has some optimisation problem, which is used to output the parameters. For this particular algo, this is the problem, , which implies that the theta we need is the theta that we get by minimising the function in the bracket.( is the actual output, for each training example by the way.). In order to make the maths bit simpler, the function is written as . Lets call this the cost function .
Intutively think of it this way, the fit is actually a hyperplane (or line in case of 2-D) passing through the points and our aim is so as to minimise the sum of the distance of the points from the line (which means that is the best possible fit for the line). Fundamental calculus tells us that if we have a function say , and we want to minimise the function with respect to the variable , we differentiate the function with respect to zero. Similarly, we need to do find such that, .
Apart from doing this directly, there are two iterative methods to solve this optimisation problem.
1. Batch gradient algorithm.
Following assumptions are made:
1. Number of input examples are m, and the number of features for each example are n. This implies the actual number of features we take in the algorithm are n + 1, including the dummy 1 included (see above).
Lets do this step by step (in python and numpy).
a] Initialise theta to be of size n + 1
theta_init = np.zeros(n + 1)
b] We need to find each , such that is zero. We could this way, or use the recursive formula below iteratively, till converges
If you observe the above formula, you can see that when converges, is zero, which explains it.
Simplifying 1] we get.
Since this post is getting too long and as I am also bored of writing, lets look at how to implement this in Python and Stochastic Gradient Algorithm in detail in the next post.