Machine Learning – A few basic concepts

Hi, I took this informal Stanford machine learning course last year, and since my GSoC is finally over I thought I would take it a bit more seriously. The course was in Octave, and I started implementing a few of the algorithms using Python and NumPy. Here are a few must-know concepts that I learnt; I'm putting them here to save myself the effort of googling them again. I shall update this post as I come across more terms.

Types of algorithms
1] Supervised algorithms – Supervised algorithms are those in which you have both the input and the output data; you throw the data at the computer and ask it to predict the output for a given input.
E.g.) You have the historical data of, say, the runs scored for/against, the wickets lost/taken, and the number of 100's/50's for the last 500 matches India has played, and you want to predict whether India will win or lose the next match – you need a supervised algorithm.

2] Unsupervised algorithms – Unsupervised algorithms are used when you have the input data but not the output data; the computer figures out the structure in the input data by itself. A typical unsupervised algorithm is the K-means clustering algorithm.
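To give a feel for what an unsupervised algorithm does, here is a minimal NumPy sketch of K-means (the toy data, the number of iterations, and the random-seed initialisation are my own choices for illustration, not anything the course prescribes):

```python
import numpy as np

def kmeans(X, k, n_iters=10, seed=0):
    """A minimal K-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # Distance of every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two obvious clusters, around (0, 0) and (10, 10)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
labels, centroids = kmeans(X, 2)
```

With no output labels given, the algorithm still separates the two groups purely from the structure of the input.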

Types of ML problems based on output data
1] Classification – A classification problem is one where the output data takes discrete values, usually called labels. For example, if you want to predict whether it will rain tomorrow, there are two possibilities: it will rain or it will not. There is no in-between.

2] Regression – A regression problem is one where the output data is continuous. For example, if you want to predict the price of a house given its area and the number of rooms, it is a regression problem, since house prices can take any of a range of values. (This example and many of the examples from now on are shamelessly copied from Andrew Ng's lectures.)

Distributions
Since a major part of machine learning involves probability and statistics (fortunately or unfortunately), the input and output data are most often assumed to follow certain distributions.
1] Bernoulli distribution – In any typical binary classification problem, the output data follows a Bernoulli distribution, parametrised by $\phi$, and one of the goals of the ML algorithm is to find this parameter $\phi$:
$y \sim f(y; \phi) = \phi^{y}(1-\phi)^{1 - y}$. $y$ can take two values, 1 and 0. If $y = 1$, the probability is $\phi$, and if $y = 0$, it is $(1 - \phi)$.
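The formula above translates directly into code; this is just the PMF written out as a sketch (the function name and the example value of $\phi$ are mine):

```python
def bernoulli_pmf(y, phi):
    """Probability of y (0 or 1) under a Bernoulli(phi) distribution:
    phi**y * (1 - phi)**(1 - y)."""
    return phi ** y * (1 - phi) ** (1 - y)

bernoulli_pmf(1, 0.7)  # -> 0.7, i.e. phi
bernoulli_pmf(0, 0.7)  # -> roughly 0.3, i.e. 1 - phi
```

Plugging in $y = 1$ kills the $(1-\phi)$ factor and leaves $\phi$; plugging in $y = 0$ does the opposite.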

2] Normal distribution – In a regression problem, more often than not, since the output data is continuous, it is assumed to follow a normal distribution with parameters $\mu$ and $\sigma$. The distribution, which is a bell-shaped curve, is given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$. As can be seen from the formula, values closer to the mean have a higher probability, since the exponential decreases as $x$ moves away from $\mu$. Here $x$ is 1-D; when the input is n-D, or more specifically when there are n feature vectors, this generalises to a multivariate normal distribution, which we shall hopefully see later.
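Here is the density written out with NumPy, straight from the formula (the function name and the standard-normal example values are mine); evaluating it at a few points shows the bell shape, with the density falling off away from the mean:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, directly from the formula."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# For the standard normal (mu = 0, sigma = 1), density is highest at the mean
normal_pdf(0, 0, 1)  # about 0.399
normal_pdf(1, 0, 1)  # about 0.242
normal_pdf(2, 0, 1)  # about 0.054
```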

Miscellaneous
Normalisation is needed when the input data has n features and the features have very different scales. For example, say one feature is the area of the house and the other is the number of rooms; we would get a set of vectors like [20000, 1], [30000, 2], [50000, 1], and your algorithm would be dominated by the larger feature instead of treating both equally. So the input data is normalised: each feature has its mean subtracted and is then divided by its standard deviation.
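The subtract-the-mean, divide-by-the-standard-deviation step above is one line in NumPy (using the house vectors from the example):

```python
import numpy as np

# The house data from the example: [area, number of rooms]
X = np.array([[20000., 1.], [30000., 2.], [50000., 1.]])

# Normalise each feature (column): subtract its mean, divide by its std
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this, every column has mean 0 and standard deviation 1, so the area no longer swamps the room count.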

I think that's enough. Let's hopefully look at linear regression in the next blog post.