Problem Formulation

Suppose we have a dataset giving the living areas and prices of 97 houses from Portland, Oregon:

House Prices
Living area ($feet^2$) Price (1000\$s)
2104 400
1600 330
2400 369
1416 232
3000 540
$\vdots$ $\vdots$

We can plot this data:

Our goal in linear regression is to predict a target value $y$ starting from a vector of input values $x \in \Re^n$. For example, we might want to make predictions about the price of a house so that $y$ represents the price of the house in dollars and the elements $x_j$ of $x$ represent “features” that describe the house (such as its size and the number of bedrooms). Suppose that we are given many examples of houses where the features for the i’th house are denoted $x^{(i)}$ and the price is $y^{(i)}$.

Our goal is to find a function $y = h(x)$ so that we have $y^{(i)} \approx h(x^{(i)})$ for each training example. If we succeed in finding a function $h(x)$ like this, and we have seen enough examples of houses and their prices, we hope that the function $h(x)$ will also be a good predictor of the house price even when we are given the features for a new house where the price is not known.

To find a function $h(x)$ where $y^{(i)} \approx h(x^{(i)})$ we must first decide how to represent the function $h(x)$. To start out we will use linear functions: $h_\theta(x) = \sum_j \theta_j x_j = \theta^\top x$. Here, $h_\theta(x)$ represents a large family of functions parametrized by the choice of $\theta$. (We call this space of functions a “hypothesis class”.) With this representation for $h$, our task is to find a choice of $\theta$ so that $h_\theta(x^{(i)})$ is as close as possible to $y^{(i)}$. In particular, we will search for a choice of $h_\theta(x^{(i)})$ that minimizes:

$\huge J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^m \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$

This function is the “cost function” for our problem which measures how much error is incurred in predicting $y^{(i)}$ for a particular choice of $\theta$. This may also be called a “loss”, “penalty” or “objective” function.

Function Minimization

We now want to find the choice of $\theta$ that minimizes $J(\theta)$ as given above. There are many algorithms for minimizing functions like this one and we will describe some very effective ones that are easy to implement yourself in a later section Gradient descent.
To do so, let’s use a search algorithm that starts with some “initial guess” for $J(\theta)$:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$

(This update is simultaneously performed for all values of $j$ = 0, . . . , n.)
Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.
In order to implement this algorithm, we have to work out what is the partial derivative term on the right hand side. Let’s first work it out for the case of if we have only one training example $(x, y)$, so that we can neglect the sum in the definition of $J$. We have:

$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{\partial}{\partial \theta_j}\frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$

For a single training example, this gives the update rule:

$\theta_j := \theta_j - \alpha\frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$

The rule is called the LMS update rule (LMS stands for “least mean squares”), and is also known as the Widrow-Hoﬀ learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$; thus, for instance, if we are encountering a training example on which our prediction nearly matches the actual value of $y^{(i)}$, then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$.
We’d derived the LMS rule for when there was only a single training example. Replace it with the following algorithm:

Repeat until convergence {
$\theta_j := \theta_j - \alpha\frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$ (for every j)
}

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through.

Code Implementation

For now, let’s take for granted the fact that most commonly-used algorithms for function minimization require us to provide two pieces of information about $J(\theta)$: We will need to write code to compute $J(\theta)$ and $\nabla_\theta J(\theta)$ on demand for any choice of $\theta$. After that, the rest of the optimization procedure to find the best choice of $\theta$ will be handled by the optimization algorithm. (Recall that the gradient $\nabla_\theta J(\theta)$ of a differentiable function $J$ is a vector that points in the direction of steepest increase as a function of $\theta$ — so it is easy to see how an optimization algorithm could use this to make a small change to $\theta$ that decreases (or increase) $J(\theta)$.

The above expression for $J(\theta)$ given a training set of $x^{(i)}$ and $y^{(i)}$ is easy to implement in MATLAB to compute $J(\theta)$ for any choice of $\theta$. The remaining requirement is to compute the gradient:

$\nabla_\theta J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_1}\\\frac{\partial J(\theta)}{\partial \theta_2}\\\vdots\\\frac{\partial J(\theta)}{\partial \theta_n}\end{bmatrix}$

Differentiating the cost function $J(\theta)$ as given above with respect to a particular parameter $\theta_j$ gives us:

$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_i x^{(i)}_j \left(h_\theta(x^{(i)}) - y^{(i)}\right)$

We can use matrix to calculate the $\nabla_\theta J(\theta)$, as following:

$\nabla_\theta J(\theta) = \begin{bmatrix}x_{11} & \cdots & x_{1j}\\x_{21} & \cdots & x_{2j}\\\vdots & \vdots & \vdots\\x_{n1} & \cdots & x_{nj}\end{bmatrix}^\top \times \Bigg(\begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1j}\\x_{21} & x_{22} & \cdots & x_{2j}\\\vdots & \vdots & \vdots & \vdots\\x_{n1} & x_{n2} & \cdots & x_{nj}\end{bmatrix} \times \begin{bmatrix}\theta_1\\\theta_2\\\vdots\\\theta_j\end{bmatrix} - \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}\Bigg) \div m$

So:

$\nabla_\theta J(\theta) = \nabla X^\top \times (\nabla X \times \nabla \theta - \nabla Y) \div m$

Now we can implement the Gradient Descent in Matlab:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples

for iter = 1:num_iters
delta = X' * (X * theta - y) / m;
theta = theta - alpha * delta;

end
end


For C++ implementation, I choose to use Armadillo C++ linear algebra library. There are quite a lot C++ linear algebra library, readers can refer to the Wiki to the Comparison of linear algebra libraries. The reason why I choose Armadillo is Armadillo’s language features are more likely to Matlab. So I can use Matlab to design and debug the algorithms of Machine Learning, then I can easily implement it in C++ for commercial use in good performance.
There is a tool which is written in python can help you to transfer your Matlab code into Armadillo C++ code. But it C++ cannot compiled normally because Matlab does not declare variables explicitly. So the tool can just help you for Armadillo C++ reference.

I try to implement Andrew Ng’s Linear Regression home work in C++ code, following is the implementation, I cut off the codes for plot. You can find it is so like Matlab code:

#include <armadillo>
#include <iostream>
#include <stdio.h>

using namespace std;
using namespace arma;

mat computeCost(const mat& X, const mat& y, const mat& theta)
{
mat J;
int m;
m = y.n_rows;
J = arma::sum((pow(((X*theta)-y), 2))/(2*m)) ;
return J;
}

const mat&    y,
double  alpha,
int     num_iters,
mat&    theta)
{
mat delta;
int iter;
int m ;
m = y.n_rows;
//vec J_history = arma::zeros<vec>(num_iters) ;
for (iter = 0; iter < num_iters; iter++)
{
delta = arma::trans(X)*(X*theta-y)/m ;
theta = theta-alpha*delta ;
//J_history(iter) = computeCost(X, y, theta)(0) ;
}
//J_history.print("J_history");
}

int main()
{
mat data;
//data.print("ex1data1:");
mat X = data.col(0);
mat y = data.col(1);
//X.print("X:");
//y.print("y:");

int m = X.n_elem;
cout << "m = " << m << endl;

vec X_One(m);
X_One.ones();
X.insert_cols(0, X_One);
//X.print("X:");
//cout << "after insert_cols:" << X.n_elem << endl;

mat theta = arma::zeros<vec>(2);
int iterations = 1500 ;
double alpha = 0.01 ;

mat J = computeCost(X, y, theta);
J.print("J:");

gradientDescent(X, y, alpha, iterations, theta) ;
printf("Theta found by gradient descent: \n") ;
printf("%f %f \n", theta(0), theta(1)) ;

return 0;
}



The result compared with Octave:

Plot

We run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain $\theta_0$ = -3.630291, $\theta_1$ = 1.166362. If we plot $h_\theta$ as a function of $x$ (area), along with the training data, we obtain the following figure:

3D  contours of a quadratic function: