Problem Formulation

Suppose we have a dataset giving the living areas and prices of 97 houses from Portland, Oregon:

House Prices
Living area (feet^2) Price (1000$s)
2104 400
1600 330
2400 369
1416 232
3000 540
\vdots \vdots

We can plot this data:

Our goal in linear regression is to predict a target value y starting from a vector of input values x \in \Re^n. For example, we might want to make predictions about the price of a house so that y represents the price of the house in dollars and the elements x_j of x represent “features” that describe the house (such as its size and the number of bedrooms). Suppose that we are given many examples of houses where the features for the i’th house are denoted x^{(i)} and the price is y^{(i)}.

Our goal is to find a function y = h(x) so that we have y^{(i)} \approx h(x^{(i)}) for each training example. If we succeed in finding a function h(x) like this, and we have seen enough examples of houses and their prices, we hope that the function h(x) will also be a good predictor of the house price even when we are given the features for a new house where the price is not known.

To find a function h(x) where y^{(i)} \approx h(x^{(i)}) we must first decide how to represent the function h(x). To start out we will use linear functions: h_\theta(x) = \sum_j \theta_j x_j = \theta^\top x. Here, h_\theta(x) represents a large family of functions parametrized by the choice of \theta. (We call this space of functions a “hypothesis class”.) With this representation for h, our task is to find a choice of \theta so that h_\theta(x^{(i)}) is as close as possible to y^{(i)}. In particular, we will search for a choice of h_\theta(x^{(i)}) that minimizes:

\huge J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^m \left( \theta^\top x^{(i)} - y^{(i)} \right)^2

This function is the “cost function” for our problem which measures how much error is incurred in predicting y^{(i)} for a particular choice of \theta. This may also be called a “loss”, “penalty” or “objective” function.

Function Minimization

We now want to find the choice of \theta that minimizes J(\theta) as given above. There are many algorithms for minimizing functions like this one and we will describe some very effective ones that are easy to implement yourself in a later section Gradient descent.
To do so, let’s use a search algorithm that starts with some “initial guess” for J(\theta):

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)

(This update is simultaneously performed for all values of j = 0, . . . , n.)
Here, \alpha is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.
In order to implement this algorithm, we have to work out what is the partial derivative term on the right hand side. Let’s first work it out for the case of if we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:

\frac{\partial}{\partial \theta_j}J(\theta) = \frac{\partial}{\partial \theta_j}\frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}

For a single training example, this gives the update rule:

\theta_j := \theta_j - \alpha\frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}

The rule is called the LMS update rule (LMS stands for “least mean squares”), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y^{(i)} - h_\theta(x^{(i)})); thus, for instance, if we are encountering a training example on which our prediction nearly matches the actual value of y^{(i)}, then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction h_\theta(x^{(i)}) has a large error (i.e., if it is very far from y^{(i)}.
We’d derived the LMS rule for when there was only a single training example. Replace it with the following algorithm:

Repeat until convergence {
\theta_j := \theta_j - \alpha\frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)} (for every j)

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of \theta that gradient descent went through.

Code Implementation

For now, let’s take for granted the fact that most commonly-used algorithms for function minimization require us to provide two pieces of information about J(\theta): We will need to write code to compute J(\theta) and \nabla_\theta J(\theta) on demand for any choice of \theta. After that, the rest of the optimization procedure to find the best choice of \theta will be handled by the optimization algorithm. (Recall that the gradient \nabla_\theta J(\theta) of a differentiable function J is a vector that points in the direction of steepest increase as a function of \theta — so it is easy to see how an optimization algorithm could use this to make a small change to \theta that decreases (or increase) J(\theta).

The above expression for J(\theta) given a training set of x^{(i)} and y^{(i)} is easy to implement in MATLAB to compute J(\theta) for any choice of \theta. The remaining requirement is to compute the gradient:

\nabla_\theta J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_1}\\\frac{\partial J(\theta)}{\partial \theta_2}\\\vdots\\\frac{\partial J(\theta)}{\partial \theta_n}\end{bmatrix}

Differentiating the cost function J(\theta) as given above with respect to a particular parameter \theta_j gives us:

\frac{\partial J(\theta)}{\partial \theta_j} = \sum_i x^{(i)}_j \left(h_\theta(x^{(i)}) - y^{(i)}\right)

We can use matrix to calculate the \nabla_\theta J(\theta), as following:

\nabla_\theta J(\theta) = \begin{bmatrix}x_{11} & \cdots & x_{1j}\\x_{21} & \cdots & x_{2j}\\\vdots & \vdots & \vdots\\x_{n1} & \cdots & x_{nj}\end{bmatrix}^\top \times \Bigg(\begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1j}\\x_{21} & x_{22} & \cdots & x_{2j}\\\vdots & \vdots & \vdots & \vdots\\x_{n1} & x_{n2} & \cdots & x_{nj}\end{bmatrix} \times \begin{bmatrix}\theta_1\\\theta_2\\\vdots\\\theta_j\end{bmatrix} - \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}\Bigg) \div m


\nabla_\theta J(\theta) =  \nabla X^\top \times (\nabla X \times \nabla \theta - \nabla Y) \div m

Now we can implement the Gradient Descent in Matlab:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESENT(X, y, theta, alpha, num_iters) updates theta by 
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples

for iter = 1:num_iters
    delta = X' * (X * theta - y) / m;
    theta = theta - alpha * delta;



For C++ implementation, I choose to use Armadillo C++ linear algebra library. There are quite a lot C++ linear algebra library, readers can refer to the Wiki to the Comparison of linear algebra libraries. The reason why I choose Armadillo is Armadillo’s language features are more likely to Matlab. So I can use Matlab to design and debug the algorithms of Machine Learning, then I can easily implement it in C++ for commercial use in good performance.
There is a tool which is written in python can help you to transfer your Matlab code into Armadillo C++ code. But it C++ cannot compiled normally because Matlab does not declare variables explicitly. So the tool can just help you for Armadillo C++ reference.

I try to implement Andrew Ng’s Linear Regression home work in C++ code, following is the implementation, I cut off the codes for plot. You can find it is so like Matlab code:

#include <armadillo>
#include <iostream>
#include <stdio.h>

using namespace std;
using namespace arma;

mat computeCost(const mat& X, const mat& y, const mat& theta)
	mat J;
	int m;
	m = y.n_rows;
	J = arma::sum((pow(((X*theta)-y), 2))/(2*m)) ;
	return J;

void gradientDescent(const mat&    X,
                     const mat&    y,
                           double  alpha,
                           int     num_iters,
                           mat&    theta)
	mat delta;
	int iter;
	int m ;
	m = y.n_rows;
	//vec J_history = arma::zeros<vec>(num_iters) ;
	for (iter = 0; iter < num_iters; iter++)
		delta = arma::trans(X)*(X*theta-y)/m ;
		theta = theta-alpha*delta ;
		//J_history(iter) = computeCost(X, y, theta)(0) ;

int main()
	mat data;
	mat X = data.col(0);
	mat y = data.col(1);
	int m = X.n_elem;
	cout << "m = " << m << endl;
	vec X_One(m);
	X.insert_cols(0, X_One);
	//cout << "after insert_cols:" << X.n_elem << endl;
	mat theta = arma::zeros<vec>(2);
	int iterations = 1500 ;
	double alpha = 0.01 ;
	mat J = computeCost(X, y, theta);
	gradientDescent(X, y, alpha, iterations, theta) ;
	printf("Theta found by gradient descent: \n") ;
	printf("%f %f \n", theta(0), theta(1)) ;
	return 0;


The result compared with Octave:



We run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain \theta_0 = -3.630291, \theta_1 = 1.166362. If we plot h_\theta as a function of x (area), along with the training data, we obtain the following figure:

3D  contours of a quadratic function:



UFLDL Tutorial – Linear Regression
Andrew Ng Machine Learning (Coursera)
Gradient Descent (Wiki)
Least Squares (Wiki)
Least mean squares filter
Armadillo Doc
Comparison of linear algebra libraries

I am trying to use the C++11 to support the smart pointer, but I find there is no shard_array in <memory>, so I try to use it in this way, and I know this maybe WRONG:

shared_ptr<int> sp(new int[10]);

Then run it, it coredumped as I guessed:

$ smart_ptr/Test_shared_array
Destructing a Foo with x=0
*** Error in `smart_ptr/Test_shared_array': munmap_chunk(): invalid pointer: 0x0000000001d58018 ***
[1]    14128 abort (core dumped)  smart_ptr/Test_shared_array

Use GDB to see more information:

(gdb) run
Starting program: /home/nasacj/projects/woodycxx/smart_ptr/Test_shared_array
Destructing a Foo with x=0
*** Error in `/home/nasacj/projects/woodycxx/smart_ptr/Test_shared_array': munmap_chunk(): invalid pointer: 0x0000000000603018 ***

Program received signal SIGABRT, Aborted.
0x00007ffff7530cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff7530cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff75340d8 in __GI_abort () at abort.c:89
#2  0x00007ffff756df24 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff767c6c8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff7578c87 in malloc_printerr (action=<optimized out>, str=0x7ffff767ca48 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:4996
#4  0x0000000000400d9f in _M_release (this=0x603050) at /usr/include/c++/4.8/bits/shared_ptr_base.h:144
#5  ~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/shared_ptr_base.h:546
#6  ~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/shared_ptr_base.h:781
#7  ~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/shared_ptr.h:93
#8  test () at Test_shared_array.cpp:30
#9  0x0000000000400bc9 in main () at Test_shared_array.cpp:36
(gdb) quit

Then I realize that in Boost, user should provide a deleter to shared_ptr:

Then I find this in stackoverflow:

By default, shared_ptr will call delete on the managed object when no more references remain to it. However, when you allocate using new[] you need to call delete[], and not delete, to free the resource.

In order to correctly use shared_ptr with an array, you must supply a custom deleter.

template< typename T >
struct array_deleter
  void operator ()( T const * p)
    delete[] p;

Create the shared_ptr as follows

std::shared_ptr<int> sp( new int[10], array_deleter<int>() );

Now shared_ptr will correctly call delete[] when destroying the managed object.

With C++11, you can also use a lambda instead of the functor.

std::shared_ptr<int> sp( new int[10], []( int *p ) { delete[] p; } );

Also, unless you actually need to share the managed object, a unique_ptr is better suited for this task, since it has a partial specialization for array types.

std::unique_ptr<int[]> up( new int[10] ); // this will correctly call delete[]

Now there come the shared array STD version in practice:

//#include "shared_array.h"
#include <memory>
#include <iostream>

using namespace std;

struct Foo
    Foo() : x(0) {}
	Foo( int _x ) : x(_x) {}
	~Foo() { std::cout << "Destructing a Foo with x=" << x << "\n"; }
	int x;
	/* ... */

template< typename T >
struct array_deleter
  void operator ()( T const * p)
    delete[] p;

//typedef woodycxx::smart_prt::shared_array<Foo> FooArray;
typedef shared_ptr<Foo> FooArray;

void test()
	FooArray(new Foo[10], array_deleter<Foo>());

int main()
	return 0;

The Output:

$ ./Test_shared_array
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0
Destructing a Foo with x=0

Smart Pointer Programming Techniques

Using incomplete classes for implementation hiding
The “Pimpl” idiom
Using abstract classes for implementation hiding
Preventing delete px.get()
Using a shared_ptr to hold a pointer to an array
Encapsulating allocation details, wrapping factory functions
Using a shared_ptr to hold a pointer to a statically allocated object
Using a shared_ptr to hold a pointer to a COM object
Using a shared_ptr to hold a pointer to an object with an embedded reference count
Using a shared_ptr to hold another shared ownership smart pointer
Obtaining a shared_ptr from a raw pointer
Obtaining a shared_ptr (weak_ptr) to this in a constructor
Obtaining a shared_ptr to this
Using shared_ptr as a smart counted handle
Using shared_ptr to execute code on block exit
Using shared_ptr<void> to hold an arbitrary object
Associating arbitrary data with heterogeneous shared_ptr instances
Using shared_ptr as a CopyConstructible mutex lock
Using shared_ptr to wrap member function calls
Delayed deallocation
Weak pointers to objects not managed by a shared_ptr

When I am reading the UNIX Network Programming V3 :

Three-Way Handshake

The following scenario occurs when a TCP connection is established:

  1. The server must be prepared to accept an incoming connection. This is normally done by calling socket, bind, and listen and is called a passive open.

  2. The client issues an active open by calling connect. This causes the client TCP to send a “synchronize” (SYN) segment, which tells the server the client’s initial sequence number for the data that the client will send on the connection. Normally, there is no data sent with the SYN; it just contains an IP header, a TCP header, and possible TCP options (which we will talk about shortly).

  3. The server must acknowledge (ACK) the client’s SYN and the server must also send its own SYN containing the initial sequence number for the data that the server will send on the connection. The server sends its SYN and the ACK of the client’s SYN in a single segment.

  4. The client must acknowledge the server’s SYN.

I am wonderring what is the meaning of a PASSIVE open?

Do some search and take some notes here:

Same question from Stackoverflow:

What is the difference between ACTIVE and PASSIVE connect in RFC 1006 TCP connections?

It’s explained here:

A passive OPEN request means that the process wants to accept incoming connection requests rather than attempting to initiate a connection.

In short passive OPEN are listen() and active OPEN are connect().


The TCP/IP Guide

TCP Connection Preparation: Transmission Control Blocks (TCBs) and Passive and Active Socket OPENs

Active and Passive OPENs

TCP/IP is based on the client/server model of operation, and TCP connection setup is based on the existence of these roles as well. The client and server each prepare for the connection by performing an OPEN operation. However, there are two different kinds of OPEN:

  • Active OPEN: A client process using TCP takes the “active role” and initiates the connection by actually sending a TCP message to start the connection (a SYN message).
  • Passive OPEN: A server process designed to use TCP, however, takes a more “laid-back” approach. It performs a passive OPEN by contacting TCP and saying “I am here, and I am waiting for clients that may wish to talk to me to send me a message on the following port number”. The OPEN is called passive because aside from indicating that the process is listening, the server process does nothing.

A passive OPEN can in fact specify that the server is waiting for an active OPEN from a specific client, though not all TCP/IP APIs support this capability. More commonly, a server process is willing to accept connections from all comers. Such a passive OPEN is said to be unspecified.

Key Concept: A client process initiates a TCP connection by performing an active OPEN, sending a SYN message to a server. A server process using TCP prepares for an incoming connection request by performing a passive OPEN. Both devices create for each TCP session a data structure used to hold important data related to the connection, called a transmission control block (TCB).

Preparation For Connection

Both the client and the server create the TCB for the connection at the time that they perform the OPEN. The client already knows the IP addresses and port numbers for both the client process and the server process it is trying to reach, so it can use these to uniquely identify the connection and the TCB that goes with it.

For the server, the concept of a TCB at this stage of the game is a bit more complex. If the server is in fact waiting for a particular client, it can identify the connection using its own socket and the socket of the client for which it is waiting. Normally, however, the server doesn’t know what client is trying to reach it. In fact, it could be contacted by more than one client nearly at the same time.

In this case, the server creates a TCB with an unspecified (zero) client socket number, and waits for an active OPEN to be received. It then binds the socket number of the client to the TCB for the passive OPEN as part of the connection process. To allow it to handle multiple incoming connections, the server process may in fact perform several unspecified passive OPENs simultaneously.

The transmission control block for a connection is maintained throughout the connection and destroyed when the connection is completely terminated and the device returns to the CLOSED state. TCP does include a procedure to handle the situation where both devices perform an active OPEN simultaneously. This is discussed in more detail in the next topic on the connection establishment process.