# WoodyLogger

• Low-latency C++11 asynchronous logging library.
• Lock-free and optimized for multi-core systems, so it is very fast. See the latency benchmark.
• Supports multiple log levels: FAULT, ERROR, WARN, INFO, DEBUG, TRACE.
• Configurable to output to a local file, stdout, or both.
• Colorful stdout output for the different log levels on UNIX terminals.

# Usage

#include "WoodyLogger.h"
#include <unistd.h>

int main() {
    WOODY_LOGGER_START();
    //WOODY_LOGGER_INIT("/dev/null", false);
    WOODY_LOGGER_LEVEL(woodycxx::DEBUG);

    LOG_TRACE("Test LOG_TRACE: %d", 12345);  // TRACE is weaker than DEBUG, so it is not output
    LOG_DEBUG("Test LOG_DEBUG: %d", 23456);
    LOG_ERROR("Test LOG_ERROR: %d", 6543210);
    LOG_WARN ("Test LOG_WARN : %07d", 123);
    LOG_INFO ("Test LOG_INFO : %.5f", 1.0/3.0);

    usleep(1000);  // give the consumer thread time to flush
    return 0;
}


## API Description

### Macro Define: WOODY_LOGGER

In “WoodyLogger.h” there is a macro definition, #define WOODY_LOGGER. The user can use it to control the log API calls at compile time: if WOODY_LOGGER is not defined, no WOODY_LOGGER_XXXX call is compiled in. So the user can leave WoodyLogger calls in the code for testing and debugging, and then undefine the WOODY_LOGGER macro to turn WoodyLogger off.

### WOODY_LOGGER_START

Starts WoodyLogger, which launches the consumer thread. It only needs to be called once. The consumer thread keeps running until WOODY_LOGGER_STOP() is called.

WOODY_LOGGER_START() should be called in the main() function, before any LOG_XXXX call.

### WOODY_LOGGER_INIT

The WOODY_LOGGER_INIT macro configures the log output. It takes 2 parameters:

("file_path_name", is_enable_output_to_stdout)

e.g.

WOODY_LOGGER_INIT("./log.txt");

This configures the log output to the “./log.txt” file and also to stdout. If the user wants to turn off the stdout output:

WOODY_LOGGER_INIT("./log.txt", false); //disable stdout

The user can also pass a FILE handle (e.g. stdout, stderr):

WOODY_LOGGER_INIT(stderr);

The user can direct the file name to "/dev/null" and disable stdout; then nothing is output at all:

WOODY_LOGGER_INIT("/dev/null", false);

This is used in the benchmark test. If WOODY_LOGGER_INIT is not called, the logger outputs to stdout by default, as the demo shows.

### WOODY_LOGGER_LEVEL

There are 6 log levels, in strong-to-weak order:

FAULT, ERROR, WARN, INFO, DEBUG, TRACE

The logger only outputs messages whose level is at least as strong as the currently configured level.

e.g.

If the level is set to INFO, then DEBUG and TRACE messages are not output.

WOODY_LOGGER_LEVEL() can be called from any thread, which means the log level can be controlled dynamically. For example, the user can run a thread that listens on a port as a server; a client process can then connect and reconfigure the level at runtime.

### LOG_XXXX

LOG_TRACE, LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR, LOG_FAULT

#### The output format

##### DEBUG, TRACE Level:

[YYYY-MM-DD HH:mm:ss.nanoseconds] [LEVEL] [thread_id] [__FILE__: __LINE__] message

##### FAULT, ERROR, WARN, INFO Level:

[YYYY-MM-DD HH:mm:ss.nanoseconds] [LEVEL] [thread_id] message

### WOODY_LOGGER_STOP

WoodyLogger starts a consumer thread that prints the log messages. The thread keeps running until the user calls WOODY_LOGGER_STOP().

# Benchmark

When I searched for “Low-Latency C++ Log” on GitHub, I found a very fast C++ logging project: NanoLog. Its own benchmarks show it to be much faster than spdlog and others. So I reuse the benchmark test code from NanoLog: if WoodyLogger is shown to be faster than NanoLog, it should be faster than the others as well. NanoLog’s design is very similar to WoodyLogger’s, but WoodyLogger is optimized for multi-core systems, so it has an advantage in multi-threaded environments.

## Benchmark Code Compiling Option And System environment

The benchmark test code was run on my MacBook Pro laptop:

Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
Total Number of Cores:  4


### Test Code

benchmark.cpp

WoodyLogger and NanoLog are both compiled at the -O2 optimization level.

# Design

WoodyLogger’s aim is to implement a logging tool that can be applied in HFT (High-Frequency Trading) systems. So the design of WoodyLogger focuses on low latency, NOT high throughput: WoodyLogger does not try to write as many log messages to file as possible, but instead tries to cut the per-call latency down to as few nanoseconds as possible.

## Normal Design

### What makes the high latency of a Log tool?

• In the usual design of a “fast” logging tool, I/O cost is the first thing that comes to developers’ minds.

Usually, developers separate the log-writing operation and the log-calling operation into two threads. A ring buffer stores the messages produced by calls to the log API. Each caller is a producer that writes log messages into the ring buffer, and a consumer thread reads the ring buffer and prints the messages to a file, stdout, etc.
This way a call to the log API is not blocked by I/O delay, which is a great first step toward reducing latency. But it is only the first step; there is more to do.
• Many developers do not realize that string-formatting functions are a major source of latency, such as std::stringstream in C++ or sprintf in C. See FMT for reference.

The main latency of string formatting is converting numbers into strings, which is very slow in those functions, especially for floating-point values.
Although there are very fast integer- and floating-point-conversion implementations such as double-conversion, they are still not as fast as a memory copy. So do NOT convert or format any numbers into strings in the log call; do it in the reading/output thread instead.

Normally a logging tool is designed as follows:

The key latency problem is the LOCK. With more caller threads, they block each other, even if the lock is a spin-lock.

## WoodyLogger Design

### Lock-Free Queue

The lock can be removed by using a lock-free queue. With only one producer and one consumer, no lock is needed at all.

To support multiple threads with a lock-free ring buffer, every thread needs its own thread_local lock-free ring buffer. The consumer thread maintains a std::vector to access every ring buffer (LockFreeQueue) of the threads. The life cycle of a LockFreeQueue is managed by its thread: when the thread exits, its LockFreeQueue is destroyed and removed from the vector.

The WoodyLogger design looks like the following diagram shows:

## Github Reference

The code of WoodyLogger is not open-sourced, but the library binary can be downloaded from GitHub:

https://github.com/nasacj/WoodyLogger_Design

## 央迈勇

Photographed in Oct. 2014

https://zh.wikipedia.org/wiki/%E5%A4%AE%E8%BF%88%E5%8B%87

## 夏诺多吉

Photographed in Oct. 2014

# Problem Formulation

Suppose we have a dataset giving the living areas and prices of 97 houses from Portland, Oregon:

House Prices
| Living area ($feet^2$) | Price (1000$s) |
| --- | --- |
| 2104 | 400 |
| 1600 | 330 |
| 2400 | 369 |
| 1416 | 232 |
| 3000 | 540 |
| $\vdots$ | $\vdots$ |

We can plot this data. Our goal in linear regression is to predict a target value $y$ starting from a vector of input values $x \in \Re^n$. For example, we might want to make predictions about the price of a house so that $y$ represents the price of the house in dollars and the elements $x_j$ of $x$ represent “features” that describe the house (such as its size and the number of bedrooms). Suppose that we are given many examples of houses where the features for the i’th house are denoted $x^{(i)}$ and the price is $y^{(i)}$. Our goal is to find a function $y = h(x)$ so that we have $y^{(i)} \approx h(x^{(i)})$ for each training example. If we succeed in finding such a function $h(x)$, and we have seen enough examples of houses and their prices, we hope that $h(x)$ will also be a good predictor of the house price even when we are given the features for a new house where the price is not known.

To find a function $h(x)$ where $y^{(i)} \approx h(x^{(i)})$ we must first decide how to represent the function $h(x)$. To start out we will use linear functions: $h_\theta(x) = \sum_j \theta_j x_j = \theta^\top x$. Here, $h_\theta(x)$ represents a large family of functions parametrized by the choice of $\theta$. (We call this space of functions a “hypothesis class”.) With this representation for $h$, our task is to find a choice of $\theta$ so that $h_\theta(x^{(i)})$ is as close as possible to $y^{(i)}$. In particular, we will search for a choice of $\theta$ that minimizes:

$\huge J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} \sum_{i=1}^m \left( \theta^\top x^{(i)} - y^{(i)} \right)^2$

This function is the “cost function” for our problem, which measures how much error is incurred in predicting $y^{(i)}$ for a particular choice of $\theta$.
This may also be called a “loss”, “penalty” or “objective” function.

# Function Minimization

We now want to find the choice of $\theta$ that minimizes $J(\theta)$ as given above. There are many algorithms for minimizing functions like this one, and we will describe some very effective ones that are easy to implement yourself in a later section on gradient descent. For now, let’s use a search algorithm that starts with some “initial guess” for $\theta$ and repeatedly applies the update:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$

(This update is simultaneously performed for all values of $j = 0, \ldots, n$.) Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.

In order to implement this algorithm, we have to work out the partial derivative term on the right-hand side. Differentiating the cost function, we have:

$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{\partial}{\partial \theta_j}\frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$

This gives the update rule:

$\theta_j := \theta_j - \alpha\frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$

The rule is called the LMS update rule (LMS stands for “least mean squares”), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive.
For instance, the magnitude of the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$; thus, if we encounter a training example on which our prediction nearly matches the actual value of $y^{(i)}$, there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).

Running this update over the whole training set gives batch gradient descent:

Repeat until convergence {
$\theta_j := \theta_j - \alpha\frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$ (for every $j$)
}

The ellipses in the (omitted) figure are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through.

# Code Implementation

For now, let’s take for granted the fact that most commonly-used algorithms for function minimization require us to provide two pieces of information about $J(\theta)$: we will need to write code to compute $J(\theta)$ and $\nabla_\theta J(\theta)$ on demand for any choice of $\theta$. After that, the rest of the optimization procedure to find the best choice of $\theta$ will be handled by the optimization algorithm. (Recall that the gradient $\nabla_\theta J(\theta)$ of a differentiable function $J$ is a vector that points in the direction of steepest increase as a function of $\theta$, so it is easy to see how an optimization algorithm could use this to make a small change to $\theta$ that decreases (or increases) $J(\theta)$.) The above expression for $J(\theta)$ given a training set of $x^{(i)}$ and $y^{(i)}$ is easy to implement in MATLAB for any choice of $\theta$.
The remaining requirement is to compute the gradient:

$\nabla_\theta J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_1}\\\frac{\partial J(\theta)}{\partial \theta_2}\\\vdots\\\frac{\partial J(\theta)}{\partial \theta_n}\end{bmatrix}$

Differentiating the cost function $J(\theta)$ as given above with respect to a particular parameter $\theta_j$ gives us:

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_i x^{(i)}_j \left(h_\theta(x^{(i)}) - y^{(i)}\right)$

Writing the training examples as the rows of a matrix $X$ and the targets as a vector $y$, the whole gradient can be computed with one matrix expression:

$\nabla_\theta J(\theta) = X^\top \left( X\theta - y \right) / m$

Now we can implement gradient descent in Matlab:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples

for iter = 1:num_iters
    delta = X' * (X * theta - y) / m;
    theta = theta - alpha * delta;
end

end

For the C++ implementation, I chose the Armadillo C++ linear algebra library. There are quite a lot of C++ linear algebra libraries; readers can refer to the Wikipedia comparison of linear algebra libraries. The reason I chose Armadillo is that its language features are the most similar to Matlab. So I can use Matlab to design and debug machine learning algorithms, then easily re-implement them in C++ for production use with good performance. There is a tool written in Python that can help translate Matlab code into Armadillo C++ code, but the generated C++ usually cannot compile as-is, because Matlab does not declare variable types explicitly; so the tool is best used as an Armadillo C++ reference. I implemented Andrew Ng’s linear regression homework in C++; the implementation follows (I cut out the plotting code). You can see how similar it is to the Matlab code:

#include <armadillo>
#include <iostream>
#include <stdio.h>

using namespace std;
using namespace arma;

mat computeCost(const mat& X, const mat& y, const mat& theta)
{
    int m = y.n_rows;
    mat J = arma::sum((pow(((X * theta) - y), 2)) / (2 * m));
    return J;
}

void gradientDescent(const mat& X, const mat& y, double alpha, int num_iters, mat& theta)
{
    mat delta;
    int m = y.n_rows;
    //vec J_history = arma::zeros<vec>(num_iters);
    for (int iter = 0; iter < num_iters; iter++)
    {
        delta = arma::trans(X) * (X * theta - y) / m;
        theta = theta - alpha * delta;
        //J_history(iter) = computeCost(X, y, theta)(0);
    }
    //J_history.print("J_history");
}

int main()
{
    mat data;
    data.load("ex1data1.txt");
    mat X = data.col(0);
    mat y = data.col(1);
    int m = X.n_elem;
    cout << "m = " << m << endl;

    vec X_One(m);
    X_One.ones();
    X.insert_cols(0, X_One);   // prepend the intercept column of ones

    mat theta = arma::zeros<vec>(2);
    int iterations = 1500;
    double alpha = 0.01;

    mat J = computeCost(X, y, theta);
    J.print("J:");

    gradientDescent(X, y, alpha, iterations, theta);
    printf("Theta found by gradient descent: \n");
    printf("%f %f \n", theta(0), theta(1));
    return 0;
}

The result compared with Octave:

## Plot

We run batch gradient descent to fit $\theta$ on our previous dataset, to learn to predict housing price as a function of living area, and we
obtain $\theta_0 = -3.630291$, $\theta_1 = 1.166362$. If we plot $h_\theta$ as a function of $x$ (area), along with the training data, we obtain the following figure (3D contours of a quadratic function):

# Reference

## Weak_ptr solves Cyclic Reference Problem of shared_ptr when designing Interfaces in Object Models

My previous post “Use shared_ptr inheritance rightly when design and use interfaces” has an obvious problem: there is a cyclic reference between AbstractSocketImpl and the socket I/O streams.

typedef shared_ptr<InputStream> InputStreamPtr;
typedef shared_ptr<OutputStream> OutputStreamPtr;
typedef shared_ptr<AbstractSocketImpl> AbstractSocketImplPtr;

class AbstractSocketImpl : public AbstractSocket
{
private:
    InputStreamPtr inputStreamPtr;
    OutputStreamPtr outputStreamPtr;
};

class SocketInputStream : public InputStream
{
private:
    AbstractSocketImplPtr socket_impl;
};

When two classes hold shared_ptr's to each other, the cyclic reference causes a memory leak. When the shared_ptr objects of AbstractSocketImpl and SocketInputStream are created, the reference counts reached through SocketInputStreamPtr and AbstractSocketImplPtr include each other, so the ref_count of each is 2:

auto socketImpl = make_shared<AbstractSocketImpl>(address);
// socketImpl.ref_count + 1, now socketImpl.ref_count == 1
auto inputstream = socketImpl->getInputStream();
// socketImpl.ref_count + 1 (caused by inputstream.socket_impl), now socketImpl.ref_count == 2.
// inputstream.ref_count == 2, because inputstream holds one reference and
// inputstream.socket_impl.inputStreamPtr holds inputstream.

The relationship between AbstractSocketImpl and SocketInputStream should be composition: a SocketInputStream is initialized from an AbstractSocketImpl, and no operation on SocketInputStream is meaningful once the AbstractSocketImpl is destroyed.
I was incorrectly making a strong reference from AbstractSocketImpl to SocketInputStream: I was trying to manage the memory life cycle of SocketInputStream from AbstractSocketImpl, coupling them tightly. The SocketInputStream is created by AbstractSocketImpl, and my mistake was to use a shared_ptr (strong reference) to track the SocketInputStream object. But in fact, the SocketInputStream object is managed by the user, not by AbstractSocketImpl. AbstractSocketImpl’s responsibility is to make sure it creates only ONE SocketInputStream object, NOT to recycle it. So AbstractSocketImpl does not need to HOLD a SocketInputStream; it only needs to check whether the SocketInputStream it created is still alive: if yes, return the pointer; if not, create a new one from itself. So I use weak_ptr to hold a weak reference to SocketInputStream. Following are the implementation changes:

typedef weak_ptr<InputStream> InputStreamWeakPtr;
typedef weak_ptr<OutputStream> OutputStreamWeakPtr;

class AbstractSocketImpl : public AbstractSocket
{
    InputStreamWeakPtr wkInputStreamPtr;
    OutputStreamWeakPtr wkOutputStreamPtr;
};

InputStreamPtr AbstractSocketImpl::getInputStream()
{
    //return make_shared<SocketInputStream>(shared_from_this());
    if ( wkInputStreamPtr.expired() )
    {
        InputStreamPtr inputstrPtr = make_shared<SocketInputStream>(shared_from_this());
        wkInputStreamPtr = inputstrPtr;
        return inputstrPtr;
    }
    return wkInputStreamPtr.lock();
}

## Calculate Fibonacci At C++ Compile Time – C++ Template metaprogramming

I was asked a question about calculating a Fibonacci number at compile time, not at run time, in C++. Initially I had no idea how to solve this problem: how can the calculation happen at compile time? The key to the solution is: use templates. Template metaprogramming allows compile-time class generation, and can also perform polymorphism statically, as in the well-known Curiously Recurring Template Pattern (CRTP).
So the solution is listed as follows:

template<int N>
class Fibonacci
{
public:
    enum { value = Fibonacci<N-1>::value + Fibonacci<N-2>::value };
};

template<>
class Fibonacci<1>
{
public:
    enum { value = 1 };
};

template<>
class Fibonacci<0>
{
public:
    enum { value = 0 };
};

int main()
{
    int i = Fibonacci<6>::value;
    return i;
}

Compile it to assembly with “g++ -O2 -S Fibonacci_template.cpp”:

_main:                                  ## @main
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp0:
	.cfi_def_cfa_offset 16
Ltmp1:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp2:
	.cfi_def_cfa_register %rbp
	movl	$8, %eax
popq	%rbp
retq
.cfi_endproc


You can see that the assembly simply loads the constant 8 (movl $8, %eax): the compiler computed Fibonacci(6) = 8 entirely at compile time.

## Use shared_ptr inheritance rightly when design and use interfaces

If we want to use polymorphism in C++, we normally use a base interface pointer to point to an object of an implementing class. That is quite easy with a plain pointer. But if we use a smart pointer, holding a derived object through a shared_ptr of the base class type, things get a little more complicated.

At first I thought it could simply go like this:

//Point a Base pointer to a Derived object
//Class "Derived" inherits from class "Base"
typedef std::shared_ptr<Base> BasePtr;
BasePtr baseptr(new Derived());
//Then call baseptr->operations


## Principle

The principle of using smart pointers to prevent memory leaks is: always use a named smart pointer variable to hold the result of new.

But one thing to note very carefully is that you must not use more than ONE shared_ptr to own the same result of new:

int* ptr = new int;
shared_ptr<int> p1(ptr);
shared_ptr<int> p2(ptr); //logic error


Because each time we construct a shared_ptr object from a raw pointer, it maintains 2 pointers internally:
1. A <T*> pointer to the object you new'd on the heap;
2. A pointer to the control block (sp_count), which holds the reference count shared by all the shared_ptr objects that own the <T*>.

Each time you use the copy constructor or operator=, shared_ptr adjusts the reference count by plus 1 or minus 1. So if 2 independently constructed shared_ptr objects hold the same T*, the pointer <T*> will be deleted twice.

The same problem appears when we deal with the this pointer. The solution is to use shared_from_this():

#include <memory>
#include <iostream>

struct Good: std::enable_shared_from_this<Good>
{
std::shared_ptr<Good> getptr() {
return shared_from_this();
}
};

// Bad: calls shared_ptr<Bad>(this), creating a second, independent owner
struct Bad
{
    std::shared_ptr<Bad> getptr() {
        return std::shared_ptr<Bad>(this);
    }
    ~Bad() { std::cout << "Bad::~Bad() called\n"; }
};

int main()
{
// Good: the two shared_ptr's share the same object
std::shared_ptr<Good> gp1(new Good);
std::shared_ptr<Good> gp2 = gp1->getptr();
std::cout << "gp2.use_count() = " << gp2.use_count() << '\n';

// Bad, each shared_ptr thinks it's the only owner of the object
std::shared_ptr<Bad> bp1(new Bad);
std::shared_ptr<Bad> bp2 = bp1->getptr();
std::cout << "bp2.use_count() = " << bp2.use_count() << '\n';
} // UB: double-delete of Bad


## Interface Design

When I was trying to design an AbstractSocket that can return socket I/O streams, so that users can receive/send messages through the socket via I/O stream smart pointers, just like the “Java way”:

AbstractSocketImpl implements the interface AbstractSocket. It has getInputStream() and getOutputStream(), which return the SocketInputStream and SocketOutputStream. AbstractSocketImpl holds shared_ptr's to InputStream and OutputStream, which are implemented from AbstractSocket. SocketInputStream and SocketOutputStream are constructed by passing an AbstractSocketImpl smart pointer into their constructors. So when AbstractSocketImpl initializes the socket I/O streams, it has to share its this pointer. To use shared_ptr correctly here, we must make AbstractSocketImpl inherit from std::enable_shared_from_this:

InputStreamPtr AbstractSocketImpl::getInputStream()
{
if ( !inputStreamPtr )
{
inputStreamPtr = make_shared<SocketInputStream>(shared_from_this());
}
return inputStreamPtr;
}

OutputStreamPtr AbstractSocketImpl::getOutputStream()
{
if ( !outputStreamPtr )
{
outputStreamPtr = make_shared<SocketOutputStream>(shared_from_this());
}
return outputStreamPtr;
}


You may notice that inputStreamPtr has type shared_ptr<InputStream>, but make_shared creates a shared_ptr<SocketInputStream> object. The types differ, yet neither GCC nor the Microsoft compiler reports an error under C++11, because a shared_ptr to a derived class converts implicitly to a shared_ptr to its base class. There is also an explicit way to convert the smart pointer, using static_pointer_cast<T> or dynamic_pointer_cast<T>:

inputStreamPtr = static_pointer_cast<InputStream>( make_shared<SocketInputStream>(shared_from_this()) );


### Single-Inheritance

I had some concern about making AbstractSocketImpl inherit from std::enable_shared_from_this: why not make AbstractSocket inherit from std::enable_shared_from_this instead, since AbstractSocketImpl already inherits from AbstractSocket? But then how do we deal with shared_from_this(), given that the template types of AbstractSocket and AbstractSocketImpl differ?
The solution is the following:

class AbstractSocket : boost::noncopyable, public enable_shared_from_this<AbstractSocket> { ... };

class AbstractSocketImpl : public AbstractSocket
{
std::shared_ptr<AbstractSocketImpl> shared_from_this()
{
return std::static_pointer_cast<AbstractSocketImpl>(AbstractSocket::shared_from_this());
}
};


Once enable_shared_from_this is used, the object must be created on the heap, NOT on the stack, because the weak_ptr inside enable_shared_from_this must be initialized by a shared_ptr owner. Wrapping a pointer to a stack object in a shared_ptr causes invalid memory access:

// AbstractSocketImpl socketImpl(address);  //----> This is NOT right!
auto socketImpl = make_shared<AbstractSocketImpl>(address);  // create on the heap instead
InputStreamPtr inputstream = socketImpl->getInputStream();
OutputStreamPtr outputstream = socketImpl->getOutputStream();


### Multiple-Inheritance

There is a topic on Stack Overflow which describes the correct usage of multiple inheritance from enable_shared_from_this.

## C++11 Lambda Expressions

The first time I saw this kind of expression, it felt very strange to me:

file_buffer<uint8_t>::open(outputFileName, std::ios::out).then([=](streambuf<uint8_t> outFile) -> pplx::task<http_response>
{
*fileBuffer = outFile;

// Create an HTTP request.
// Encode the URI query since it could contain special characters like spaces.
http_client client(U("http://www.bing.com/"));
return client.request(methods::GET, uri_builder(U("/search")).append_query(U("q"), searchTerm).to_string());
})


So what does [=] (parameter list) -> return type { } mean, exactly?

It is a lambda expression in C++11. A lambda expression represents a callable unit of code. It can be thought of as an unnamed, inline function. Like any function, a lambda has a return type, a parameter list, and a function body. Unlike a function, lambdas may be defined inside a function. A lambda expression has the form:

[capture list] (parameter list) -> return type { function body }

There are detailed descriptions of the lambda expression syntax on MSDN and CPPReference, so I will not explain the syntax here; instead I would like to introduce my understanding and usage of lambda expressions.

In my understanding, a lambda expression creates an object of an unnamed functor (NOT a function).

A functor is pretty much just a class which defines operator(). That lets you create objects which “look like” a function (Stackoverflow):

// this is a functor
struct add_x {
    add_x(int val) : x(val) {}  // Constructor
    int operator()(int y) const { return x + y; }

private:
    int x;
};

// Now you can use it like this:
add_x add42(42); // create an instance of the functor class
int i = add42(8); // and "call" it
assert(i == 50); // and it added 42 to its argument

std::vector<int> in; // assume this contains a bunch of values
std::vector<int> out(in.size());
// Pass a functor to std::transform, which calls the functor on every element
// in the input sequence, and stores the result to the output sequence
std::transform(in.begin(), in.end(), out.begin(), add_x(1));
assert(out[i] == in[i] + 1); // for all i


Since we already have functors, why do we need lambdas?

I think one important feature of a lambda is that it creates an anonymous object that can run right where it is defined.

Java programmers are familiar with creating an anonymous class; such an expression was not supported in C++. But lambda expressions in C++11 achieve something similar. A Java programmer can define an anonymous Thread class:

public class A {
    public static void main(String[] arg)
    {
        new Thread()
        {
            public void run() {
                System.out.println("blah");
            }
        }.start();
    }
}


C++ can now pass a lambda expression directly into a function call, since it just passes an object into that function. The grammar differs from an anonymous class in Java:

void fillVector(vector<int>& v)
{
// A local static variable.
static int nextValue = 1;

// The lambda expression that appears in the following call to
// the generate function modifies and uses the local static
// variable nextValue.
generate(v.begin(), v.end(), [] { return nextValue++; });
//WARNING: this is not thread-safe and is shown for illustration only
}


The programmer can directly pass a functor object, with the function body written inline, as an argument; the code runs right where it is defined.

## Reverse Bits

https://leetcode.com/problems/reverse-bits/

Reverse bits of a given 32 bits unsigned integer.

For example, given input 43261596 (represented in binary as 00000010100101000001111010011100), return 964176192 (represented in binary as 00111001011110000010100101000000).

If this function is called many times, how would you optimize it?

This question is much related with Number of 1 Bits

Solution 1:

class Solution {
public:
    uint32_t reverseBits(uint32_t n)
    {
        uint32_t value = 0;
        for (uint32_t i = 0; i < 32; ++i)
        {
            uint32_t tmp = (n & ((uint32_t)1 << (31 - i))) ? 1 : 0;
            value |= tmp << i;
        }
        return value;
    }
};

Solution 2:

uint32_t reverse(uint32_t x)
{
x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);
x = ((x >> 16) & 0xffffu) | ((x & 0xffffu) << 16);
return x;
}

## Number of 1 Bits

https://leetcode.com/problems/number-of-1-bits/

Write a function that takes an unsigned integer and returns the number of ‘1’ bits it has (also known as the Hamming weight).

For example, the 32-bit integer ‘11’ has binary representation 00000000000000000000000000001011, so the function should return 3.

The normal solution is mostly like the following:

class Solution {
public:
int hammingWeight(uint32_t n)
{
unsigned int count = 0;
while(n)
{
count += n & 1;
n >>= 1;
}
return count;
}
};


When searching on Stack Overflow, I found an interesting solution:

This is known as the ‘Hamming Weight‘, ‘popcount’ or ‘sideways addition’.

The ‘best’ algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions will almost certainly be fastest, however, the single-instruction algorithms are ‘usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster’.

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a ‘cache miss’, where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0’s or mostly 1’s then there are very efficient algorithms for these scenarios.

I believe a very good general purpose algorithm is the following, known as ‘parallel’ or ‘variable-precision SWAR algorithm’. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

int NumberOfSetBits(int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

This is because it has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.

References:

http://graphics.stanford.edu/~seander/bithacks.html

http://en.wikipedia.org/wiki/Hamming_weight

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)