**Problem formulation**
You have a dataset in which each observation is an impression of a banner ad. You have a variable indicating success (say, click or conversion) and a lot of additional information (features), all of which are coded as binary: browser type, host, URL, placement, banner format, banner id, banner keywords, hosts the viewer has seen so far, how many times the viewer has seen particular banners, when did s/he see these banners, how many times did s/he click on which banners, how did s/he behave while shopping online, current date and time, geoid information, and so on.

The question is which features increase the chance of success and which decrease it? This is an important question if you want to allocate your advertising resources efficiently. The difficulty is that the number of observations is in billions and the number of available features is in millions.

**A solution**
A naïve approach is to create a table which informs how many successes and how many failures occurred when feature was present or absent. Then, you can compare success ratio in absence of the feature with the success ratio in presence of the feature. If the latter is higher than the former, then the feature indicates higher probability of success.

This approach is similar to calculating simple correlation between the feature and the success indicator. And thus, it suffers from

endogeneity. If a combination of two features often occurs together, say a particular host and a particular banner, and both of them seem to have high correlation with the success indicator, you do not really know whether this is the banner that drives success, the host, or both.

In order to separate the effects of features, you need to calculate partial correlation, conditional on other features, rather than simple correlation. The straightforward way to do it is to perform an

ordinary least squares regression on the data. Unfortunately, there exists no software that could handle amounts of data you have. Even if you limit the dataset to most common features – say top 5000 – you still end up with several terabytes of data to be processed by the regression algorithm. To focus attention, let us say that we need a way to perform a regression on n = 4 billion observations and k = 10 thousand features. If each variable takes up 4 bytes, the amount of memory required to perform such analysis equals nearly 160 terabytes.

Typically, linear least squares models are fit using

orthogonal decomposition of the data matrix. For example,

R package uses QR decomposition. One can use also

singular value decomposition. Unfortunately, these methods require all data to be kept in memory and have algorithmic complexity of O(nk

^{2}).

Alternatively, one can calculate

Gram matrix. This has algorithmic complexity of O(nk

^{2}) which can be reduced to O(np

^{2}) if the data are sparse (where p is the quadratic mean number of features per observation) and very easily parallelized. Another advantage is that memory requirement for calculating Gram matrix are O(k

^{2}) only and for k = 10000 the exact amount of RAM required to keep Gram matrix would be just under 200 MB (keep in mind that Gram matrix is symmetric). The only problem here is that to calculate regression coefficients, it is necessary to invert the calculated Gram matrix (which is often discouraged due to inferior numerical stability and takes O(k

^{3})). The viability of this solution depends thus on whether it is possible to do it with satisfactory numerical accuracy. As it turns out, it is.

Note that popular machine learning engines like

Vowpal Wabbit are not of much use in this situation. Machine learning is usually concentrated on prediction, rather than accurate estimation of model parameters. Engines like VW in principle are less accurate than OLS. They allow multi-collinearity of variables which in turn forces user to perform separate data analysis in order to eliminate it in the first place. Finally, they do not allow for standard statistical inference with the model parameters.

**Preliminaries**
The plan was to create a C++ class able to do all operations necessary for this regression. The data were stored on a remote Linux server using

Hadoop. I was planning to develop and debug my solution using Microsoft Visual Studio 2015 on my Windows 7 64-bit Dell computer (i7-4790 @ 3.6 GHz with 16 GB RAM) and then to port it to its final destination.

There were four initial things I had to take care of: (1) a way of measuring code performance, (2) a way of measuring numerical accuracy of matrix inversion, (3) C++ libraries for inverting matrices, and (4) a strategy for verifying accuracy of the entire algorithm.

Boy, was it hard to find a good way to precisely measure code execution time on Windows. Unfortunately, the usually recommended

GetTickCount() Windows API function relies on the 55 Hz clock and thus has a resolution of around 18 milliseconds. Fortunately, I eventually found out about the

QueryPerformanceCounter() function, whose resolution is much better.

Next, I decided to use the following measure for numerical precision of matrix inversion. Let us say that you need to invert matrix A. You use an inversion algorithm on it which generates matrix B. If matrix B is a perfect inverse of A, then AB = I, where I is the identity matrix. Hence, I calculate matrix C = AB – I. Then, I find the element of matrix C that has the highest absolute value and call it r. This is my measure of numerical precision. In the world of infinite precision, r = 0. In the real world r < 1e-16 is perfect (I use double – a 64 bit floating point type for my calculations). r < 1e-5 is still acceptable. Otherwise there are reasons to worry.

With tools for measuring performance and accuracy, I was able to start testing libraries. I initially turned to

Eigen which was very easy to install and use with my Visual Studio. Eigen uses

LU decomposition for calculating matrix inverse and was satisfying in terms of speed and reliability – up to the point when I tried to invert a 7000x7000 matrix. Eigen kept on crashing and I could not figure out why. The second option was thus

Armadillo. Armadillo did not have the same problems and worked well with bigger matrices all the way up to 10000.

As it turns out, Armadillo can take advantage of the fact that Gram matrix is symmetric and positive-definite. The inversion is done by means of

Cholesky decomposition and after a few experiments I realized that it is not only faster but also numerically more reliable than LU-based method. I was able to invert a 10001x10001 matrix in 283 seconds (in a single thread) with r = 3.13e-14. The irony is that both Cholesky decomposition and matrix multiplication work in O(k

^{3}) but the latter is over twice as slow, so it takes much more time to check numerical precision than to perform actual inversion.

Finally, I designed a data generating process to test whether least squares algorithm of my design can recover parameters used to generate the data. Essentially, I created 10001 variables x

_{i} for i=0, 1, 2, …, 10000. x

_{0} = 1, always. For i>0 we have P(x

_{i} = 1) = 1/(3+i) = 1 – P(x

_{i} = 0). Then, I created a vector of parameters b

_{i}. b

_{0} = 0.0015 and for any non-negative integer j, b

_{4j+1} = 0.0001, b

_{4j+2} = 0.0002, b

_{4j+3} = 0.0003, and b

_{4j+4} = -0.00005. Finally, P(y = 1) = x * b, where * indicates

dot product. This is a typical

linear probability model.

Using the formula above I generated 4 billion observations (it took 11 days on 4 out of 8 cores of my Windows machine) and fed them into the regression algorithm. The algorithm was able to recover vector b with the expected convergence rate. Note that by design the aforementioned data generating process creates variables that are independently distributed. I thus had to tweak this and that to see whether the algorithm could handle correlated features as well as to investigate

the bias (see more about that in the last section).

**Statistical inference**
The question of how to recover model parameters from the data is simple. In addition to the Gram matrix, you need a success count vector. The i-th element in this vector indicates how many successes were there when i-th feature was present. Calculating this vector is at most O(np) in time and requires O(k) memory (note that none of the operations involved in calculating Gram matrix and success count vector are floating point operations – this is all integer arithmetic since we operate on binary variables only; thus both Gram matrix and success count vector are calculated with perfect numerical precision). Once you have them both, you need to invert the Gram matrix and multiply it by the success count vector. The resulting vector contains estimated model parameters.

However, getting standard errors of the estimated coefficients is a bit more complicated. Typically, we would use diagonal elements of the inverted Gram matrix and multiply them by standard deviation of the residuals. The problem is that calculating residuals requires going through all observations all over again. This not only increases the execution time. It poses a major technical difficulty as it requires the dataset to be invariant for the duration of the algorithm execution (which is assumed to be at least several hours). To fix this, one would have to tinker with the data flow in the entire system which can greatly inflate project’s costs.

Fortunately, there is trick that can rescue us here. Instead of quadratic mean of residuals, one can use standard deviation of the success variable. Note that the latter must be greater than the former: the former is the

quadratic mean of residuals for the entire model and the latter is the quadratic mean of residuals for the model with a constant only. This guarantees that the standard errors will be overestimated which is much better than having them underestimated or all over the place. Moreover, for small average success ratio, the two will be close. In fact, it is easy to show that under some plausible conditions as the average success ratio goes to zero, the two are the same in the limit. And for banner impressions, the average success ratio (e.g.

CTR) is, no doubt, small.

No amount of theoretical divagations can replace an empirical test. It is thus necessary to check ex post whether statistical inference using the above simplifications is indeed valid. To do that, I estimate a number of models (keep in mind that I have 10000 variables) and check how frequently the estimated coefficients are within the 95% confidence intervals. I expect them to be there slightly more often than 95% of the time (due to overestimation of standard errors) and indeed, this is what I find.

Finally, I cannot write a section about statistical inference without bashing

p-values and

t-statistics. I strongly discourage you from using them. A single number is often not enough to facilitate good judgment about the estimated coefficient. p-value typically answers a question like: “how likely is it, that the coefficient is on the opposite side of zero?” - Is this really what you want to know? The notion of

statistical significance is often misleading. You can have a statistically insignificant coefficient whose confidence interval is so close to zero that any meaningful influence on the dependent variable is ruled out: you can then say that your data conclusively show that there is no influence (rather than that the data do not show that there is influence). Also, you can have a statistically significant coefficient with very high t-statistic, which is

economically insignificant or economically significant but estimated very imprecisely. Thus, instead of p-values and t-statistics I suggest using confidence intervals. The question they answer is: what are the likely values of the coefficient? And this is what you actually want to know most of the time.

**Data refinements**
Oops. You have a nice OLS algorithm which supports valid statistical inference. You tested it with you generated data and it works fine. Now you apply it to real data and the Gram matrix does not want to invert or inverts with precision r > 1. You quickly realize that it is because the data have a lot of constant, perfectly correlated, and

multicollinear variables. How to deal with that?

Sure, you can force users to limit themselves only to variables which are neither perfectly correlated nor multi-collinear. But when they are using thousands of variables, it may take a lot of effort to figure it out. Also, running an algorithm for several hours only to learn that it fails because you stuffed it with a bad variable (and it does not tell you which one is bad!) simply does not seem right. Fortunately, as it turns out, all these problems can be fixed with analysis and manipulations on the already-calculated Gram matrix.

The first refinement I suggest is dropping features that are present too few times (e.g. less than 1000). You can find them by examining diagonal entries of the Gram matrix. To drop a variable you can just delete appropriate row and column from the Gram matrix as well as corresponding entry form the success count vector. After such a delete operation, what you are left with is the same as if you did not consider the deleted variable to begin with. Clear cut.

The second refinement I suggest is to drop features with not enough variability. Based on the Gram matrix and the success count vector, it is possible to construct a variability table for every feature (the same one I described as the naïve solution at the beginning of the article). This table has two rows and two columns – rows indicate whether there was a success and columns indicate whether the feature was present. Each cell contains the number of observations. So you have the number of observations that had a feature and there was a success, a number of observations that had a feature but with no success, a number of observation without this feature but with success, and a number of observations with neither feature nor success. I drop features for which at least one of the four cells has a value lower than 10.

As we proceed with the third refinement, note that you can easily calculate correlation between any two features based on the content of Gram matrix. Just write out the formula for correlation and simplify it knowing that you are dealing with binary variables to realize that you have all information you need in the Gram matrix. This of course allows you to identify all pairs of perfectly correlated or highly correlated variables in O(k

^{2}) time. I got rid of a variable if I saw correlation whose absolute value exceeded .99 (doing, say, .95 instead of .99 did not dramatically improve speed or numerical precision of the algorithm).

But now comes a biggie. How to find features that are perfectly multicollinear? One naïve approach is to try to find all triples of such variables and test them for multicollinearity, find all quadruples, quintuples, and so on. The trouble is that finding all n-tuples can be done in time O(k

^{n}) which is a nightmare. Alternatively, you can try to invert submatrices: if you can invert a matrix made up of first p rows and columns of the original Gram matrix, but you cannot invert a matrix made up of the first p+1 rows and columns of the original, it surely indicates that the variable number p+1 causes our Gram matrix to be singular. But this solution has a complexity of O(k

^{4}) which for high k may be very cumbersome. There must be a better way.

As it turns out a better way is to perform a

QR decomposition of the Gram matrix (not to confuse with QR decomposition of the data matrix as a part of the standard linear least squares algorithm). The diagonal elements of the R matrix are of interest to us – a zero indicates that a variable is causing problems and needs to be eliminated. QR decomposition generates the same results as the “invert submatrices” algorithm described above – but it runs in O(k

^{3}). And, of course, it is a good practice to check its numerical precision in a similar way we were checking numerical precision of matrix inversion algorithm.

Finally, note that you can sort the Gram matrix using its diagonal entries. I sort it descending so that features that get eliminated are always the features which occur less frequently. It is probably possible to achieve higher/lower numerical precision by sorting Gram matrix, however I have not investigated this issue extensively. I only noticed that in some instances sorting the Gram matrix ascending made the LU inversion algorithm fail (too high r) while sorting descending or not sorting did not affect the LU algorithm much.

All these operations require some effort to keep track of which variables were eliminated and why, and especially how variables in the final Gram matrix (the one undergoing inversion) map to the initial variables before the refinements. However, the results are worth the effort.

**Integration and application**
The task of integrating new solutions with legacy systems may be particularly hard. Fortunately, in my case, there already existed data processing routines that fed off of the same input I needed (that is a stream of observations in a sparsity supporting format – a list of “lit-up” features), as well as input generating routines that filtered original data with given observation and feature selectors.

I had a

shared terminal session using Screen with people responsible for maintaining C++ code for analysis done on these data to-date. We were able to link up my class within the current setup so that users can use the same interface to run the regression that they used previously to do other type of analyses. Later on, I had to do some code debugging to account for unexpected differences in data format but ultimately everything went well.

The first real data fed to the algorithm had 1.16 billion observations and 5050 features. Calculating Gram matrix and success count vector took around 7 hours. Due to refinements, the number of features was reduced to 3104. Inverting matrix took just a few seconds, and the achieved precision was around 2e-7.

**Pitfalls**
In this final section I would like to discuss three potential problems that do not have easy solutions: variable cannibalization, bias, and causality.

It often happens that a number of available features refer to essentially the same thing. For example, you may have features that indicate a person who did not see this banner in past minute, 5 minutes, hour, and day. These features will be correlated and they have a clear hierarchy of implication. A user can make an attempt to run a regression using all these features expecting that the chance of success will be a decreasing function of the number of impressions. However, the effect of a viewer who has never seen the banner will not be attributed entirely to any of the aforementioned features. Instead, it will be split among them, making the estimated coefficients hard to interpret. This is the essence of cannibalization – similar variables split the effect they are supposed to pick up and therefore none of them has a coefficient it should have (please let me know if you are aware of a better term than “cannibalization”). The simple but somewhat cumbersome remedy for it is to manually avoid using features with similar meaning in one regression.

Secondly, it is widely known that linear probability model generates bias. The biased coefficients are usually closer to zero than they should be. To see why, consider a feature whose effect is to increase probability of success by 10%. However, this feature often occurs with other feature whose presence drives the probability of success to -25% (that is zero). Presence of the feature in question can at best increase the probability to -15% (that is still zero). As a result the feature in question does not affect the outcome in some sub-population due to negative predicted probability. Its estimated effect is thus smaller (closer to zero) than expected 10%.

Note that the reason why linear probability model generates biased results is not because the regression algorithm is flawed but because the model specification is flawed. The P(y = 1) = x * b model equation is incorrect if x * b is smaller than zero or bigger than one because probability by definition must be between 0 and 1. Whenever x * b is outside these bounds, the coefficients end up being biased. That is, OLS correctly estimates partial correlation between independent and dependent variables, but, due to data truncation, partial correlation is not what is needed to recover the linear probability model parameters.

The resolution of this issue may go towards assuming that model specification is correct and finding ways to alleviate bias or at least towards identifying features whose coefficients may be biased. On the other hand it may be also possible to assume that the linear probability specification is incorrect and to investigate whether partial correlation is what is really needed for the decision problems the estimates are supposed to help with. I consider solving this problem an issue separate from the main topic of this article and I leave it at that.

Finally, I would like to make a note on causality. Partial correlation, as any correlation, does not imply causation. Therefore, it may turn out that a particular feature does not have a causal effect on probability of success but instead is correlated with an omitted variable which is the true cause of the change in the observable behavior. For example, one host can have a higher conversion ratio than the other. However, the reason for that may be that the advertised product is for females only. The population of females may be much smaller for the second host even though higher fraction of them buys the product. In such case the second host is actually better at selling the product (that is it is better to direct generic traffic to the second host rather than to the first one) but this information is obscured by inability to distinguish between male and female viewers. It is thus important to remember that the regression provides us only with partial correlation rather than proofs of causality.

The issue of causality is of extreme importance when we are trying to predict effects of policy (like redirecting traffic in the example above). However, when instead of policy effects, we are interested in predictions, partial correlation seems to be a sufficient tool. For example, you may want to know whether people using Internet Explorer are more likely to click on a banner, even though you do not have the ability to influence what browser they are using. In such situations establishing causality is not necessary.

:-)