Thursday, March 17, 2016

Persecution of homosexuals

It is just a hypothesis.

Because they invest far more resources in giving birth to and rearing children, women are generally pickier than men when choosing their sexual partners. They are harder to get than men are.

Men, on the other hand, have a high libido that motivates them to compete for women and to pursue them, often despite the women's reluctance.

Men could easily release this sexual tension if they could have sex with each other instead of with women. However, such behavior would lead to lower chances of reproduction. So evolution came up with a solution: men feel repelled when thinking about having sex with other men.

In other words, males have an innate revulsion towards sexual interaction with other men (i.e. gay sex). Otherwise they would be releasing their sexual tension by having sex with each other instead of pursuing women. I suspect this mechanism has been around for many millions of years. It arose in our ancestral animals (which could not masturbate) and has lingered on ever since.

Women, on the other hand, do not have to compete for or pursue men as vigorously, and so their everyday libido is at a lower level. They do not need protection against the possibility of releasing their sexual tension in an evolutionarily disadvantageous way – because they do not have as much tension. As a result, they do not have the same innate revulsion towards lesbians as men have towards gays.

Also, according to this hypothesis, men do not have reasons to feel revulsion towards lesbian sex and women do not have reasons to feel revulsion towards gay sex. It seems that these predictions more or less conform to reality.

In the end, discrimination against gays comes from the fact that men are naturally repelled by gay sex for evolutionary reasons. The reason this sometimes extends to the entire society being against both gays and lesbians is guilt by association, often institutionalized by various cultural vehicles like religious scriptures.


Monday, March 7, 2016

Battle over e-cigarettes

I was recently asked by a friendly company to prepare a report on e-cigarettes. Some of the workers had started complaining about secondhand e-cigarette vapor. The discussion got heated because five pregnant women were working in the open-space office where the vapers were located. HR did some preliminary research and sent out an email with new rules. They said that e-cigarettes were more carcinogenic than traditional cigarettes and that the WHO advised prohibiting vaping at the workplace because of the threat of secondhand vapor. Vaping got banned.

This resulted in a backlash, and some vapers ended up writing long emails about how this was all untrue. They cited sources like the NHS as well as scientific papers supposedly debunking the claims made by HR. Nevertheless, the management decided that the ban on e-cigarettes was here to stay. In the meantime, they asked me for my opinion.

In a nutshell, e-cigarettes are most likely much less harmful than traditional cigarettes. They do not contain most of the harmful substances released by burning tobacco. The amounts of other supposedly harmful substances in secondhand vapor may be lower than in other products happily ingested or inhaled by people (including pregnant women). There exists no scientific proof that e-cigarettes are toxic to bystanders.

E-cigarettes are indeed a different product from traditional cigarettes, and comparisons between the two are unjustified. Their sin is that they share the name, a single ingredient (nicotine), and a manner of consumption that looks similar to that of the super-villain traditional cigarette. Guilt by association on one hand and interest groups on the other contribute to fearmongering about these products. As a result, e-cigarettes are, in the eyes of the public, guilty until proven innocent.

Nevertheless, there is a small risk that the exhaled vapor contains amounts of nicotine high enough to damage or otherwise hamper the development of a fetus. Pregnant women should have the right to be protected from exposure to vapor until research shows that e-cigarettes are safe for fetal development. In general, adults should have the right to breathe air that is not contaminated by other people's emissions, be it cigarette smoke, scented candles, body odor, farts, or vapor – in the same way that visual and sonic pollution of common spaces is often limited.

In the context of an office environment, vaping should thus be prohibited if coworkers strongly request it. Since there is currently no definitive scientific evidence either way, a firm can decide to do whatever maximizes profit. If vaping reduces productivity by making workers unhappy, it should be banned. If it increases productivity by eliminating cigarette breaks, then it should be allowed. As for litigation, it is unlikely that a plaintiff could succeed in suing a company for exposing them to secondhand vapor, because there is no scientific proof that it is harmful. It is more likely that vapers could succeed in suing a company that confined them to a space occupied by regular smokers, since tobacco smoke is proven to be harmful.

All these ideas will be developed in detail in what follows. Let us start with nicotine.


Nicotine is addictive. This in itself is not bad for the health of an individual. The only well-established risk associated with nicotine intake is an increased chance of developing cardiovascular disease. But since nicotine is the addictive component of tobacco, it is sworn enemy number one of many health officials. Thus, much more effort goes into proving its harmfulness than into disproving it or proving its benefits. Scientific papers on the benefits of nicotine always underline that it is also harmful, but scientific papers on its disadvantages (much more numerous) hardly ever mention any benefits at all. This is a signal that the scientific community is not analyzing the issue objectively. Nevertheless, despite all the effort, few adverse effects of nicotine have been confirmed.

There are a number of theories linking nicotine use to malfunctions of various systems in the body. Most notably, nicotine is often linked to cancer or fetal problems. These claims, however, rest on guilt by association – both are proven results of smoking, which involves inhaling many other substances. Clinical trials with nicotine replacement therapy do not confirm either of these claims. As one influential study says: “The safety of NRT in terms of effect on fetal development and birth outcomes remains unclear in pooled data from this review.”

The current scientific knowledge indicates that if the dose is right, the effects of nicotine on the human body are similar to those of caffeine. The only difference is that nicotine is more addictive.

The difference between cigarettes and e-cigarettes

Smoking cigarettes involves inhaling the smoke generated during combustion of tobacco leaves. Some studies try to prove that the chemical composition of tobacco makes its smoke especially carcinogenic, but the truth is that any regular smoke contains a lot of carcinogenic substances. Smoked food has been identified as a cause of cancer (fried food as well). Any other smoke, including smoke from chimneys, cannabis, scented candles, or car exhaust fumes (especially diesel), also contains carcinogens. Two main factors influence how carcinogenic it is: (1) the dirtiness of the smoke – low temperatures and limited oxygen can contribute to incomplete combustion, leaving many reactive particles intact – and (2) the amount of smoke inhaled – definitely higher in the case of smoke produced exclusively for the purpose of inhaling it multiple times a day.

On the other hand, the aerosol (often called vapor) generated by e-cigarettes is not a result of combustion. A part of the e-cigarette called the atomizer heats the e-liquid, causing it to turn into an aerosol. As a result, there is no change in the chemical composition of the ingredients (unless there is contamination).

Cigarette smoke contains thousands of different substances, many of which are identified as harmful, either as carcinogens or otherwise. E-cigarette aerosol does not contain these substances. The chemical composition of regular cigarette smoke is much closer to the smoke generated by a scented candle or a chimney than to the vapor generated by e-cigarettes. It is thus silly to automatically associate e-cigarettes with regular cigarettes just because they are used similarly and have a similar name. They are completely different products with largely separate sets of advantages and disadvantages.

Potential harms of e-cigarettes

Looking for negative effects of e-cigarettes, it is easy to come across lists including items such as: (1) lithium-ion batteries of e-cigarettes sometimes explode, causing burns, (2) e-liquid can cause poisoning in children who drink it, or (3) adults sometimes confuse e-liquid with another product like eye drops, which also causes poisoning. These are ridiculous claims that are not idiosyncratic to e-cigarettes but can be applied to any electronic or chemical product. They are found on such lists not because e-cigarettes are especially prone to such accidents but because anti-e-cigarette activists try to make them look scarier (a tactic that may backfire, as it undermines the credibility of the authors in the eyes of a skeptical reader). For the purpose of this analysis I will focus only on the aspects related to inhaling vapor.

The ingredients of e-liquid are most often: propylene glycol, glycerol, water, and nicotine. The first two compounds are widely used in the food industry and are proven to be safe for consumption by humans including pregnant women. In addition, e-liquid often contains flavorings and other additives that depend on the brand of the product. These need to be checked individually for potential adverse effects in the same way flavors and other additives to food products need to be checked.

Some studies indicate that e-liquid may get contaminated, in which case the vapor contains other potentially toxic chemicals, for example heavy metals. It is however important to remember that most of these potentially harmful substances are present virtually everywhere. For example, lead is present both in the air at the top of Mount Everest and in the seawater of the Mariana Trench (as a curiosity, lead in seawater is three times as common as gold but a hundred times less common than uranium). The mere fact that a substance is present does not matter. It matters only if the concentration is high enough to affect human health. And no study has shown that vapor contains harmful quantities of contaminants.

There does not seem to be anything in e-cigarettes that justifies the outcry and fearmongering about their potentially harmful ingredients.

Why are people so wary of e-cigarettes, then?

There are four main articles on Wikipedia about e-cigarettes: (1) Electronic cigarette, (2) Safety of electronic cigarettes, (3) Electronic cigarette aerosol and e-liquid, and (4) Positions of medical organizations on electronic cigarettes. Health-related parts of these articles look like battlefields. They consist of intertwined positive and negative statements debunking each other. Almost every sentence has references to scientific sources. The irony is that virtually none of these statements prove anything. Words like “may” or “can” are much more common than “is” or “do.” The overall message (often explicitly expressed at the beginning of an article) is that nobody knows anything. These Wikipedia articles are thus the longest, most elaborate and well-sourced but also entirely meaningless and pointless texts one can ever imagine.

What are the forces that created these battlefields? In one corner there are vapers (who want to vape) and the vaping industry (which wants to make a profit). In the other corner are concerned citizens, most health organizations (with the notable exception of the British ones), and politicians. Surprisingly, Big Tobacco – the most obvious potential opponent of e-cigarettes, since they constitute competition – does not seem to be a driving force. On the contrary, Big Tobacco is slowly trying to diversify by investing in the e-cigarette industry.

Consider California. The Master Settlement Agreement – a deal between 46 US states and the major tobacco companies – provides California (and other states) with a steady stream of money intended to cover medical expenses caused by tobacco use. The source of the money is the tobacco companies, and the amount depends on their sales (a few cents per cigarette). Like other states, California decided to securitize the future payments in order to get more money upfront. This resulted in the creation of so-called tobacco bonds. The government issues these bonds and the buyers get repaid with the money the government receives over time from tobacco companies.

When the sales of cigarettes are too low and the amount of money is not enough to repay the debt, the bonds are in default. However, since the revenues will continue as long as cigarettes are being sold, a default often means that the creditors will eventually receive their money, just later. Some states (including California) add their own backing to boost the creditworthiness of tobacco bonds. That is, they promise to repay creditors using tax receipts if the money from tobacco companies is not enough.

Because state revenues depend on the number of cigarettes sold, states have an incentive to maintain this source of revenue by, say, banning or taxing e-cigarettes. In addition, securitization created a lobby – people who bought tobacco bonds – whose interest lies in maximizing the number of regular cigarettes sold. And the amounts of money we are talking about are not small – by 2007 California had issued nearly $17 billion worth of tobacco bonds. Various states, including California, are working on banning e-cigarettes or taxing them comparably to regular cigarettes. Members of California’s tax commission shamelessly spread bullshit in the documents calling for higher taxation: in a 2015 report Ms. Fiona Ma, in addition to making statements already debunked in this article, writes that: “Tobacco companies claim that E-cigarettes are not as harmful as conventional cigarettes (…). However, these claims are refuted by strong scientific evidence that claims that E-cigarettes can be just as harmful as conventional cigarettes.” What evidence? Unfortunately, no sources are provided.

Health organizations

But the worst sources of misinformation and fearmongering are health officials and health activists. Let us analyze the 2015 report signed by the then director of the California Department of Public Health. A long litany of concerns regarding e-cigarettes starts with indications that they are more and more popular, especially among young people, and that many of these young people have never smoked cigarettes. These facts are all true. But they are not obviously bad. They would be bad if the overall health of the population were declining due to the growing consumption of e-cigarettes. And nobody has been able to prove that so far. The fact that adolescents who never smoked tobacco use e-cigarettes does not mean anything: maybe these individuals would be using regular tobacco products instead, if e-cigarettes were not available.

The report states: “Research suggests that kids who may have otherwise never smoked cigarettes are now becoming addicted to nicotine through the use of e-cigarettes and other e-products.” And then there is a reference to a scientific paper in which we read: “This is a cross-section study, which only allows us to identify associations, not causal relationships.” That is, the study itself says that it does not claim that e-cigarettes cause an increase in nicotine addiction. Whether the authors of the report intentionally lied is hard to say – it seems more likely that they did not read the paper they were citing and their conclusions were shaped by strong confirmation bias. In a nutshell: they did not lie, they were just lazy and biased.

Some additional statements in the report include:
  • Nicotine is a highly addictive neurotoxin, especially in adolescents. The report fails to specify that it is indeed harmful to adolescent rats when administered by injection. Trials in humans do not confirm these claims.
  • Vapor is a concoction of toxic chemicals, at least ten of which are known to cause cancer, birth defects, or other reproductive harm. This statement can be equally truthfully made about many types of food we eat daily, e.g. french fries.
  • Children may mistake e-liquid for something edible, and e-cigarettes sometimes leak, which can lead to poisoning when e-liquid is ingested or used as eye drops by mistake. Yes, misuse happens with every product. But does it happen more often with e-cigarettes?
  • Claims that e-cigarettes help to quit smoking are unproven. But switching from smoking to vaping precisely is quitting smoking, isn’t it?
  • E-cigarettes are undermining current smoke-shaming norms and provide a way around smoking bans. True, only if you equate smoking with vaping in your mind. Otherwise false.

The overall picture that emerges is as follows. Health officials are under the influence of several forces that make them such strong opponents of e-cigarettes:
  • They are often incompetent, biased, and lacking in critical thinking (a traditional human characteristic). They engage in herd behavior – if so many people around me say it, it must be true (a traditional human characteristic). They are lazy and do not check up on other people, especially not those with whom they agree upfront (a traditional human characteristic).
  • Their job is to protect the population from any health risks (including those that are overblown or imaginary). They take their task of policing other people and telling them what to do too seriously. They often neglect aspects other than the direct impact on health (economic issues, unintended consequences, etc.).
  • Bashing e-cigarettes is popular because it is easy to associate e-cigarettes with regular cigarettes, and the latter are proven to be harmful. They also respond to pressure: some people, especially fearful parents, demand the bashing of e-cigarettes.
  • Their views are reinforced by other officials, most notably those responsible for tax revenues.
  • They worked hard to brand smoking as bad and e-cigarettes constitute a new trend that in their eyes threatens this achievement. 

To vape or not to vape?

Now that we have established what the facts are and explained the sources of confusion, it is time to make a decision: to vape or not to vape?

A rational individual should consider both the advantages and disadvantages of vaping and choose whatever this cost-benefit analysis indicates. This is hard to do in practice because the ideological warfare reduces the quality of available objective information. For example, the research on the positive effects of nicotine is seriously underdeveloped, while the research on its negative effects is overdeveloped and full of exaggerations.

If you are a smoker, then the answer is simple: stop smoking and start vaping. If you are a non-smoker, there are probably better methods of getting the benefits nicotine provides: for example, if you need to get focused, you may consider drinking coffee. But if there are no other ways for you to, say, relieve stress, you may try vaping to see if it helps you. Remember to consider how addictive your personality is. Some people get addicted much more easily than others. If you are one of them, it is riskier for you to experiment with nicotine because you may get addicted even if the benefits turn out not to be worth the costs.

But the main objective of this article is to advise a company on a policy. Should HR ban vaping in the company buildings? Things to consider when answering this question are summarized below, for each of the two possible policies:

Health effects on employees
  • Vaping allowed in common spaces: non-smokers may be exposed to vapor (which has no proven negative health consequences). Also, people who have quit smoking may be exposed to nicotine, which may induce a relapse.
  • Vaping not allowed in common spaces: vapers may end up vaping in the same areas as smokers, which may be unhealthy for them due to secondhand smoke.

Productivity
  • Vaping allowed in common spaces: some workers may find it irritating or may estimate the risk of inhaling vapor to be high. These workers may become less productive, and tensions between workers create an unproductive working environment for the entire company.
  • Vaping not allowed in common spaces: vapers take multiple breaks a day to go out. Productivity decreases, although probably not by much – breaks are often needed by workers and would happen anyway, just in a hidden way, say by sitting idle next to one’s computer.

Possible lawsuit
  • Vaping allowed in common spaces: non-smokers could sue the company for forcing them to be in the vicinity of vapers, especially if they ended up having health problems, as in the case of traditional cigarettes. A very unlikely scenario.
  • Vaping not allowed in common spaces: vapers could sue the company if they can prove that company rules forced them to be in the vicinity of smokers, especially if they have a prescription for e-cigarettes or if they develop a disease associated with secondhand tobacco smoke but not with vaping. A very unlikely scenario.

In general, the productivity issue seems to be the most important. If many employees request a ban, one should probably be enacted. Otherwise, not.

Thursday, February 18, 2016

Medical news reporting

This is such bullshit. Let me just gloss over the fact that these studies hardly ever establish causality. What they do is establish a partial correlation, which is only as good as the control variables they use. In a nutshell, heart failure is more common among the kind of people who both drink a lot of soda and do not exercise. If the study does not account for the exercise aspect, the higher rate of heart failure will be attributed solely to drinking soda, even though it should be attributed to the lack of exercise as well. This is the so-called “omitted variable bias,” and biomedical studies are plagued with it. The only way to eliminate it, and the only kind of evidence that should be accepted as definitive, is randomized controlled trials.

But this is only scientists’ wrongdoing and I want to complain here about how their findings – whether true or not – are twisted and misinterpreted by the public. To see it, let us look at the three news articles mentioned above and the corresponding research articles (which can be found here, here, and here). Let us discuss the 33% stroke figure from the first article. Upon reading the article, it seems that out of around 600,000 people, roughly 1700 had a stroke, which gives us average incidence of less than 0.3%. That is, if you work over 55 hours a day, your chances of getting a stroke are 0.4% instead of 0.3% that you would have if you worked just under 40 hours a day. In fact, your chances of getting a stroke if you work over 55 hours increase by 0.1%, not 33% as the article implies.

Second story: supposedly, drinking at least two sugary beverages a day increases your chances of heart failure by 23%. From the source article, we see that there were around 42,000 men taking part in the study. Around 4100 of them had heart failure. That gives us an average incidence of 9.8%. Drinking a lot of coke increases the incidence to 12%. Therefore, your chance of getting heart failure rises by 2.2 percentage points, not by the implied scary 23%.

Finally, let us consider the claim that light drinking increases the chances of breast cancer. According to the study, out of around 88,000 women, roughly 19,000 developed cancer. This tells us that an average woman in this study had a 21.6% probability of developing cancer. On the other hand, an average drinking woman had a 24.4% probability of developing cancer. Thus, the probability increases by 2.8 percentage points, which does not seem like much when compared to the overall risk of developing cancer. (My calculations in all three examples are approximate and based on a number of simplifying assumptions, yet they should illustrate the problem.)

So what is going on? What is the source of all these discrepancies? The numbers reported by medical journals are relative risks (and the statements about them are technically correct but may be misleading to an untrained person). That is, they tell you how many times more likely a person who engages in a “risky” behavior is to contract a disease compared to a person who does not. So it may be true that a person who works a lot, a person who drinks a lot of soda, and a person who drinks a little alcohol every day are 1.33, 1.23, and 1.13 times (respectively) more likely to contract the disease in question than a person who does not. But the probability for the person who does not is usually small anyway, so even doubling it often does not increase your overall probability of contracting the disease enough to worry about it.
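To illustrate the arithmetic, here is a tiny C++ sketch that converts a reported relative risk and a baseline incidence into the absolute increase, using roughly the numbers from the working-hours example (the numbers are only illustrative):

    #include <cstdio>

    // A minimal sketch: turning a reported relative risk into an absolute increase.
    int main() {
        double baseline_incidence = 0.003;  // ~0.3% chance of stroke in the reference group
        double relative_risk = 1.33;        // "33% higher risk" as reported
        double exposed_incidence = baseline_incidence * relative_risk;
        std::printf("absolute increase: %.2f percentage points\n",
                    (exposed_incidence - baseline_incidence) * 100.0);
        // prints roughly 0.10 percentage points, not 33
        return 0;
    }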

A related problem has to do with labeling substances as carcinogenic. Usually, anything that increases the probability of cancer is labelled as carcinogenic. Therefore, a substance that increases your chances of contracting cancer from 15.2% to 15.3% (that is, by 0.1 percentage points overall) is treated by the media frenzy in the same way as a substance that increases your chances from 7.5% to 38.1% (that is, by 30.6 percentage points overall). People easily swallow it: to a raging mom dead set on protecting her baby from all harm, something labelled as a carcinogen is pure evil, no matter how strong a carcinogen it is and whether it also has some unrelated health-enhancing benefits.

But this is not all. Even if the numbers were reported correctly, they would still be bad guidance for making decisions. This is because virtually everything has its advantages and disadvantages. Drinking a glass of wine a day may increase your chances of developing cancer but may also decrease your chances of getting a stroke. Quitting smoking may save you from lung cancer but may also cause you to eat more, which will make you obese and increase your chances of heart failure. And so on. The only way to make a good decision is to perform a complete cost-benefit analysis with respect to an ultimate variable of interest. That is, a study is most useful for a person deciding what to do to stay healthy if it reports whether doing something increases or decreases your expected healthy lifespan (and by how much). Only then do you know the full potential impact of such a decision on your health and can decide whether changing your habits is worth it. Unfortunately, studies reporting effects on healthy lifespan are hard to come by.

The focus on incidence and disregard for life expectancy is the source of another public misunderstanding of the current state of affairs: the notion of a cancer epidemic. It is so scary: cancer is currently much more prevalent than it was decades ago and is projected to be even more common in the future. The world surely is becoming a worse place, at least in this one respect, right? Bullshit. The reason cancer is becoming ever more prevalent is that we are better and better at fighting other diseases. The average lifespan continuously increases because we successively eliminate other causes of death. This leaves cancer (along with a few other culprits) to kill the people who did not die of other causes. To sum up: the cancer epidemic is a positive sign of the overall progress in medical sciences.

Wednesday, December 16, 2015

Dick pics

I sometimes come across videos like this

or pictures like this

This is very interesting to me because it shows how people have no idea what is going on in their own brains.

Let us start with the question: why does an unwanted image of a dick cause an emotional reaction (and it seems that both men and women have a similar reaction after looking at a stranger’s dick – some mixture of shame and the feeling of being intimidated)? I remember talking to a male friend on Skype not so long ago. Another male friend of mine stood behind him, suddenly pulled his pants down, and showed his dick to the camera. My immediate reaction was to curse and look away in disgust. The thing is that, unlike many people who send dick pics, my friend knew perfectly well what he was doing – he has read The Human Zoo by Desmond Morris. I did too, but the trick nevertheless worked.

So why do dicks evoke these emotions in us? Do we learn to be scared of dicks? When and how exactly do we learn this? Does somebody tell us? Or do we need an unpleasant experience with a dick that belongs to somebody else?

There is no reason to be afraid of a dick if you have had no such experiences. But people still have these emotions. So what causes them?

Here is another example. When I was just a few years old, I was playing at a riverbank with a group of female friends, all of them a few years old as well. A dude with a mustache was riding a bike nearby. He stopped close to us, silently pulled down his pants, and showed us his junk with a smile on his face. The girls started screaming and ran away. I probably did the same thing, although I do not remember well, as it was such a long time ago (by the way, none of us thinks about this as a traumatic experience now; it was quite benign; after the incident the guy rode off and we never saw him again). Of course, nobody had explained to us beforehand that this was an appropriate reaction in such a situation. But it seemed appropriate. Why?

The answer is that this is our innate instinct. We have inherited from our ancestors some types of social interaction that are guided (among other things) by genital display. People who observe primates know that genital display is a way to communicate social status in a group. Dominant males show their dicks much more often than other individuals. Human brains are wired in a similar way. Seeing somebody’s dick makes you feel intimidated, and human intuition sometimes makes guys show their dicks in order to intimidate others.

A woman often feels disgust after seeing a stranger’s dick, yet she may think that his intention was to arouse her and that the man had no idea his dick looked gross to her. A man asked why he sent a dick pic would probably say something like: “it was a joke; I wanted to embarrass her; I like to show off my masculinity.” What is really going on is that when a man wants to intimidate a woman (or, less often, another man), his primate intuition tells him to show off his dick. The woman verbally misidentifies his intentions but emotionally responds in the intended way. The act achieves its goal. Note that the man also somewhat misidentifies what really caused him to do this. It is because he acts on his animal instinct.

So, as it turns out, unwanted dick pics are a product of our ancestral way of ensuring group cohesion through an authority structure. These mechanisms have little use nowadays, but they still hang around aimlessly in our brains, causing trouble. Moreover, the example of dick pics shows nicely how our verbal processing is disconnected from the part of the brain where we actually make decisions. Neither the man nor the woman verbally understands their own role in this situation – unless they are educated in anthropology or primatology.


Darkness in the sense of justice

People have an innate sense of justice. Our intuition tells us that if somebody did something wrong, they have to be punished. Notions of karma and the like are based on the human tendency to think that there is some cosmic justice. And people who have figured out that there is no objective justice take such justice as an ideal humans should strive for. A just world is what most of us are working towards.

So maybe before we start making decisions based on our sense of justice, it would be good to know where it comes from and whether we should trust it. And of course, like most other intuitions, the sense of justice is a product of natural selection. It is how nature wired our emotions in order to guarantee that we cooperate, enforce social cohesion, and so on. The problem is that nature tends to implement technological trade-offs in her designs, and the things she creates are not perfect. For an easy example, just look up the recurrent laryngeal nerve, which connects the brain with the larynx but detours down into the chest for no apparent reason (a detour that becomes grotesquely long in a giraffe).

And here is the question I have kept asking. Should we trust our intuitions? I would say no. Our intuitions are the animal spirits that nature equipped us with to deal with circumstances much different from those of today. And even when the circumstances are right, the animal spirits are not guaranteed to be perfect. Like anything else designed by natural selection, they are technological trade-offs.

Following our innate sense of justice may lead to a sub-optimal design of society, and thus to more suffering than would be necessary if people behaved rationally. Rationality should be the yardstick against which we judge how efficiently our instincts help us shape society. If you want to rationally maximize social welfare, you should consider the specific consequences of the decisions you make, rather than follow your intuition. For example, it is reasonable to think that some punishment for crimes is necessary in order to deter people from committing crimes. But if such deterrence cannot be achieved, there is no rational reason to punish a person. Moreover, there may be good reasons to offer the person help in order to make them a better citizen rather than let them learn how to be a hardened criminal during their jail time.

You may cringe at the notion that some crimes should go unpunished. But this feeling is precisely the dark, irrational revenge-seeking sense of justice that was implemented in you by nature. If you want to build a better society, you need to set the feelings aside and perform strict cost-benefit analysis of the decisions you are facing. When you compare outcomes obtained with rationality to outcomes obtained with the human innate sense of justice, it is easy to see the darkness of our animal spirits. 


Monday, December 14, 2015

Undeserved saliency of thoughts and words

It happens very often that we put great emphasis on what people think and express verbally. I can see it in philosophical texts where, for example, philosophers deliberate over what should be more important – expressed preferences or revealed preferences (see Decision Theory and Rationality by José Bermúdez, p. 64) – or while listening to people who keep on talking about their beliefs even in the absence of any decision-making problem these beliefs could influence.

My take on it is that what ultimately matters is behavior and decisions. Description of human mental processes can help us understand some aspects of human behavior but is only a part of the picture. For example, if we want to see what a person truly wants, the action and the actual choice should be taken into account rather than what the person says she wants, even, or especially, when the two are in conflict. Similarly, it does not matter what convoluted theories people come up with in order to explain how their thoughts interact with their behavior. If their behavior can be fully explained by a simpler theory, then the convoluted ones should be discarded.

But why? Why am I so eager to demean human thoughts? The reason is simple. Verbal processing and speech are devices that serve some evolutionary purposes. I do not believe that providing a perfect window into the operation of the human mind is one of these purposes. On the contrary, we know that a lot of mental processes are unconscious. The part of the brain responsible for verbal processing is not connected to all other parts of the brain that are responsible for making decisions. Therefore, we are unable to describe fully what is going on in our heads. Furthermore, there aren’t even reasons to believe that spoken words are a perfect window into the part of mind that is available to verbal processing. It may well be a dirty window obstructing the view or a distorting mirror.

To give you an analogy – imagine that human nature is a picture of a ruined city with a single nice flower in the foreground. What people say about their thoughts can give you access only to the part of the picture that has the little flower, probably seen through a distorting lens. It is not wise to draw a conclusion about what the entire picture represents based on this little image only. If you want to know what the human nature truly is, you must go beyond the verbal processing and look at the entire picture.

The best way to think about it is to ask yourself a question: what could I learn about humans (and how) if they could not speak? Or even better: how would I go about learning about an alien species that I have no clue how to communicate with and that may differ from me in every aspect? If you can think about humans and analyze them as an alien species, then you are well on your way to being objective in your analysis of human nature. But if you are focused on what people think is going on in their heads – then you may be bound for a dead end.

Wednesday, November 18, 2015

Big data regression

Problem formulation

You have a dataset in which each observation is an impression of a banner ad. You have a variable indicating success (say, click or conversion) and a lot of additional information (features), all of which are coded as binary: browser type, host, URL, placement, banner format, banner id, banner keywords, hosts the viewer has seen so far, how many times the viewer has seen particular banners, when did s/he see these banners, how many times did s/he click on which banners, how did s/he behave while shopping online, current date and time, geoid information, and so on.

The question is which features increase the chance of success and which decrease it? This is an important question if you want to allocate your advertising resources efficiently. The difficulty is that the number of observations is in billions and the number of available features is in millions.

A solution

A naïve approach is to create a table which records how many successes and how many failures occurred when a feature was present or absent. Then, you can compare the success ratio in the absence of the feature with the success ratio in its presence. If the latter is higher than the former, then the feature indicates a higher probability of success.
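To make this concrete, here is a minimal C++ sketch of such a per-feature success-ratio table; the sparse observation layout and all names are illustrative assumptions, not the system's actual interface:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Naive approach: for every feature, count impressions and successes with and
    // without the feature, then compare the two success ratios.
    struct Observation {
        std::vector<uint32_t> features;  // indices of features present (sparse)
        bool success;                    // click / conversion indicator
    };

    struct FeatureCounts {
        uint64_t with_feature = 0, success_with = 0;
    };

    void naive_table(const std::vector<Observation>& data, size_t k) {
        std::vector<FeatureCounts> c(k);
        uint64_t total = data.size(), total_success = 0;
        for (const Observation& o : data) {
            if (o.success) total_success++;
            for (uint32_t f : o.features) {
                c[f].with_feature++;
                if (o.success) c[f].success_with++;
            }
        }
        for (size_t f = 0; f < k; ++f) {
            uint64_t without = total - c[f].with_feature;
            uint64_t success_without = total_success - c[f].success_with;
            double r_with = c[f].with_feature ? (double)c[f].success_with / c[f].with_feature : 0.0;
            double r_without = without ? (double)success_without / without : 0.0;
            std::printf("feature %zu: %.6f (present) vs %.6f (absent)\n", f, r_with, r_without);
        }
    }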

This approach is similar to calculating a simple correlation between the feature and the success indicator. And thus, it suffers from endogeneity. If two features often occur together, say a particular host and a particular banner, and both of them seem to have a high correlation with the success indicator, you do not really know whether it is the banner that drives success, the host, or both.

In order to separate the effects of the features, you need to calculate partial correlations, conditional on the other features, rather than simple correlations. The straightforward way to do it is to perform an ordinary least squares regression on the data. Unfortunately, there exists no off-the-shelf software that could handle the amounts of data you have. Even if you limit the dataset to the most common features – say the top 5000 – you still end up with several terabytes of data to be processed by the regression algorithm. To focus attention, let us say that we need a way to perform a regression on n = 4 billion observations and k = 10 thousand features. If each value takes up 4 bytes, the amount of memory required to perform such an analysis equals nearly 160 terabytes.

Typically, linear least squares models are fit using an orthogonal decomposition of the data matrix. R's lm function, for example, uses QR decomposition. One can also use singular value decomposition. Unfortunately, these methods require all data to be kept in memory and have algorithmic complexity of O(nk²).

Alternatively, one can calculate the Gram matrix. This has algorithmic complexity of O(nk²), which can be reduced to O(np²) if the data are sparse (where p is the quadratic mean number of features per observation), and it is very easily parallelized. Another advantage is that the memory requirement for calculating the Gram matrix is only O(k²), and for k = 10000 the exact amount of RAM required to keep the Gram matrix would be just under 200 MB (keep in mind that the Gram matrix is symmetric). The only problem is that to calculate the regression coefficients, it is necessary to invert the calculated Gram matrix (which is often discouraged due to inferior numerical stability and takes O(k³)). The viability of this solution thus depends on whether it is possible to do it with satisfactory numerical accuracy. As it turns out, it is.
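As an illustration, here is a rough C++ sketch of how the Gram matrix and the success count vector (used later on) could be accumulated in a single streaming pass over sparse binary observations; the packed-triangle layout and all names are my own assumptions, not the production code:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Streaming accumulation of G = X'X and s = X'y for sparse binary data.
    // Everything is integer arithmetic, so there is no floating point error here.
    struct GramAccumulator {
        size_t k;
        std::vector<uint64_t> gram;     // packed upper triangle of the k x k matrix
        std::vector<uint64_t> success;  // X'y

        explicit GramAccumulator(size_t k_) : k(k_), gram(k_ * (k_ + 1) / 2), success(k_) {}

        // index of element (i, j), i <= j, in the packed upper triangle (row-major)
        size_t idx(size_t i, size_t j) const { return i * k - i * (i - 1) / 2 + (j - i); }

        // 'features' lists the indices that are 1 in this observation (no duplicates)
        void add(const std::vector<uint32_t>& features, bool success_flag) {
            for (size_t a = 0; a < features.size(); ++a) {
                if (success_flag) success[features[a]]++;
                for (size_t b = a; b < features.size(); ++b) {
                    size_t i = features[a], j = features[b];
                    if (i > j) std::swap(i, j);
                    gram[idx(i, j)]++;
                }
            }
        }
    };

Each observation with p lit-up features costs p(p+1)/2 increments, which is where the O(np²) figure comes from.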

Note that popular machine learning engines like Vowpal Wabbit are not of much use in this situation. Machine learning usually concentrates on prediction rather than accurate estimation of model parameters. Engines like VW are in principle less accurate than OLS. They allow multicollinearity of variables, which in turn forces the user to perform a separate data analysis in order to eliminate it in the first place. Finally, they do not allow for standard statistical inference on the model parameters.


The plan was to create a C++ class able to do all operations necessary for this regression. The data were stored on a remote Linux server using Hadoop. I was planning to develop and debug my solution using Microsoft Visual Studio 2015 on my Windows 7 64-bit Dell computer (i7-4790 @ 3.6 GHz with 16 GB RAM) and then to port it to its final destination.

There were four initial things I had to take care of: (1) a way of measuring code performance, (2) a way of measuring numerical accuracy of matrix inversion, (3) C++ libraries for inverting matrices, and (4) a strategy for verifying accuracy of the entire algorithm.

Boy, was it hard to find a good way to precisely measure code execution time on Windows. Unfortunately, the usually recommended GetTickCount() Windows API function relies on the 55 Hz clock and thus has a resolution of around 18 milliseconds. Fortunately, I eventually found out about the QueryPerformanceCounter() function, whose resolution is much better.
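For reference, a timing helper along these lines, using the standard Windows API calls, could look like the following sketch (not the exact code I used):

    #include <windows.h>

    // High-resolution timer based on QueryPerformanceCounter.
    class Stopwatch {
        LARGE_INTEGER freq_, start_;
    public:
        Stopwatch() { QueryPerformanceFrequency(&freq_); restart(); }
        void restart() { QueryPerformanceCounter(&start_); }
        double seconds() const {
            LARGE_INTEGER now;
            QueryPerformanceCounter(&now);
            return double(now.QuadPart - start_.QuadPart) / double(freq_.QuadPart);
        }
    };

    // Usage: Stopwatch sw; run_inversion(); printf("%.3f s\n", sw.seconds());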

Next, I decided to use the following measure for numerical precision of matrix inversion. Let us say that you need to invert matrix A. You use an inversion algorithm on it which generates matrix B. If matrix B is a perfect inverse of A, then AB = I, where I is the identity matrix. Hence, I calculate matrix C = AB – I. Then, I find the element of matrix C that has the highest absolute value and call it r. This is my measure of numerical precision. In the world of infinite precision, r = 0. In the real world r < 1e-16 is perfect (I use double – a 64 bit floating point type for my calculations). r < 1e-5 is still acceptable. Otherwise there are reasons to worry.

With tools for measuring performance and accuracy, I was able to start testing libraries. I initially turned to Eigen, which was very easy to install and use with my Visual Studio. Eigen uses LU decomposition for calculating the matrix inverse and was satisfying in terms of speed and reliability – up to the point when I tried to invert a 7000x7000 matrix. Eigen kept crashing and I could not figure out why. The second option was thus Armadillo. Armadillo did not have the same problems and worked well with bigger matrices all the way up to 10000x10000.

As it turns out, Armadillo can take advantage of the fact that the Gram matrix is symmetric and positive-definite. The inversion is done by means of Cholesky decomposition, and after a few experiments I realized that it is not only faster but also numerically more reliable than the LU-based method. I was able to invert a 10001x10001 matrix in 283 seconds (in a single thread) with r = 3.13e-14. The irony is that both Cholesky decomposition and matrix multiplication work in O(k³), but the latter is over twice as slow, so it takes much more time to check the numerical precision than to perform the actual inversion.
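A sketch of this step, assuming the Gram matrix has already been converted to an arma::mat (the function name and error handling are illustrative); it also recomputes the precision measure r defined earlier:

    #include <armadillo>
    #include <cstdio>

    // Cholesky-based inversion of the symmetric positive-definite Gram matrix,
    // followed by the r = max|G * G_inv - I| precision check described above.
    bool invert_gram(const arma::mat& G, arma::mat& G_inv) {
        if (!arma::inv_sympd(G_inv, G)) {
            std::printf("inversion failed: matrix is not positive definite\n");
            return false;
        }
        arma::mat C = arma::abs(G * G_inv - arma::eye(G.n_rows, G.n_cols));
        std::printf("r = %g\n", C.max());
        return true;
    }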

Finally, I designed a data generating process to test whether a least squares algorithm of my design can recover the parameters used to generate the data. Essentially, I created 10001 variables x_i for i = 0, 1, 2, …, 10000. x_0 = 1, always. For i > 0 we have P(x_i = 1) = 1/(3+i) = 1 – P(x_i = 0). Then, I created a vector of parameters b_i. b_0 = 0.0015 and, for any non-negative integer j, b_{4j+1} = 0.0001, b_{4j+2} = 0.0002, b_{4j+3} = 0.0003, and b_{4j+4} = -0.00005. Finally, P(y = 1) = x · b, where · indicates the dot product. This is a typical linear probability model.
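A sketch of this data generating process (the struct and function names, and the choice of random number generator, are my own assumptions):

    #include <cstdint>
    #include <random>
    #include <vector>

    // x_0 = 1 always; P(x_i = 1) = 1/(3+i) for i > 0; coefficients repeat the
    // 0.0001 / 0.0002 / 0.0003 / -0.00005 pattern; P(y = 1) = x . b.
    struct SyntheticObservation {
        std::vector<uint32_t> features;  // indices of features equal to 1
        bool y;
    };

    SyntheticObservation draw(std::mt19937_64& rng, size_t k = 10001) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        static const double pattern[4] = {0.0001, 0.0002, 0.0003, -0.00005};
        SyntheticObservation obs;
        double p = 0.0015;               // b_0 * x_0
        obs.features.push_back(0);       // the constant feature
        for (size_t i = 1; i < k; ++i) {
            if (u(rng) < 1.0 / (3.0 + i)) {
                obs.features.push_back(static_cast<uint32_t>(i));
                p += pattern[(i - 1) % 4];
            }
        }
        obs.y = (u(rng) < p);
        return obs;
    }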

Using the formula above I generated 4 billion observations (it took 11 days on 4 out of 8 cores of my Windows machine) and fed them into the regression algorithm. The algorithm was able to recover vector b with the expected convergence rate. Note that by design the aforementioned data generating process creates variables that are independently distributed. I thus had to tweak this and that to see whether the algorithm could handle correlated features as well as to investigate the bias (see more about that in the last section).

Statistical inference

The question of how to recover the model parameters from the data is simple. In addition to the Gram matrix, you need a success count vector. The i-th element of this vector indicates how many successes there were when the i-th feature was present. Calculating this vector is at most O(np) in time and requires O(k) memory (note that none of the operations involved in calculating the Gram matrix and the success count vector are floating point operations – this is all integer arithmetic, since we operate on binary variables only; thus both the Gram matrix and the success count vector are calculated with perfect numerical precision). Once you have them both, you need to invert the Gram matrix and multiply it by the success count vector. The resulting vector contains the estimated model parameters.
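In code, once the Gram matrix has been inverted as described above and the counts copied into Armadillo types, this final step is a one-liner (again a sketch with illustrative names):

    #include <armadillo>

    // b = (X'X)^{-1} X'y, where G_inv is the inverted Gram matrix and s is the
    // success count vector converted to floating point.
    arma::vec estimate_coefficients(const arma::mat& G_inv, const arma::vec& s) {
        return G_inv * s;
    }

Keeping the explicit inverse around (rather than calling a solver) is deliberate here: its diagonal is reused for the standard errors below.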

However, getting standard errors of the estimated coefficients is a bit more complicated. Typically, we would take the square roots of the diagonal elements of the inverted Gram matrix and multiply them by the standard deviation of the residuals. The problem is that calculating residuals requires going through all observations all over again. This not only increases the execution time; it also poses a major technical difficulty, as it requires the dataset to be invariant for the duration of the algorithm execution (which is assumed to be at least several hours). To fix this, one would have to tinker with the data flow in the entire system, which can greatly inflate the project's costs.

Fortunately, there is a trick that can rescue us here. Instead of the quadratic mean of the residuals, one can use the standard deviation of the success variable. Note that the latter must be greater than the former: the former is the quadratic mean of the residuals for the entire model and the latter is the quadratic mean of the residuals for a model with a constant only. This guarantees that the standard errors will be overestimated, which is much better than having them underestimated or all over the place. Moreover, for a small average success ratio, the two will be close. In fact, it is easy to show that under some plausible conditions, as the average success ratio goes to zero, the two are the same in the limit. And for banner impressions, the average success ratio (e.g. CTR) is, no doubt, small.
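A sketch of these conservative standard errors, assuming the total number of observations and the total number of successes were tracked while streaming the data (for a binary success variable the standard deviation is sqrt(ybar * (1 - ybar))):

    #include <armadillo>
    #include <cmath>

    // Conservative standard errors: sd(y) * sqrt(diag(G_inv)) instead of the
    // residual standard deviation, as described above.
    arma::vec conservative_se(const arma::mat& G_inv, double total_successes, double n) {
        double ybar = total_successes / n;
        double sd_y = std::sqrt(ybar * (1.0 - ybar));
        arma::vec d = G_inv.diag();
        return sd_y * arma::sqrt(d);
    }
    // A 95% confidence interval for coefficient i is then roughly b(i) +/- 1.96 * se(i).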

No amount of theoretical divagations can replace an empirical test. It is thus necessary to check ex post whether statistical inference using the above simplifications is indeed valid. To do that, I estimate a number of models (keep in mind that I have 10000 variables) and check how frequently the true coefficients fall within the estimated 95% confidence intervals. I expect them to be there slightly more often than 95% of the time (due to the overestimation of standard errors), and indeed, this is what I find.

Finally, I cannot write a section about statistical inference without bashing p-values and t-statistics. I strongly discourage you from using them. A single number is often not enough to facilitate good judgment about the estimated coefficient. p-value typically answers a question like: “how likely is it, that the coefficient is on the opposite side of zero?” - Is this really what you want to know? The notion of statistical significance is often misleading. You can have a statistically insignificant coefficient whose confidence interval is so close to zero that any meaningful influence on the dependent variable is ruled out: you can then say that your data conclusively show that there is no influence (rather than that the data do not show that there is influence). Also, you can have a statistically significant coefficient with very high t-statistic, which is economically insignificant or economically significant but estimated very imprecisely. Thus, instead of p-values and t-statistics I suggest using confidence intervals. The question they answer is: what are the likely values of the coefficient? And this is what you actually want to know most of the time.

Data refinements

Oops. You have a nice OLS algorithm which supports valid statistical inference. You tested it with your generated data and it works fine. Now you apply it to real data and the Gram matrix does not want to invert, or inverts with precision r > 1. You quickly realize that it is because the data contain a lot of constant, perfectly correlated, and multicollinear variables. How to deal with that?

Sure, you can force users to limit themselves only to variables which are neither perfectly correlated nor multi-collinear. But when they are using thousands of variables, it may take a lot of effort to figure it out. Also, running an algorithm for several hours only to learn that it fails because you stuffed it with a bad variable (and it does not tell you which one is bad!) simply does not seem right. Fortunately, as it turns out, all these problems can be fixed with analysis and manipulations on the already-calculated Gram matrix.

The first refinement I suggest is dropping features that are present too few times (e.g. fewer than 1000). You can find them by examining the diagonal entries of the Gram matrix. To drop a variable, you can just delete the appropriate row and column from the Gram matrix as well as the corresponding entry from the success count vector. After such a delete operation, what you are left with is the same as if you had not considered the deleted variable to begin with. Clear cut.

The second refinement I suggest is to drop features with not enough variability. Based on the Gram matrix and the success count vector, it is possible to construct a variability table for every feature (the same one I described as the naïve solution at the beginning of the article). This table has two rows and two columns – rows indicate whether there was a success and columns indicate whether the feature was present. Each cell contains a number of observations: the number of observations that had the feature and a success, the number that had the feature but no success, the number without the feature but with a success, and the number with neither the feature nor a success. I drop features for which at least one of the four cells has a value lower than 10.
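For concreteness, here is how the four cells could be reconstructed for feature i from quantities already at hand (a sketch; the total observation count n and the total success count are assumed to be tracked during the streaming pass):

    #include <cstdint>

    // 2x2 variability table for feature i, rebuilt from the Gram matrix diagonal
    // entry G_ii, the success count s_i, and the global totals.
    struct VariabilityTable {
        uint64_t feature_success;        // feature present, success
        uint64_t feature_no_success;     // feature present, no success
        uint64_t no_feature_success;     // feature absent, success
        uint64_t no_feature_no_success;  // feature absent, no success
    };

    VariabilityTable variability(uint64_t gram_ii, uint64_t success_i,
                                 uint64_t n, uint64_t total_successes) {
        VariabilityTable t;
        t.feature_success       = success_i;
        t.feature_no_success    = gram_ii - success_i;
        t.no_feature_success    = total_successes - success_i;
        t.no_feature_no_success = n - gram_ii - (total_successes - success_i);
        return t;
    }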

As we proceed to the third refinement, note that you can easily calculate the correlation between any two features based on the contents of the Gram matrix. Just write out the formula for correlation and simplify it, knowing that you are dealing with binary variables, to realize that all the information you need is in the Gram matrix. This of course allows you to identify all pairs of perfectly correlated or highly correlated variables in O(k²) time. I got rid of a variable if I saw a correlation whose absolute value exceeded .99 (using, say, .95 instead of .99 did not dramatically improve the speed or numerical precision of the algorithm).
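Writing out that simplification, the correlation between binary features i and j depends only on three Gram matrix entries and the total observation count n (which, if a constant feature is included, is simply its diagonal entry); a sketch:

    #include <cmath>
    #include <cstdint>

    // Pearson correlation of two binary features from Gram matrix counts:
    // g_ii and g_jj count observations where each feature is present,
    // g_ij counts observations where both are present, n is the total count.
    double binary_correlation(uint64_t g_ii, uint64_t g_jj, uint64_t g_ij, uint64_t n) {
        double num = double(n) * g_ij - double(g_ii) * g_jj;
        double den = std::sqrt(double(g_ii) * (n - g_ii)) * std::sqrt(double(g_jj) * (n - g_jj));
        return den > 0.0 ? num / den : 0.0;
    }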

But now comes a biggie. How to find features that are perfectly multicollinear? One naïve approach is to try to find all triples of such variables and test them for multicollinearity, then all quadruples, quintuples, and so on. The trouble is that finding all n-tuples takes O(k^n) time, which is a nightmare. Alternatively, you can try to invert submatrices: if you can invert a matrix made up of the first p rows and columns of the original Gram matrix, but you cannot invert a matrix made up of the first p+1 rows and columns, it surely indicates that variable number p+1 causes our Gram matrix to be singular. But this solution has a complexity of O(k⁴), which for high k may be very cumbersome. There must be a better way.

As it turns out, a better way is to perform a QR decomposition of the Gram matrix (not to be confused with the QR decomposition of the data matrix that is part of the standard linear least squares algorithm). The diagonal elements of the R matrix are of interest to us – a zero indicates that a variable is causing problems and needs to be eliminated. QR decomposition generates the same results as the “invert submatrices” algorithm described above – but it runs in O(k³). And, of course, it is good practice to check its numerical precision in a similar way to how we checked the numerical precision of the matrix inversion algorithm.
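A sketch of this check with Armadillo (the tolerance and the relative-scale test are my own assumptions; the criterion in the text is simply a zero on the diagonal of R):

    #include <armadillo>
    #include <vector>

    // Flag features whose R diagonal entry is (numerically) zero, i.e. features
    // that are linear combinations of the preceding ones.
    std::vector<arma::uword> multicollinear_features(const arma::mat& G, double tol = 1e-8) {
        arma::mat Q, R;
        arma::qr(Q, R, G);
        arma::vec d = R.diag();
        d = arma::abs(d);
        double scale = d.max();
        std::vector<arma::uword> bad;
        for (arma::uword i = 0; i < d.n_elem; ++i)
            if (d(i) < tol * scale) bad.push_back(i);
        return bad;
    }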

Finally, note that you can sort the Gram matrix using its diagonal entries. I sort it in descending order so that the features that get eliminated are always the features which occur less frequently. It may be possible to achieve higher or lower numerical precision by sorting the Gram matrix, but I have not investigated this issue extensively. I only noticed that in some instances sorting the Gram matrix in ascending order made the LU inversion algorithm fail (too high r), while sorting in descending order or not sorting did not affect the LU algorithm much.

All these operations require some effort to keep track of which variables were eliminated and why, and especially how variables in the final Gram matrix (the one undergoing inversion) map to the initial variables before the refinements. However, the results are worth the effort.

Integration and application

The task of integrating new solutions with legacy systems may be particularly hard. Fortunately, in my case, there already existed data processing routines that fed off the same input I needed (that is, a stream of observations in a sparsity-supporting format – a list of “lit-up” features), as well as input generating routines that filtered the original data with given observation and feature selectors.

I had a shared terminal session (using Screen) with the people responsible for maintaining the C++ code for the analyses done on these data to date. We were able to link my class into the current setup so that users can use the same interface to run the regression that they previously used to do other types of analyses. Later on, I had to do some debugging to account for unexpected differences in data format, but ultimately everything went well.

The first real data fed to the algorithm had 1.16 billion observations and 5050 features. Calculating the Gram matrix and the success count vector took around 7 hours. Due to the refinements, the number of features was reduced to 3104. Inverting the matrix took just a few seconds, and the achieved precision was around 2e-7.


In this final section I would like to discuss three potential problems that do not have easy solutions: variable cannibalization, bias, and causality.

It often happens that a number of available features refer to essentially the same thing. For example, you may have features indicating that a person did not see this banner in the past minute, 5 minutes, hour, or day. These features will be correlated and they have a clear hierarchy of implication. A user can attempt to run a regression using all these features, expecting that the chance of success will be a decreasing function of the number of impressions. However, the effect of a viewer who has never seen the banner will not be attributed entirely to any single one of the aforementioned features. Instead, it will be split among them, making the estimated coefficients hard to interpret. This is the essence of cannibalization – similar variables split the effect they are supposed to pick up, and therefore none of them has the coefficient it should have (please let me know if you are aware of a better term than “cannibalization”). The simple but somewhat cumbersome remedy is to manually avoid using features with similar meanings in one regression.

Secondly, it is widely known that the linear probability model generates bias. The biased coefficients are usually closer to zero than they should be. To see why, consider a feature whose effect is to increase the probability of success by 10%. However, this feature often occurs together with another feature whose presence drives the probability of success to -25% (that is, zero). The presence of the feature in question can then at best increase the probability to -15% (that is, still zero). As a result, the feature in question does not affect the outcome in some sub-population due to the negative predicted probability. Its estimated effect is thus smaller (closer to zero) than the expected 10%.

Note that the reason why linear probability model generates biased results is not because the regression algorithm is flawed but because the model specification is flawed. The P(y = 1) = x * b model equation is incorrect if x * b is smaller than zero or bigger than one because probability by definition must be between 0 and 1. Whenever x * b is outside these bounds, the coefficients end up being biased. That is, OLS correctly estimates partial correlation between independent and dependent variables, but, due to data truncation, partial correlation is not what is needed to recover the linear probability model parameters.

The resolution of this issue may go towards assuming that the model specification is correct and finding ways to alleviate the bias, or at least towards identifying features whose coefficients may be biased. On the other hand, it may also be possible to assume that the linear probability specification is incorrect and to investigate whether partial correlation is what is really needed for the decision problems the estimates are supposed to help with. I consider solving this problem an issue separate from the main topic of this article and I leave it at that.

Finally, I would like to make a note on causality. Partial correlation, like any correlation, does not imply causation. Therefore, it may turn out that a particular feature does not have a causal effect on the probability of success but is instead correlated with an omitted variable which is the true cause of the change in the observable behavior. For example, one host can have a higher conversion ratio than the other. However, the reason may be that the advertised product is for females only. The population of females may be much smaller for the second host even though a higher fraction of them buys the product. In such a case the second host is actually better at selling the product (that is, it is better to direct generic traffic to the second host rather than to the first one), but this information is obscured by the inability to distinguish between male and female viewers. It is thus important to remember that the regression provides us only with partial correlations rather than proofs of causality.

The issue of causality is of extreme importance when we are trying to predict effects of policy (like redirecting traffic in the example above). However, when instead of policy effects, we are interested in predictions, partial correlation seems to be a sufficient tool. For example, you may want to know whether people using Internet Explorer are more likely to click on a banner, even though you do not have the ability to influence what browser they are using. In such situations establishing causality is not necessary.