Wednesday, September 12, 2018

Simpson's Paradox

“There are three types of lies -- lies, damn lies, and statistics.” ~ Benjamin Disraeli


Simpson's paradox is a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. Understanding this concept is important for interpreting data correctly. Edward H. Simpson first addressed this phenomenon in a technical paper in 1951, but Karl Pearson et al. in 1899 and Udny Yule in 1903 had mentioned a similar effect earlier.

For example, suppose you and a friend each solve problems over a weekend, and your friend answers a higher proportion correctly than you on both days that you've competed. Does that mean your friend has answered a higher proportion correctly overall? Not necessarily!


  • On Saturday, you solved 7 out of 8 attempted problems, but your friend solved 2 out of 2. You had solved more problems, but your friend pointed out that he was more accurate, since 2/2 > 7/8. Fair enough.
  • On Sunday, you only attempted 2 problems and got 1 correct. Your friend got 5 out of 8 problems correct. Your friend gloated once again, since 5/8 > 1/2.
However, the competition is about who solved more accurately over the whole weekend, not on individual days. Overall, you solved 8 out of 10 problems (80%) whereas your friend solved 7 out of 10 (70%). Thus, despite your friend solving a higher proportion of problems on each day, you actually won the challenge by solving the higher proportion for the entire weekend.
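A few lines of Python make the reversal concrete. The counts below are illustrative (7/8 and 2/2 on Saturday, 1/2 and 5/8 on Sunday) -- any self-consistent numbers with the same pattern would do:

```python
# Illustrative problem counts: (solved, attempted) per day
you    = {"sat": (7, 8), "sun": (1, 2)}
friend = {"sat": (2, 2), "sun": (5, 8)}

def accuracy(solved, attempted):
    return solved / attempted

# Per-day accuracy: the friend wins both days
assert accuracy(*friend["sat"]) > accuracy(*you["sat"])   # 1.000 > 0.875
assert accuracy(*friend["sun"]) > accuracy(*you["sun"])   # 0.625 > 0.500

# Combined accuracy over the weekend: the trend reverses
you_total    = accuracy(7 + 1, 8 + 2)    # 0.8
friend_total = accuracy(2 + 5, 2 + 8)    # 0.7
assert you_total > friend_total
```

The trick is that the two daily fractions are weighted very differently for each player, so the pooled fractions need not preserve the daily ordering.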
Simpson's paradox can also arise in correlations, in which two variables appear to have (say) a positive correlation with one another, when in fact they have a negative correlation, the reversal having been brought about by a "lurking" confounder. When two different contexts compel us to take two opposite actions based on the same data, our decision must be driven not by statistical considerations, but by additional information extracted from the context. Berman et al. give an example from economics, where a dataset suggests overall demand is positively correlated with price (that is, higher prices lead to more demand), in contradiction of expectation. Analysis reveals time to be the confounding variable: plotting both price and demand against time reveals the expected negative correlation within each period, which then reverses to become positive if the influence of time is ignored by simply plotting demand against price.
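The price-demand reversal is easy to reproduce synthetically (the coefficients below are made up purely for illustration): within each time period demand falls as price rises, yet both drift upward over time, so pooling the periods flips the sign of the correlation:

```python
import numpy as np

n_periods, n_per = 5, 20
prices, demands, within_corrs = [], [], []
for k in range(n_periods):
    u = np.linspace(0, 3, n_per)       # within-period price variation
    p = 10 * k + u                     # prices drift upward over time
    d = 50 + 15 * k - 2 * u            # demand rises over time, falls with price
    within_corrs.append(np.corrcoef(p, d)[0, 1])
    prices.append(p)
    demands.append(d)

pooled = np.corrcoef(np.concatenate(prices), np.concatenate(demands))[0, 1]

assert all(c < 0 for c in within_corrs)  # negative correlation in every period
assert pooled > 0                        # positive correlation when time is ignored
```

The upward drift across periods dominates the pooled covariance, masking the negative within-period relationship -- exactly the confounding described above.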

Monday, August 27, 2018

Classification and Regression Trees

Classification and Regression Trees, also known as CART, refers to decision tree algorithms that can be used for classification or regression predictive models. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. In other words, building a CART model involves selecting input variables and split points on those variables until a suitable tree is constructed. The representation of a CART model is a binary decision tree. A good thing about CART in terms of data is that it does not require any special data preparation other than a good representation of the problem.

Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost.
For classification, the CART algorithm uses the Gini index function, which provides an indication of how "pure" the leaf nodes are (how mixed the training data assigned to each node is).

Regression trees are designed for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values.
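As a rough sketch, the Gini index for a node can be computed from the class proportions of the training samples assigned to it (a generic illustration, not any particular library's implementation):

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions.

    0.0 means the node is pure (one class); higher means more mixed.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

gini_index(["a", "a", "a", "a"])   # pure node -> 0.0
gini_index(["a", "a", "b", "b"])   # maximally mixed two-class node -> 0.5
```

During tree construction, candidate splits are scored by the weighted Gini impurity of the child nodes they would create, and the split with the lowest impurity wins.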

Advantages of CART

  • Simple to understand, interpret, and visualize.
  • Decision trees implicitly perform variable screening or feature selection.
  • Can handle both numerical and categorical data. Can also handle multi-output problems.
  • Decision trees require relatively little effort from users for data preparation.
  • Nonlinear relationships between parameters do not affect tree performance.


Tuesday, August 7, 2018

Essential Machine Learning Algorithms

"An algorithm must be seen to be believed." ~ Donald Knuth

For anyone new to data science, the first problem they face is which algorithms to learn. There are a ton of machine learning algorithms they could learn, so where should they start? Here is a list of the most essential machine learning algorithms to start with. It is not the most comprehensive list of algorithms, but it's just enough to get you started.

Supervised Learning Algorithms

Unsupervised Learning Algorithms

  1. Clustering
  2. Visualization and Dimensionality Reduction

Monday, July 30, 2018

ML Quick Bites : XGBoost

XGBoost stands for eXtreme Gradient Boosting. It was developed by Tianqi Chen and is now part of a wider collection of open-source libraries developed by the Distributed Machine Learning Community (DMLC).

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
Important features of the implementation include handling of missing values (sparsity-aware split finding), a block structure to support parallelization of tree construction, and the ability to fit and boost on new data added to a trained model (continued training).

Algorithm

It implements the gradient boosted decision tree algorithm. Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.

Gradient boosting is an approach where new models are created that predict the errors of previous models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
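The idea can be sketched from scratch with one-split regression trees ("stumps") and squared-error loss: each round fits a stump to the current residuals, which are the negative gradient of the loss. This is a toy illustration of gradient boosting in general, not how the XGBoost library itself is implemented:

```python
import numpy as np

def fit_stump(x, target):
    """Fit a one-split regression tree minimizing squared error."""
    best = None
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = target[x <= t], target[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        err = ((target - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]                             # (threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Squared-error gradient boosting: each new stump fits the residuals."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        residuals = y - pred                    # negative gradient of squared loss
        t, lv, rv = fit_stump(x, residuals)
        pred += lr * np.where(x <= t, lv, rv)   # shrink each stump's contribution
    return pred

x = np.arange(10, dtype=float)
y = (x >= 5).astype(float)                      # step function to learn
pred = gradient_boost(x, y)
mse = ((y - pred) ** 2).mean()                  # shrinks toward 0 with more rounds
```

The learning rate `lr` shrinks each stump's contribution, trading more boosting rounds for better generalization -- the same knob exposed by gradient boosting libraries.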

Tasks

1. Binary Classification
2. Multi-class Classification
3. Regression
4. Learning To Rank

Pros

1. Execution Speed
XGBoost is really fast compared to other implementations of gradient boosting.

2. Model Performance
XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.

3. Handling Missing Values
XGBoost has a built-in routine to handle missing values. When it encounters a missing value at a node, XGBoost tries both branch directions and learns which path to take for missing values in the future.

4. Built-in Cross Validation
XGBoost allows the user to run a cross-validation at each iteration of the boosting process, making it easy to find the exact optimum number of boosting iterations in a single run.
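The missing-value handling above can be sketched roughly as follows: when evaluating a split, route the rows with missing values down each branch in turn and keep the direction that gives the lower loss. This is a simplified pure-Python illustration of the idea using squared error; the library's actual sparsity-aware algorithm does this inside its split search:

```python
def choose_missing_direction(x, y, threshold):
    """For one candidate split, decide whether rows with missing x go left or right.

    x may contain None for missing values; y holds regression targets.
    Returns the cheaper direction and its squared error.
    """
    left    = [yi for xi, yi in zip(x, y) if xi is not None and xi <= threshold]
    right   = [yi for xi, yi in zip(x, y) if xi is not None and xi > threshold]
    missing = [yi for xi, yi in zip(x, y) if xi is None]

    def sse(groups):
        # squared error when each group is predicted by its own mean
        total = 0.0
        for g in groups:
            if g:
                m = sum(g) / len(g)
                total += sum((v - m) ** 2 for v in g)
        return total

    err_left  = sse([left + missing, right])
    err_right = sse([left, right + missing])
    return ("left", err_left) if err_left <= err_right else ("right", err_right)

# A missing row whose target matches the left group is routed left
direction, err = choose_missing_direction([1, 2, None, 8, 9], [0, 0, 0, 10, 10], 5)
```

The learned "default direction" is stored per node, so at prediction time missing values follow the branch that fit the training data best.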

Learn more about XGBoost

1. A Gentle Introduction to XGBoost
2. XGBoost: A Scalable Tree Boosting System
3. Trevor Hastie - Gradient Boosting Machine Learning
4. How to Develop Your First XGBoost Model in Python with scikit-learn

Sunday, July 22, 2018

The Origin Of Reason

"If we think that we have reasons for what we believe, that is often a mistake." Prof Daniel Kahneman, Princeton University

We can all reason from our childhood onwards -- but how? When we think about how we came up with the reasoning to justify something, we don't get very far, because we have no definite explanation. We arrive at a reason for an idea when we want to justify the idea and believe that it's correct. We don't rely on the laws of logic or probability -- we reason by thinking about what's possible, and by seeing what is common to the possibilities.

In an article at Edge.org, there's a great conversation with Mercier, now a post-doc at Penn. Mercier begins by explaining how the argumentative theory of human reason can explain confirmation bias:

Psychologists have shown that people have a very, very strong, robust confirmation bias. What this means is that when they have an idea, and they start to reason about that idea, they are going to mostly find arguments for their own idea. They're going to come up with reasons why they're right, they're going to come up with justifications for their decisions. They're not going to challenge themselves.
And the problem with the confirmation bias is that it leads people to make very bad decisions and to arrive at crazy beliefs. And it's weird, when you think of it, that humans should be endowed with a confirmation bias. If the goal of reasoning were to help us arrive at better beliefs and make better decisions, then there should be no bias. The confirmation bias should really not exist at all.
But if you take the point of view of the argumentative theory, having a confirmation bias makes complete sense. When you're trying to convince someone, you don't want to find arguments for the other side, you want to find arguments for your side. And that's what the confirmation bias helps you do.

The idea here is that the confirmation bias is not a flaw of reasoning, it's actually a feature. It is something that is built into reasoning; not because reasoning is flawed or because people are stupid, but because actually people are very good at reasoning — but they're very good at reasoning for arguing. Not only does the argumentative theory explain the bias, it can also give us ideas about how to escape the bad consequences of the confirmation bias. 

The meaning of reason is logical defense. But that's more about why we reason; the other question is how we reason. It could be that we all have some pre-formed mindset about something, and the next time we encounter the same situation, without knowing it, we fall back on the reason we have already justified for ourselves. For example, when we go online to shop for green tea and don't know what brand or type we want, we start comparing different teas by price and by reviews from people who have already bought the product. Even though we have no experience in buying green tea, we find ourselves some form of reason to believe that whatever we've selected is the best. The next time we go to a supermarket to buy green tea, we already carry that reason with us and know what to look for. If someone points out that we should take something else, we come up with all kinds of reasons why the product we've selected is the best. In the act, we try to persuade other people that what we believe is true.

Sunday, July 15, 2018

Why We Ignore The Threats

How many times do we delay a report that has to be submitted until the deadline is near? Or the time when you knew the engine oil in your car had to be changed, but you decided it wasn't a danger and ended up paying far more in repairs than the cost of an oil change. We all do it, but sometimes the obvious things we ignore can be catastrophic, not just for us but also for the people around us. Why don't we act on threats immediately, when doing so could save us a lot of trouble? One reason is that we don't want to imagine the outcome that will follow if the threats are not handled when they should have been. People tend to procrastinate when they are unable to see the outcome that will follow. Assuming the process will go as we have planned can be dangerous for our career and society. It's human nature to overestimate our own predictions and act accordingly.

Being able to analyze and act on a threat is the best thing anyone can do for themselves. When we act on threats that are going to be catastrophic in the future, we take charge into our own hands. Some threats may have no visible consequences in the near future, but that doesn't mean they will not be catastrophic in the end. There is a Persian proverb:

He who knows not, and knows not that he knows not, is a fool; shun him.
He who knows not, and knows that he knows not, can be taught; teach him.
He who knows, and knows not that he knows, is asleep; wake him.
He who knows, and knows that he knows, is a prophet; follow him.


