Many statistical techniques are useful for both prediction and explanation in the world of advanced analytics. However, some techniques are better suited to one than the other. For example, Mixed Linear Models are used primarily for explanation. Predictive techniques come from two ‘streams’ of innovation, statistics and machine learning, although the distinction between them is becoming less clear over time as practitioners of each borrow from the other.
The online chapter of ‘Advanced Analytics Methodologies: Driving Business Value with Analytics,’ by Michele Chambers and Thomas W Dinsmore discusses each of these two ‘streams’ in turn, offering examples of each. The two streams have very different legacies.
Statistical methods, for example linear regression, use the known properties of statistical distributions to estimate the parameters of mathematical models. These models have the advantage of being generalizable: if you can demonstrate that the historical data conform to a known distribution, then this information can be used to predict behavior for new cases.
The analogy is this: just as you can predict the landing spot of an artillery shell given its starting position, velocity and acceleration, so too can you predict the response to a marketing campaign from information about a customer’s past shopping habits, demographic characteristics and so on. The only limit is your ability to show that the behavior follows a known statistical distribution. This is the major disadvantage, since all too often real-world behavior does not conform to any such distribution.
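The statistical approach described above can be sketched with a one-variable linear regression. This is only a minimal illustration, not a method taken from the chapter: the function name fit_line, the toy figures and the variable names are all hypothetical, and the data are deliberately noise-free so that the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Estimate intercept and slope by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical history (assumption): past purchases per customer and
# the amount each customer spent in response to a campaign.
past_purchases = [1, 2, 3, 4, 5]
campaign_spend = [12.0, 14.0, 16.0, 18.0, 20.0]

intercept, slope = fit_line(past_purchases, campaign_spend)

# Use the estimated parameters to predict a new case: 6 past purchases.
predicted_spend = intercept + slope * 6
print(intercept, slope, predicted_spend)
```

Real campaign data would be noisy, and the prediction is only as trustworthy as the assumption that the relationship really is linear with well-behaved errors, which is exactly the limitation noted above.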
Machine learning techniques are fundamentally different in one major way: they do not start from a particular hypothesis, but seek to learn and describe the relationship between historical data and the target behavior as closely as possible. Since they are not constrained by any specific statistical distribution, they are often more accurate than their statistical counterparts.
On the other hand, machine learning models can ‘overlearn’ (overfit), meaning they can learn relationships from their training data that do not generalize to the wider world. Controlling and limiting this phenomenon requires built-in safeguards, such as cross-validation or pruning on an independent sample.
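Overlearning and one of its controls, cross-validation, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the synthetic data, the choice of a nearest-neighbour model, the candidate values of k and the helper names are hypothetical and do not come from the chapter. A model with k = 1 simply memorises its training data, so its training error is zero, while its error on held-out folds is not.

```python
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, test, k):
    """Mean squared error of a k-NN model built on train, scored on test."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)


def cross_validated_mse(data, k, folds=5):
    """Average error over `folds` disjoint held-out slices of the data."""
    errors = []
    for i in range(folds):
        held_out = data[i::folds]
        training = [p for j, p in enumerate(data) if j % folds != i]
        errors.append(mse(training, held_out, k))
    return sum(errors) / folds


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic data (assumption): a linear trend plus random noise.
data = [(x, 2.0 * x + rng.gauss(0, 3.0)) for x in range(60)]
rng.shuffle(data)

candidates = (1, 5, 15)
train_error = {k: mse(data, data, k) for k in candidates}
cv_error = {k: cross_validated_mse(data, k) for k in candidates}

# Choose the model by its held-out error, not its training error.
best_k = min(cv_error, key=cv_error.get)
print(train_error, cv_error, best_k)
```

Judged by training error alone, k = 1 looks perfect; the cross-validated error is the figure that reflects how the model would behave on new cases.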
Some techniques, such as linear regression in the ‘statistical stream,’ are well understood, widely used and broadly available. Other methods, such as deep learning in the ‘machine learning stream,’ are relatively new: scientists are still in the process of understanding the technique’s limits, and software implementations are rare.
For business users, it is not necessary to understand the technical details of each technique. It is enough to focus on two principles. First, experimentation with a broad spectrum of techniques is often required. Second, while theoretical limits are interesting academically, actual performance in application should be the sole measure of any model.