Laserlike

Using data to make predictions.

November 8, 2008 · 10 Comments

Companies invest enormous sums of money in data warehouses.  They then spend much more on various business intelligence tools and training.  I have used many of these tools in previous roles, and basic use of the Data Analysis package in MS Excel is far more advanced than what I have seen used at most companies.

1.  Identify correlations within your data.  

Many companies’ data mining efforts bring to mind Jorge Luis Borges’ Library of Babel.  Raw data is worthless.  While this statement is self-evident, it’s clearly not appreciated, as most firms make poor use of their data warehouses.  Business intelligence should help us make predictions about the future.  A good place to start would be to identify correlations in data and then to extract functions (e.g., sales increase x% when customers spend y minutes on page z of our web site).  

To start, simply take whatever data you have and do a scatter plot in Excel.  If it looks like the data “lines up”, run a regression to extract the function (on Excel for Windows, use the Data Analysis tool; on the latest Excel for Mac you will need to do the least squares work manually).  Brief tangent – I would love to see someone build a simple “correlation machine” that could ingest any CSV file, find the best correlations within the data, and then spit out functions in descending order based on correlation coefficients.  
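A rough sketch of what such a “correlation machine” might look like, in Python using only the standard library. The function names (load_numeric_columns, fit_line, rank_correlations) and the sample data are made up for illustration. The sketch assumes a CSV with a header row, considers only pairs of fully numeric columns, and fits only straight lines, so treat it as a starting point rather than a finished tool.

```python
import csv
import io
import math
from itertools import combinations


def load_numeric_columns(csv_text):
    """Parse CSV text and keep only the columns whose values are all numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    columns = {}
    for name in rows[0]:
        try:
            columns[name] = [float(row[name]) for row in rows]
        except (ValueError, TypeError):
            continue  # skip non-numeric or incomplete columns
    return columns


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


def rank_correlations(csv_text):
    """Return (x_name, y_name, slope, intercept, r) tuples, strongest |r| first."""
    columns = load_numeric_columns(csv_text)
    results = []
    for x_name, y_name in combinations(columns, 2):
        fit = fit_line(columns[x_name], columns[y_name])
        if fit is not None:
            results.append((x_name, y_name, *fit))
    return sorted(results, key=lambda row: abs(row[4]), reverse=True)


# Made-up sample data, for illustration only.
SAMPLE = """minutes_on_page,sales,visits
1,12,40
2,14,35
3,16,50
4,18,20
5,20,45
"""

for x_name, y_name, slope, intercept, r in rank_correlations(SAMPLE):
    print(f"{y_name} = {slope:.2f} * {x_name} + {intercept:.2f}   (r = {r:.2f})")
```

In the sample, sales rises by exactly 2 for every extra minute on the page, so that pair comes out on top with r = 1. Real data will be noisier, and a high coefficient from a scan like this is only a candidate worth testing.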

2.  Use randomized testing to identify causation.

Correlation does not equal causation.  The best way to find out if a correlation is meaningful for predictions is to do A/B testing.  If we send 50% of a small group of customers to page z1 of our site, sales are x dollars.  If we send the other 50% of that small test group to page z2, sales are 1.5x dollars.  If the function can make predictions, great.  If not, go back to #1 and try again.
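One hedged way to check whether a gap like that is real or just luck is a permutation test. The sketch below uses Python's standard library only; the function names, the customer IDs, and the sales figures are all invented for illustration, and the permutation test is my choice of method here, not the only reasonable one. It randomly splits customers into two groups, then asks how often a random relabeling of the observed sales would produce a gap at least as large as the one we saw.

```python
import random
from statistics import mean


def assign_groups(customer_ids, seed=0):
    """Randomly split customers into two halves (group A sees z1, group B sees z2)."""
    rng = random.Random(seed)
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(sales_a, sales_b, rounds=10_000, seed=0):
    """Share of random relabelings whose gap in mean sales is >= the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(sales_b) - mean(sales_a))
    pooled = list(sales_a) + list(sales_b)
    n_a = len(sales_a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[n_a:]) - mean(pooled[:n_a]))
        if gap >= observed:
            hits += 1
    return hits / rounds


# Hypothetical test group and per-customer sales, for illustration only.
group_a, group_b = assign_groups(range(16))
sales_z1 = [10, 0, 12, 0, 9, 11, 0, 10]    # customers sent to page z1
sales_z2 = [15, 14, 0, 17, 16, 0, 18, 15]  # customers sent to page z2

p = permutation_p_value(sales_z1, sales_z2)
print(f"mean z1 = {mean(sales_z1)}, mean z2 = {mean(sales_z2)}, p-value ~ {p:.3f}")
```

A small p-value suggests the difference between the pages is unlikely to be chance alone. With groups this small the estimate is fragile, so a larger sample is worth the wait before acting on it.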

3.  Your intuition sucks.

If you haven’t already, read Michael Lewis’ Moneyball.  Just as with baseball scouts, most business people believe that they have good intuition.  And, just as with baseball scouts, most business people would be well served to add statistics to their arsenal of decision making inputs.  

The larger the organization, the more people will differ with respect to their intuition.  Of course, the decision is ultimately made by the most senior or the most vocal person (more often than not, the correlation between these two things is very high).  Oddly enough, senior people are often further away from the evidence and, consequently, are not well positioned to make the call.   Intuition may be an entirely reasonable replacement for #1, but it is most certainly not a good replacement for #2.

This is not a suggestion to replace the way decisions are made in your organization today with algorithms, but rather to add another tool.  As basic as this all sounds, I have been surprised by how few people really use basic statistics to help make decisions.  And if most executives [continue] to ignore this advice, it’s just more opportunity for the rest of us…

Categories: ideas

10 responses so far ↓

  • Howard Keziah // November 8, 2008 at 11:38 am | Reply

    I’m really glad I found this post (thanks Atul). We’re in the process of learning to make better use of the data we have available to make better marketing and product development decisions. Excel is on every computer in our office and we overlooked the data analysis package. Duh.

  • joshua schachter // November 13, 2008 at 5:24 am | Reply

    Argh.

    No, just finding correlations is not that useful. You’ll find mere correlation when you’re hoping for causation. You’ll find collinearity when you’re hoping for relationships, etc.

    FIRST you construct a model, even if it’s just a mental one. Then based on that you TEST possible relationships that the model predicts (your #1) and construct experiments that vet the model (your #2)

    Joshua

  • Mike Speiser // November 13, 2008 at 6:05 pm | Reply

    I agree that correlation alone is not enough. That’s why I suggested that you find out if the extracted function is predictive by running randomized tests.

    And I agree that you should have some objective function in mind when you set out — for example, “I hope that my behavioral targeting increases purchase rates as confirmed by the snippet of code that I have asked advertisers to put on their checkout pages.” Or even that “I’m trying to extract the function that maximizes revenue in my online (or offline) store by optimizing merchandising placement/mix.” If that’s what you mean by mental model, agreed. It’s a very good clarification.

    But if you mean that you need to have a hypothesis with respect to a specific causal relationship between variables in advance of a regression or the like (mental or more formal), then I respectfully disagree. In many cases, based on human intuition alone, we have limited visibility as to which variables lead to some desired outcome. That Peanut M&M’s and Valvoline Motor Oil should be placed side by side at Wal-Mart because people buy both together, for example. Or that when I type the query “schafhter” into Google that I really wanted schachter.

  • joshua schachter // November 14, 2008 at 12:50 am | Reply

    I’m not saying you just build a model from intuition alone; that is silly.

    Your notion of randomized testing is part of a model-building exercise (models can be mechanically tested) and doesn’t contradict what I’m saying.

    But you need to have some understanding of how it works so that you can expect the changes in output in response to differential inputs. Otherwise you’re just datamining. Your relationships will work up until the point they don’t, and you will have no insight into why they do or do not.

    If you have a large number of variables, you will get a bunch of spurious correlations simply by chance. And some will just be due to collinearity. How will you reject these without understanding them?

    That diapers-next-to-beer example (or motor oil and peanuts, whatever floats your boat) is largely apocryphal, isn’t it? My understanding is that it is mostly pay-for-placement.

    I do not think the query example makes any sense other than being non-intuitive, but having nothing to do with modelmaking or whatever.

  • Mike Speiser // November 14, 2008 at 5:14 am | Reply

    The merchandising example is from some article I read years ago about Wal-Mart. Based on discussions with friends in retail, I do think massive data mining does play a role in merchandising today.

    Or how about BT — do you really need to have a model for “people who visit this type of site buy this type of product” in advance for every site, all content, and every product? From data I’ve seen, machine learning on massive data sets dramatically increases ad click-through rates. Certainly BT is better than random ad placement or placement based on some periodic and crude taxonomic approach based on tagging category pages.

    And what about evidence-based medicine? While the ideal is some model that can be derived from scientific first principles, the reality is that much of medicine today is based on crude heuristics on a physician by physician basis. As you know, many of these heuristics are nothing more than informal observations of historic correlation — it’s a big problem in medicine today. Aggregated statistics could potentially identify relationships between things that are counter-intuitive based on our existing knowledge. Formal randomized testing of these correlations (like what exists in drug trials), would help us understand what’s causation versus random correlation.

    So many things in the world are too complex to develop any sort of model [mental or other] in advance. For many situations your approach is preferable. But for many other things, I think we would benefit from having a high-level objective (purchase, clicks, accurate diagnosis) and then looking for the needle in the proverbial haystack using multi-variate regression (or something of the sort) tuned by randomized testing.

    Kevin Kelly does a nice job discussing this in his bit on The Google Way of Science.

  • joshua schachter // November 18, 2008 at 7:20 am | Reply

    I guess. How many possible combinations of placements of things in the store exist? I am sure this process is at least seeded by people and then possibly further optimized.

    Nonetheless, your title suggests this is about making predictions. Exhaustive or semi-exhaustive testing is backwards-looking and thus not about making predictions, no?

  • Mike Speiser // November 18, 2008 at 11:00 pm | Reply

    On the # of combinations, ideally you could develop a system that could handle a nearly infinite number of variables (at least millions). To the extent that such a solution is unrealistic, then you could narrow the problem by setting systematic constraints — e.g., starting with the top 10% of products with respect to profitability or the like.

    No doubt that building such a system would require a great deal of thought and would likely require very different design decisions for each use-case (e.g., a search system would be unlikely to help you solve the problem for medical diagnosis). But these are precisely the types of problems for which I think it’s unrealistic to expect humans to develop mental models. They are far too complex and they are often non-intuitive. How about ad targeting? And search. And genetic testing. And medical diagnosis. And…

    On the concern about backwards-looking testing — I agree that doing such tests on historic data alone cannot identify causation. But such tests can surely suggest potential candidates. My proposal includes not only looking for relationships based on past data (with and sometimes without human involvement in model formation), but also running randomized tests (on real audiences) to identify causation based on the correlations derived from historic data.

    The end result is some sort of function which offers a predicted result f(x) for a given x. Isn’t that a prediction? Sure it’s just an approximation, rather than the exact answer you would get from an equation in Newtonian physics. But for many systems in the world, the competition isn’t a precise equation or even a human with a mental model. It’s often “gut instinct.” And that’s raw meat for entrepreneurs.

  • Jake // November 21, 2008 at 12:12 am | Reply

    I think there is a wonderful benefit to Mike’s approach: Finding new relationships faster. The ‘blind’ approach can identify relationships that emerge because of the inevitable change brought on by complexity. An approach that requires a pre-identified understanding of causation demands a tremendous investment of time in each variable, and relies on knowledge bases that have been built over years under models that may have aged. Mike’s approach allows for a rapid reaction to dynamic markets.
    Maybe putting peanut m&m’s into one’s car oil helps with mileage and people are sharing that insight on twitter. (Absurd example, but go with it.) Performing regular, frequent iterations of the A|B process will surface that consumer need faster than trying to precognit it. For a general retailer selling thousands of items that each have complex use cases and many potential new permutations, it may be easier to just look for the correlations and optimize the mix that way.
    An issue arises when you follow that strategy to a direction that moves your business model away from providing fundamental value to the customer. Say, Wal-Mart’s product positioning becomes dominated by random associations that remove any type of coherency for the customer. But I’d argue that’s not likely to happen because the correlations associated with strong causation will dominate the results and keep a natural order to your business. Besides, as a business owner you can always use “common sense” and choose not to deploy identified correlations.
    So a big question that I’m coming up with through this dialogue is: Are there scenarios that favor one form (a priori & A|B) versus the other? I can see how this process is easily applied to ecommerce and web 2.0, as changes to the business are a matter of changing code. However, for models that have a greater reliance on the physical world and/or a time component, A|B is clearly not optimal.

  • Adrian Scott // January 25, 2009 at 7:33 pm | Reply

    One note re A/B testing with small groups… Given the power-law distribution involved with user/member activity/value, it’s important to make sure that the sample size is still large enough to have valid results that aren’t skewed by a heavy connector landing in one of your test groups… enjoy!

  • Graphic and Web Designer // May 26, 2009 at 2:38 pm | Reply

    Thanks for the info. A correlation finder would be extremely helpful… and profitable for whoever creates it.
