Is It Real? Does It Matter? Interpreting Data Warehouse Reports

By Jarrett Rosenberg

Now that your data warehouse is online and flooding you with information, how do you separate the wheat from the chaff?

The data warehouse revolution harnesses modern computer technology, allowing managers of even the largest organisations to see and evaluate the functioning of all the different parts of the organisation. But the huge volume of information provided by such warehouses raises serious questions of how to make the most ffective use of the data and how to avoid new kinds of information management mistakes. In this article, I offer some rules of thumb to use in evaluating reports from your data warehouse.

The first thing to consider in looking at a report is whether or not the findings in it are really there. Taking action on a false alarm not only wastes effort, but can actually create problems where none existed before. Determining what's real and what isn't requires knowing the two major sources of data illusions: bad data and random variation.

Bad Data
Data quality is the difference between a data warehouse and a data garbage dump. Most databases contain data whose reliability is suspect; even the simplest audit or consistency check will turn up disturbing results, such as repair times with zero (or even negative!) values. There are many reasons for bad data, but a major one is that typically the producer and consumer of the data are not the same. For example, a trouble report in a service desk database is primarily used as a docket by customer service personnel working on the problem. Since fixing the problem as soon as possible (and moving on to the next one) is the highest priority, there is less interest in making sure that the details of the service process are accurately recorded. After all, the support staff don't use that data very much themselves, and they never use it in an analytic capacity. The result is that even basic information may be missing or incorrect. Solving the problem usually involves some sort of field validation mechanism, either on- or offline.

Some amount of bad data is inevitable, and usually tolerable, but it depends on how and why the data is bad. In the case of negative time periods, it may be due to some peculiarity of data entry in some cases, or it may be due to a systematic subtraction from all resolution times, which only becomes apparent in cases with small values, which become negative as a result. Until you know that you can trust the raw data, you can't trust anything derived from it. An essential part of constructing a data warehouse is cleaning the data as it is warehoused, but in the haste to complete the job, this often is done poorly or not at all.

How can you tell if the underlying data may be bad? By the time it has been boiled down into reports, it can be very hard to tell, but three techniques can help. First, do a simple sanity check. Are the reported values believable? Are they in the range that experience suggests they should be? Obviously out-of-line numbers are a sure signal that something is wrong somewhere. (Be wary, though, of assuming that the data must be correct because it fits with your expectations. This can lead to a cycle of self-deception that may not be broken until it is too late.) Second, look for consistency within and among reports. It's surprising how often the totals simply fail to add up or differ depending on which way they are broken down or who is reporting them. Finally, ask for than more than just summary statistics to be reported. Instead of just the average, look at the highest and lowest values. Even better, look at a graph of the distribution of values. Is it skewed? Are there outliers? A good graphic can convey much more information and make it easier to spot suspicious cases.

Random Variation
Even if you've maximised the quality of your warehouse's data, you may still be faced with findings that are illusory. This is because it's easy to forget the role of random variation in measurements. The most common use of data is comparison, and perhaps the most common form of comparison in business is comparison over time: plotting the same measure over time and looking for improvement or deterioration. Unfortunately, most things being measured - whether they are production numbers, customer satisfaction ratings, or what have you - will show variation over time even if the underlying process is constant. This is because random variation due to measurement error, sampling error, and the like is always present. If this random variation isn't taken into account, it is all too easy to imagine that some sort of change has taken place when it really hasn't. And trying to make adjustments for changes that didn't really occur only leads to trouble. There are two basic ways of handling variation. The first is to minimise it by some sort of adjustment. For example, Figure 1a shows software defects reported each month over the course of a year.

The fluctuations would cause any manager concern - if they were real. But they are in fact due to sampling error: the assumption is that each month's sample of reports is the same, but of course they differ since each month has a different number of working days. If we adjust for this fact by plotting the number of defect reports per work day for each month as in Figure 1b, we see that in reality the reporting rate is not changing at all - the variation is due to the number of work days.

The second way of handling random variation is to use statistical techniques that take it into account. A common method in quality control is the statistical process control (SPC) chart, which plots data along a baseline with "control limits" on either side showing when deviation from the baseline is likely to be more than just random variation. A more general method is that of the statistical test, which compares two numbers (or preferably, sets of numbers) and gives the probability that the difference between them is due solely to chance. (This is the famous "p-value," which traditionally should be less than 0.05, meaning that the observed difference has only a 5% chance of being due to random variation.) Statistical tests can be performed on just about any sort of data, but they do require some expertise to apply. Is the investment worth it? That depends on how much it's worth to know which findings are real and which are just "noise."

An additional problem with random variation is that most people have fairly limited intuitions about it. These intuitions are usually right, but sometimes they can be terribly wrong. For example, suppose that your service desk reports that the average time it takes to resolve a problem is four hours. What does that mean? Most people think it means something like Figure 2a: a distribution of resolution times that forms a bell-shaped curve centred on four hours. The reality, however, is much more likely to look like Figure 2b: a highly skewed distribution, most of which lies below four hours! The best way to avoid this pitfall is to look at the entire distribution rather than just relying on a single statistic; for example, to report the median as well as the mean. In a symmetric distribution the two numbers are very close, while in skewed distribution they are very different.

The first step, then, is to determine whether a finding is real or illusory. After that, you have to ask yourself, "Does it matter?" You can waste with real but irrelevant issues as with illusory ones.

Statistical significance versus Practical Significance
To begin with, people are often misled by the term "statistically significant." This term is misleading because it means "statistically reliable"; that is, not due to chance. It says nothing about practical significance, only about whether or not some observation is real or just a statistical fluke. The reason for this is that statistical significance is a function of both the size of the observed difference between two numbers and the size of the sample on which the observations were made. As a result, with huge samples like those in data warehouses, just about any difference can be statistically significant, even if the magnitude of the difference is miniscule. After determining if a difference is real, you have to ask if the difference is important. The answer to that depends on the size of the difference, whether or not it aggregates, and its strategic or tactical importance. The basic size of the difference is readily enough understood, but the other two factors require some explanation.

The Role of Aggregation
When a difference is small, the first reaction is to judge it unimportant, but that is only true if the difference doesn't aggregate. And it may aggregate from one perspective but not another. For example, say one service desk can handle 50% of its calls within two hours, and another can handle 50% of its calls within one hour and 45 minutes (and suppose that this is a real difference, not just due to random variation in our measurements). Does this saving of 15 minutes matter? From the customer's perspective, probably not; even though it's a 12.5% reduction, by the time you've waited an hour and 45 minutes, an extra 15 minutes probably doesn't seem like much. From the perspective of the service desk, however, it may mean a lot. It may mean that personnel are able to handle several more calls per day, which adds up across the number of staff and the number of days in a year. Thus the savings may be considerable when aggregated.

Where's the Leverage
Even if a difference is small and doesn't aggregate, it can be of practical importance if it has strategic or tactical value: leverage. A common problem in large organisations is a lack of understanding of how what one part of a company does affects the others - and customers. A primary motivation for data warehousing is the ability to detect such leverage relationships, where changing behaviour in one part of an organisation will directly or indirectly have a positive effect elsewhere. For example, at Sun Microsystems, we do periodic surveys of service customers to find out how satisfied they are with the service provided. Since the survey sample is taken from the service customer database, it is possible to connect Sun's performance in servicing the customer with the latter's satisfaction rating.

Figure 3 shows the result: a direct demonstration of exactly how altering service response time can affect customer satisfaction. This sort of leverage has long been the goal of total quality management, but it is only now being realised with the advent of data warehousing. A new generation of tools (e.g., BusinessObjects™) and consultancies (e.g., Marketing Management Services, Ltd.) has risen to realise these new advantages.

While data warehousing has launched a new era in information management, the flood of information it unleashes creates new challenges. Indeed, the sheer volume of information is so impressive and so overwhelming that people often become cast adrift in it and forget why it was collected and reported in the first place: to answer management questions and guide management actions. Faced with a report detailing various findings from a data warehouse, every manager should start by asking two simple questions: Is it real? and Does it matter?

Back to news articles

© Marketing Management Services International 2011