Garbage In, Garbage Out
–pretty much everyone who’s ever taken a course or read a book in the computer science, analytics, engineering, and data science worlds.
It’s a standard phrase: garbage in, garbage out, and we like it so much we use the GIGO acronym quite liberally. This, in contrast to some others, is actually a good use of an acronym.
The idea here is that if you have bad data, you won’t be able to get good results from your data warehouse, your model, your IoT devices, or whatever.
The phrase is supposed to be an inspiration to “clean up your data!” and make the “garbage” on the front end go away.
The purpose of this is to ensure that you have a “quality” product going in. The assumption is that by simply putting quality in, we’ll get quality back out. If we rephrase, then, swapping “garbage” for “quality,” we get “Quality In, Quality Out.”
But is this really true?
Nope. Not really.
Just because you have quality data going into your model, your system, or your process, that’s no guarantee that you’ll get a quality output. There are many reasons for this, but let me lay out just 3 for right now. I’m going to call them QIQO Fallacies, and hope that the phrase quickly becomes as ubiquitous as “Brangelina” or “WTF.”
QIQO Fallacy #1 – Solving the Wrong Problem
This happens when the stated problem is not the real problem. It shows up when someone, often far removed from an actual implementation team, says something like, “Hey, we’ve got all this data. I bet there’s a way to figure out how much [something] we’re [something else].” And then, the implementation team is tasked with figuring out the somethings and the something elses, but without a clear reason why.
Often this shows up because non-modelers have a very dim view of what data models can do, what they should do, and what they should not do. They assume that since there’s data, there must be something to be done with that data, and all you have to do is search long enough and you’ll find it.
Beware, though, of statistical anomalies that aren’t really there.
How to fix it: Ensure that you have enough discussion before and during the model building that everyone understands the reason for building a model in the first place. It’s not enough to just go exploring. Search enough variables for long enough and some of them will line up by pure chance, which leads to spurious correlations that are not only wrong, they’re potentially dangerous to your business. Make sure you have a plan with a reason, and execute accordingly.
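To see how easily “just exploring” turns up patterns that aren’t there, here is a minimal sketch in Python. Everything in it is made up for illustration: the function names, the 200 candidate columns, and the 20 rows are all assumptions, and every value is pure random noise. The sketch searches the noise columns for the one that correlates best with an equally random target.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_columns=200, n_rows=20, seed=0):
    """Search random noise columns for the best correlation with a random target.

    Hypothetical setup: neither the target nor the candidates carry any
    real signal, so whatever relationship turns up is an accident.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        candidate = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(candidate, target)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    r = strongest_spurious_correlation()
    print(f"Strongest correlation found in pure noise: {r:+.2f}")
```

With this many columns and this few rows, the search will almost always surface at least one column that looks meaningfully related to the target, even though nothing in the data means anything. That is the trap of searching without a reason.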
QIQO Fallacy #2 – Solving the Problem Without Actionable Insights
This happens when you have a model that may answer the question that was asked, but the answers aren’t actually providing insight on how to change your future actions as a result.
For instance, say you’re tasked with building a leading indicator model for some phenomenon that’s going to happen in a week or so. You give it your best and when you’re finished, you’ve got a great model! Your predictors give, with high accuracy and a small error bound, an expected range for the target value that’s going to show up in 6 or 7 days.
But – is that doing you any good? Do you have enough time to make any kind of shift or pivot to avoid that bad future?
Seven days may not be long enough in some industries, like health care, geopolitics, or urban planning. And it might be an eternity in industries with very short life cycles, like social media trends or stock trading.
The point is, whatever solution you’re searching for, needs to not only exist, it needs to be something about which you can take action. [Okay, that was a bit of a cumbersome sentence. Let’s try again.]
Just knowing about a future that you can’t affect doesn’t do much for you. You’d be better off not wasting your time on data preparation or model building if there’s nothing you can do to avoid those future predicted results.
How to fix it: Ensure that your problem is fairly well-defined when you start, and that there’s a business reason for the solution you’ll deliver. Be clear that there are actions that can be taken as a result of having the model outputs. Be confident that you’re not just building a model for the sake of a model: each time you run it, you should learn something about how the otherwise-expected future result could be changed, and come away with some kind of recommendation for a different action you can take as a result.
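One rough way to pressure-test this before building anything is to compare how far ahead the model predicts against how long each possible response takes to carry out. The sketch below is only an illustration in Python; the Response class, the actionable_responses function, and the example lead times are all hypothetical, not figures from any real project.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """A hypothetical action the business could take on a prediction."""
    name: str
    lead_time_days: float  # assumed time needed to carry the action out


def actionable_responses(horizon_days, responses):
    """Return the responses that can be completed within the forecast horizon."""
    return [r for r in responses if r.lead_time_days <= horizon_days]


if __name__ == "__main__":
    # Made-up example: a model that predicts 7 days ahead.
    options = [
        Response("shift marketing spend", lead_time_days=2),
        Response("renegotiate supplier contract", lead_time_days=45),
    ]
    for r in actionable_responses(7, options):
        print(f"Still possible within the horizon: {r.name}")
```

If that list comes back empty for the horizon you’ve been asked to predict, the model may answer the question and still change nothing.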
QIQO Fallacy #3 – Clogs Inside the Pipes – Bad Formulas and Poor Programming
This is the last one that I need to warn you about. If you’ve spent all your time figuring out how to appropriately “clean up” your data sets when they come in, and very little making sure that your model itself is up to code (ha! Get it? Up to “code”? I kill me.), you’re wasting your efforts. Because even if you have high-quality data, running it through an inefficient model isn’t going to be the best use of your time.
Essentially, if you start with a good data set but then send pieces through multiple transformations, ports, copy-paste steps, and generally bad practices, you’re introducing lots of different opportunities for inefficiency. If you’re setting it up so that output from one report is an input to another, you’re adding unnecessary steps.
Just like a clogged pipe, if your model is forcing your data through unnecessary intermediate steps just to get a final value, you’re slowing down your process. You’re adding extra complexity, and you’re creating additional opportunities for something to break once you go live. You don’t want that. Bad models are just as bad as bad data.
How to fix it: Remember to always apply good modeling and programming practices (efficiency, parsimony, clarity, documentation, etc.). This is a way to ensure that once you do get your data sets appropriately identified and documented, you’re not slowing down the harvesting of insights (actionable insights, of course!) with poor modeling practices.
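As one possible illustration of what “unclogged” can look like, here is a small Python sketch. It keeps every transformation as a short, documented function and runs them in a single pass, rather than handing the output of one report to another by copy-paste. The step names, the field names (region, cents, dollars), and the sample rows are all invented for the example.

```python
from functools import reduce


def drop_missing(rows):
    """Remove rows whose 'cents' value is missing."""
    return [row for row in rows if row["cents"] is not None]


def to_dollars(rows):
    """Add a 'dollars' field computed from 'cents'."""
    return [{**row, "dollars": row["cents"] / 100} for row in rows]


def total_by_region(rows):
    """Sum 'dollars' per region into a dictionary."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["dollars"]
    return totals


def run_pipeline(data, steps):
    """Apply each step in order, passing the result straight to the next one."""
    return reduce(lambda current, step: step(current), steps, data)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sales = [
        {"region": "east", "cents": 1250},
        {"region": "west", "cents": None},
        {"region": "east", "cents": 750},
    ]
    print(run_pipeline(sales, [drop_missing, to_dollars, total_by_region]))
```

The list of steps doubles as documentation: anyone can read the order of operations in one place, and removing an unnecessary step means deleting one name from the list.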
So, What Do You Think?
Am I way off? Or right on? Do you agree that QIQO is not guaranteed? Or do you think that we can get quality out just by putting quality in?
Send an e-mail to email@example.com and let me know your thoughts. I’d love to have another professional debate this issue and point out where my thinking is off base.