Data Smells

Developers are familiar with "code smells" --the little signs you see upon superficial examination of code that lead you to fear deeper pathalogical problems.  Over time, many developers become pretty good at spotting these signs, and volumes have been written about how to address these problems once they're detected.

But code smells aren't the only signs that problems are lurking in your system.  Most systems with even moderately complex data models can hide all sorts of problems in their data.

Skunk works logo
Image via Wikipedia

A good system, of course, will be coded defensively, such that it can tolerate, or maybe even fix bad data.  This is feasible and practical in small to mid-sized systems, but it becomes increasingly difficult as systems become larger and more complicated.  In all but trivially-small applications, bad data is a very real problem.

Like bad code, bad data is sometimes bad in very subtle ways.  Database constraints can (and should) be used to prevent obvious problems with things like unique id's, foreign-key references, required values, and so on.  This is a minimal requirement, but it won't help you deal with data that violates complex business rules (ex: an order must have an associated invoice if the status of the order is "placed order").

Typically, you'll find examples of data rule violations when you're diagnosing error, or maybe when you're doing reporting or data analysis.  When an instance of bad data is discovered, you've really got only two ways to deal with the problem:  Fix the code, or fix the data.

Fixing the code is often our first reaction, since we're (generally) more comfortable working in code than data in the first place.  We'll often go to the source of the error, and we'll change the code to tolerate this particular class of bad data, but it's important for us to ask ourselves if this is truly a fix for the problem:

  • Just because we fix the code in one place, how can we know we won't blow up somewhere else because of the same bad data?
  • If there's really a business rule governing this data, how are we helping by tolerating violations to those rules?
  • Are the business rules governing this data known at all (I know this sounds silly, but it's going to be a valid question more often than you might think)?
  • How did the bad data get into the system to begin with (is there a real bug upstream that's allowing bad data to be created)?

So in some cases, at least part of the problem is to fix the bad data.  Again, there are some questions you should pose of your system before you dive headlong into SQL:

  • Are there other instances of this data corruption?  How many?
  • What are the circumstances of the problem?  Is there a way to predict the scope or context of the problem?  Perhaps the context can lead you to the source of the data corruption.
  • Can the data be fixed at all?  Sometimes, the damage is irreversible, and repairs can be quite difficult.

As you may have gathered by now, the sooner these issues can be nipped in the bud, the better off you'll be.  I'll cover some strategies to help you with this in a future post.

Reblog this post [with Zemanta]

One Reply to “Data Smells”

Comments are closed.