In survey after survey, you rate poor data quality as a top concern for data warehouses. Yet when it comes time to open your wallet, there are suddenly more pressing issues. Do you honestly think ignoring the problem will make it go away? Get real. It’s time to put on the old hip boots and wade into the muck. Either that or end up with mud on your face when your $10 million warehouse grinds to a halt because no one trusts the data.
“People think building the data warehouse is the silver bullet and that quality will be taken care of as part of that process,” says Danette Taggart, data quality and value program manager for Hewlett-Packard Co., in Palo Alto, Calif. “There is a lack of awareness [about quality] among the people who are making the funding decisions.”
That lack of awareness is going to lead to some major data warehouse mudslides, according to data quality experts and consultants. The only reason the cleanup crews aren’t yet out in full force is that most companies aren’t far enough along in building their warehouses, says data warehousing guru Doug Hackney, president of Enterprise Group Ltd., of Hudson, Wis. “Come back in 12 to 18 months,” he says.
One company Hackney declined to name has spent five years and $15 million on its warehouse, only to see it sit dormant because of dirty data. “They’ve spent the last year and a half going back trying to scrub the data,” he says. “There is zero utilization because no one trusts it. … And there are massive political consequences.”
That’s a polite way of saying people are getting canned. And IT staffers are particularly vulnerable. Business units may have created the dirty data in the warehouse, but the finger almost always gets pointed at IT because it built the warehouse, Taggart says. It’s enough to make an IT manager avoid a data warehousing project altogether. But all hope is not lost. There are lots of options to clean up the grime: tools, consultants and, most important, free advice from peers. Here are the lessons and best practices of four major companies based on their data warehousing experiences.
At 1.5 terabytes, the data warehouse for the National Association of Securities Dealers is among the largest in the world and hosts stock market business information such as quotes, trades and orders from NASD’s stock market subsidiary, NASDAQ. “Making sure we produce and consume high-quality information is absolutely imperative for having a competitive edge within the industry,” says Tiba Soltani, data quality practice manager for NASD, in Rockville, Md.
NASD’s solution was a data quality certification program, established in March 1996, which encompasses seven elements, including data stewardship. This piece ensures that once data is cleaned, certain individuals are given responsibility for maintaining its purity.
Creating a formal program also creates a central place where all IT staffers and business managers can go for help with data quality. “It eliminates the need for each project to come out with its own processes, find its own tools and deal with consultants separately,” Soltani explains.
The central component of NASD’s program is the data quality certification process, which includes assessment, improvement and certification stages. Assessment and improvement go hand in hand. NASD uses Prism Solutions Inc.’s data cleansing tool, Prism Quality Manager (formerly QDB), to access tables in its relational database. The tables are then run against a set of standards–such as validity and completeness, business rules, and structural integrity–and Prism Quality Manager identifies which records are in error. NASD makes the fixes and certifies the table.
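The kind of rule-based screening described here–checking records for completeness, validity and business-rule conformance, then tallying the failures–can be sketched in a few lines of Python. This is a hypothetical illustration of the general technique, not Prism Quality Manager’s actual tool or API; the field names and rules are invented for the example.

```python
# Hypothetical sketch of rule-based record screening, in the spirit of
# NASD's certification checks (not Prism Quality Manager's actual API).

VALID_EXCHANGES = {"NASDAQ", "NYSE", "AMEX"}  # example domain for a validity rule

def check_record(record):
    """Return a list of rule violations for one trade record (dict)."""
    errors = []
    # Completeness: required fields must be present and non-empty.
    for field in ("symbol", "price", "exchange"):
        if not record.get(field):
            errors.append(f"missing {field}")
    # Validity: values must fall within a legal domain.
    if record.get("exchange") not in VALID_EXCHANGES:
        errors.append("unknown exchange code")
    # Business rule: a trade price must be positive.
    if isinstance(record.get("price"), (int, float)) and record["price"] <= 0:
        errors.append("non-positive price")
    return errors

def error_rate(records):
    """Fraction of records that fail at least one rule."""
    bad = sum(1 for r in records if check_record(r))
    return bad / len(records) if records else 0.0
```

A table would be “certified” once its measured error rate falls below an agreed threshold, then re-screened whenever new data arrives–which is exactly the recertification loop NASD describes.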
In the first pilot using the program on a table with about 20 million records, NASD was able to reduce its error rate from 7 percent to 0.01 percent, Soltani says. “Our other data information systems are going to have this program implemented as a result of the success of the pilot,” she says.
Once a table is certified, it’s not the end of the process. NASD goes back and recertifies tables when new data is added, she says.
There is a point, however, where companies need to be realistic about data cleanliness. KFC Corp.’s IT department learned that lesson when it wrote some data-scrubbing programs to cleanse data before importing it into a 180GB warehouse. Despite the elbow grease, dirty data got through anyway. “The first three to six months [into production] were rather intense in re-scrubbing the data,” says Thomas Rosing, manager of decision support in KFC’s information technology department in Louisville, Ky. “Now, 20 months into the project, we still have data issues we’re uncovering.”
The issue isn’t that KFC should have caught the problems the first time around–it’s impossible to be perfect. But what KFC’s IT department can and did do was have a realistic understanding that cleaning its data warehouse is an ongoing process.
One of the reasons KFC’s IT department has been so successful catching and cleaning up dirty data is that it ran a three-month pilot before its data warehouse went live. “If I had to counsel anyone else, I would insist they do proof of concept,” Rosing says. “That gave us a jump-start on resolving some of the more significant issues with the data and made life easier once we went into full production.”
The pilot ran for 90 days with two users from KFC’s strategic planning group. Even with just two users banging on the warehouse, IT was able to reduce errors by an “exponential” amount, Rosing says, although he declined to provide an exact figure.
Automation is another practice KFC has embraced to attack one of the main contributors to dirty data: human error. A data processor might accidentally hit the wrong key on a keyboard or type in a series of meaningless numbers because a program demands that all fields be filled.
Now, KFC automates data entry whenever it can. For example, it was getting unreliable data from the field about new security systems installed in its restaurants. It was able to go around the old technique of gathering reports from field reps by tying its data warehouse directly to its accounting system, which tracks fixed assets. Now the data is completely accurate, and it’s updated automatically, Rosing says.
Once you’ve nailed down the processes for keeping a tidy data warehouse, there’s always the sensitive issue of funding. Can’t get sign-off from top management on a budget for data cleansing tools or for hiring more staff? Talk to your business manager in his or her language: dollars and cents. “We’ve developed a methodology for determining the quality of information and what the value is, because they really are intertwined,” says HP’s Taggart.
In one project, Taggart and her team established the value of lost sales based on misaddressed marketing materials. They took a statistical sampling of mailings done earlier in the year and had businesspeople come up with the potential dollar amounts lost due to faulty data. “In this particular study, out of the total cost [of dirty data], 4 percent was related to the direct costs, such as materials, printing and postage, but 96 percent was related to lost revenue,” Taggart says.
In another case, the data quality team was able to show a sales and marketing department that the value of its data warehouse would increase by a stunning 84 percent if it improved its quality to “near perfect,” Taggart says. She declines to put a dollar figure on the improvement, but says it’s “significant.”
If you’re not sure where to start, try calling in consultants. Even Goliath HP found it worthwhile to bring in a data quality expert, Larry English, president of Information Impact International Inc., of Brentwood, Tenn. English helped Taggart develop HP’s data quality methodology. “He taught us a lot about the basics of data quality, bringing in his experience in other real-life businesses,” Taggart says. But it isn’t enough to bring in a consultant and have him or her do all the work. IT staffers and business users must work with the expert to share their knowledge of their specific situation, she says.
Partnering is extremely important, especially with users, she adds. Remember, IT is on the hook for dirty data. “Each side must recognize the value and the level of expertise the other provides,” Taggart says. “You can’t leave any of the pieces out if you want to be successful.” For example, “IT by itself will have a difficult time getting the funding for a data quality campaign,” she says. A business case must be made, and the only way to do that is to work directly with the users of the data.
It’s equally important to be on good terms with users because it’s easy for a schism to form between IT and business units when IT audits the units’ data. Users must feel they’re not the subject of a witch hunt and that they can be fully open in discussing quality issues, Taggart says. Otherwise, an adversarial relationship may develop.
Users don’t want to be blamed for problems, so they may question the results of a data audit or not be as forthcoming with information, she explains. “If problems are found, people feel defensive,” she says. “You need to make it clear that you want to do data quality not because you want to find out how bad the data is, but because you want to find out where to make investments. The key for the IT staff is to work with upper management to foster an open environment that doesn’t penalize people for telling the truth.”
One trap to avoid at all possible costs is the pursuit of perfect quality. “You’re never going to get it perfect,” insists Michael Scofield, senior manager of data architecture for DirecTV, of El Segundo, Calif. The key issue is to understand how data is going to be used and what constitutes an acceptable error rate. “A lot of business analysis can be done with plus or minus 1 percent error in your statistics,” he says. Using that as a guideline, you might ask if it’s acceptable to have a 1 percent error rate in state codes and county codes, he says. If you’re doing a direct mail drop for 10,000 customers, it’s probably OK. But if you’re doing data mining on 1 million customers to determine trends for a major new strategy, 10,000 incorrect entries may be enough to skew the results.
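Scofield’s arithmetic is worth making explicit: the same error rate produces very different absolute damage at different scales. A minimal sketch (the function name is illustrative, not from any particular tool):

```python
# Same error rate, very different absolute impact at different scales.
def bad_records(total_records, error_rate):
    """Expected number of erroneous records at a given error rate."""
    return int(total_records * error_rate)

# The 1 percent rate in the two scenarios Scofield describes:
mail_drop = bad_records(10_000, 0.01)       # 100 misaddressed pieces -- tolerable
mining_run = bad_records(1_000_000, 0.01)   # 10,000 bad entries -- enough to skew trends
```

The point is to set the acceptable rate from the use, not the other way around: a tolerance that is harmless for a mail drop can invalidate a data mining run.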
Scofield also suggests approaching data with a skeptical attitude. “That’s a bigger issue than the tool you’re using,” he says. “Are you creative enough to imagine all the ways data could go wrong?”
The biggest issue of all may simply be mindset. In one way or another, the solution to data quality problems seems to always come back to breaking out of old patterns of thinking, whether it’s in IT or the ranks of top management. One data quality tools vendor says IT tends to “get seduced by the flashing lights of technology and not be as concerned about the data.”
It’s a self-serving statement, but, based on the overwhelming evidence, it’s true. IT managers need to break out of that mindset of simply focusing on speeds and feeds. The potential for dirty data to bring down a warehouse is all too real. And when the dust clears, the most likely person with dirt on his or her face will be the head of IT.