|
Business Guide to Data Integrity -

Business Guide to Data Integrity
From the earliest days of computing, boffins have talked about Gigo - Garbage In, Garbage Out, meaning that a sow's ear of a database won't give you a silk purse. With the concentration of data in warehouses the problem looms larger than ever. Priscilla Awde reports
It is a task of Herculean proportions to create an enterprise-wide data warehouse. Centralising data in one warehouse may involve re-engineering or re-creating an entire corporation from the bottom up, from a company's most valuable asset?data.
Despite the enormity of the task, it is becoming essential. Implemented and properly used, data warehouses can transform even the largest corporation.
Giant and successful companies are outgrowing their internal organisation so they can no longer react to new business methods quickly or flexibly enough. While the business world has radically changed, many of the largest corporations have not. New companies, benefiting from state-of-the-art technologies, are competitive and innovative?representing a significant threat to older corporations' viability.
Building warehouses or datamarts is an attempt to impose a new form of order. It is about holding data in a central place so users can interrogate, analyse and slice/dice it to produce significant business decisions fast and accurately. Enterprise data warehouses give a whole new insight into the way a business is actually operating. Creating one is primarily about avoiding the 'garbage in, garbage out' syndrome. Without accurate, valid and reliable source data, a data warehouse is a waste of time and money. Otherwise, a warehouse will propagate the same mistakes resident in existing data and yield unreliable results because the analysis was performed on fundamentally incorrect data.
Data warehouses must contain clean data, conform to an overall architecture or plan and will succeed only if enough effort is put into understanding the data and the business rules governing it. Around 80% of the time should be spent on data integrity; if not, the project will fail. Business and IT staff must collaborate first to identify which data is important, then to ensure it is consistent, accurate, reliable, verifiable and not duplicated. Projects fail when IT people work alone. The end product is the information which supports the strategic and decision processes of the enterprise.
It may be possible to bite off chunks of a data warehouse project by creating smaller datamarts, but both require considerable preparatory work involving finding, cleaning and enhancing existing corporate data.
Danger of duplication
"Datamarts are popular because they are easier to set up and because it has been difficult for IT departments to justify the upfront costs of ensuring data integrity across the corporation. There have been a number of data warehouse failures in the UK because companies haven't prepared the data," says Roop Singh, director marketing, Northern Europe, Ardent. "But several datamarts each addressing different areas create a worse problem for data integrity because data is duplicated in these singular databases. Datamarts can cause longer term problems for managing data and for its accuracy. Enterprise Application Integration (EAI) is then needed to get datamarts working together."
The first, most important, expensive and time-consuming task (and one frequently ignored or done incompletely), is to discover what a company needs to know, what it wants to ask of the data and what business rules govern the data. What are the business problems a data warehouse will solve?
"Implementers have to think big?consider where they want to be in five years?but build small. But they still have to understand all the data sources and need to get business people involved in what they want to do, find out what information they require and then go back from that. Lots of data warehouses start the other way around and fail," says Martin de Ville, director of data warehousing for Oracle. "Analysts need to find where the data is held in the company or add software to collect it. It's possible to deliver quick wins, but solutions must be within the large system. People are still building quick and dirty solutions which will let them down in a couple of years."
Data mapping applied methodically will determine what data is available, where and how it must be prepared for a warehouse. It must also be 'cleaned', to guarantee accuracy, and transformed into the right format before being loaded into the warehouse.
The good news is that there are Extract, Transform and Load (ETL) tools to automate this process, although some manual intervention is commonly required to check inconsistencies or duplications.
"People are still building quick and dirty solutions which will let them down in a couple of years" Martin de Ville
"Tools extract wanted data, determine what is logically wrong and what needs to be fixed. They use a scoring technique based on predetermined assumptions of what is a valid record, throwing out inaccuracies for checking," explains Mike Blake, data marketing consultant at IBM. "These tools reduce time and cost factors?it's impossible to check 6 million records without automation and companies need all data correct. ETL is a hot topic. These technologies are also useful for mergers/acquisitions as well as for the business intelligence systems which make customers king. Companies which act like they have a market of one will achieve better business performance. Technologies exist to make each customer an individual."
If the right procedures are in place to clean up the data the results can be dramatic. Properly furnished, data warehouses can convert bureaucratic organisations into nimble corporations able to react fast to change on all levels, target sales and marcoms activities with pinpoint accuracy, bring the right products to market faster and compete much more effectively. Once a warehouse is stocked, managers have the opportunity of a global and dynamic view of the company; they can analyse, interrogate and mine the data to arrive at answers to the most complex questions.
Chance of a fresh start
"The key challenge is not to make the same mistakes that were made in the poor design of operational source systems - the functional databases which support one set of users, not one set of data shared across the enterprise," believes Larry English, president and principal of the US-based Information Impact International. "When databases were first built there were no data sharing technologies, but people continued to build these rigid, inflexible, incomplete and separate islands of automation which accounts for the mess most organisations are now in. Developing data warehouses presents an opportunity to create enterprise-wide capability to share information?it may be the last one because management may never again be willing to spend the millions required to do this."
Another major driver behind warehouses is that the information explosion makes rationalising corporate data essential. People are drowning in too much information, making it difficult to carry on as usual. Data should be a company's most valuable asset but that value can only be realised if it is properly organised and is made to work for; rather than against, the corporation.
"In competitive environments, access to information in a timely manner is as important as accuracy. Cost is third in level of importance," believes Reynolds Bish, president and CEO of Captiva Software.
"Historically, IT departments have invested heavily in front-end systems but organisations are starting to realise this is costly and the resulting information poor. So the focus is starting to shift to the data integrity and capture side," Bish explains. "However; the number one impediment to implementation is poor data quality at the start, which leads to user suspicion of the system. In cleaning data, the whole point is to minimise human manual intervention which is costly, not as fast as computers and subject to inaccuracies."
Most existing data comes in two forms: structured or unstructured. Both need to be part of the warehouse. While all systems contain structured data (tables, forms, charts), fewer include unstructured data (faxes, Web, Internet, emails, word processed documents). The latter contains important information but is more difficult to store and manipulate. Part of the process of cleaning data is to create an audit trail, or metadata, derailing everything that has happened to the data. Metadata is sometimes described as the 'glue holding the data together' - it is data about data, showing where it originated and how, where and when it changed. Without metadata no-one can confidently evaluate the contents of a data warehouse. Metadata gives users confidence ? they know the data on which they are working is accurate. This is an idealised view. In reality the process often goes wrong. The reasons have mostly to do with inadequate or incomplete data preparation. Most companies are still organised along functional business lines, with each part of the organisation working off its own data repository. The result is that data is held in disparate databases spread throughout the company, often on incompatible hardware and software platforms.
Reconciling multiple sources
As companies grow and change, these legacy or source systems become cumbersome and inadequate in a changing business environment. To make them do jobs for which they were never designed but which are now essential, users have added applications and IT staff have written extra code.
"Data integrity is difficult because change is happening faster, making the problem of data management three times harder and more important" Dr. Kay Hammer
"Most companies have bought software products or written their own. The result is completely different systems for different departments, each with rules relevant to their own departments but with variant definitions of the same query. So the same question will yield different answers depending on where in the organisation the question is asked," explains Keith Prince, European business intelligence marketing manager of Hewlett Packard. 'A data warehouse needs to get one definition of what the demand is, so companies have one consistent view of the whole organisation. Data preparation must remove the variety of definitions which exist in departmental models where data and its relationships are mapped differently."
While the historical muddle allows a company to operate, it contributes neither to user satisfaction nor to corporate competitiveness. In even the most efficient and best run companies which have maintained and updated their systems, a high percentage of the data is inaccurate, duplicated in other systems and generally not shared between databases.
|
English Lesson
Larry English of Information Impact International suggests the following Keys to Data Architecture:
- Involve business and IT experts in data modelling to define the meaning of data and its business rules to define the architecture correctly
- The product of data warehouses is information to support workers
- Data warehouse architecture must be built as an enterprise data structure
- Source data must be analysed and its uses understood
- Data must be cleaned and correct at both source and warehouse
- Transformations from source to target must be carefully mapped
- Audit and control ETL processes ensure data is properly handled at each stage
- User needs must be a point of focus and feedback from them is essential.
|
"Operational systems have a degree of tolerance for incorrect data, but this is not so in the data warehouse environment where business decisions are extrapolated from the data via data mining. Incorrect source data can radically affect the business outcome," explains David Wells, a principal consultant at Ovum which has published extensive reports on data warehousing and ETL.
"While corporations may have changed, legacy systems haven't; they have been hacked around to make them fit present needs," adds Wells. "Data may mean different things from what was originally intended, could be entered in clumsy ways or in duplicate. People need to ensure data means what it really means not what clerks know it means. In data warehousing projects, it is necessary to know a lot about the legacy systems, to figure out how they work, and only the more expensive ETL tools help with this?none is comprehensive. Front-end vendors are partnering with ETL companies and moving toward one-stop-shop solutions for the whole data warehouse project."
Legacy systems may suffice in non-competitive business environments or when a company is content to carry on business as usual. Today, most businesses operate in a climate characterised by rapid change, deregulation, competition and mergers and acquisitions. They need to act fast in global markets.
Dr. Kay Hammer, president and CEO of ETI, explains: "Data integrity is difficult because the rate of change in businesses, including mergers/acquisitions, is happening faster and more often, making the problem of data management three times harder and more important. The amount of data held or stored in different systems is exploding. The people who designed the source systems may have died, retired or moved on leaving few clues about what they did.
"Moving data from source systems means changing data values to fit them into a new format ?this is complex and needs metadata," says Dr. Hammer. "Correlation lists show the links between source system information and warehouse data. The evolution towards data consistency is like the struggle towards world peace - we have to try, otherwise the results are unacceptable." Moving data between legacy systems and the new technologies can create problems unrelated to data accuracy. Rob Wilson, sales and marketing director of Acuma Solutions, says: "Production systems may not be able to produce data fast enough or in the format required? there may be hardware constraints to moving large amounts of data around an organisation which must be understood. Legacy systems may allow incorrect data to be entered; there tends to be lots of rubbish in existing systems. We need to understand why errors are happening and correct them at source. If there are problems with the data, then it is impossible to go ahead with the project until they are sorted." Even accurate source data can change to become inaccurate when it is moved between systems.
Satisfying customers is now considered of paramount importance. Internet and Web-based technologies have changed business methods and more products are developed to stay competitive, all of which creates more data. Old legacy systems are less and less able to cope and their inflexibility is impeding corporate efficiency and competitiveness.
"As the number of data sources (mainframes, servers or Unix platforms) increases, then so do the problems of data integrity," says Singh.
"In multinationals the same products can have different catalogue numbers in different countries," he explains. "There are problems of definition between sites or databases. People need to identify both the data which needs to be manipulated and the ETL tools available to do it. In the past, people wrote their own code to extract and transfer data, but each time data changed they wrote more code which quickly became unmanageable and costly to maintain. Quality manager tools tell operators what data has gone wrong and where, so faults can be rectified manually or automatically."
When no-one knows what data is held, where it is or whether it is accurate, no-one can get a complete or accurate view of the company, thereby perpetuating inefficiency and bureaucratic mindsets. The integrity of data becomes critical once a data warehouse is built.
Automated checking
"Typographic errors and data duplication are common and must be eliminated says HP's Prince. "Names may be miss-spelled or entered twice with different addresses or may have the right postcode with the wrong town, The same customer may appear in three different databases and accurately entered in only one. Typos are fairly easy to check by cross reference and by questioning data within a company. Existing software packages can throw up inconsistencies or identify numbers which are too long. Records can be verified by outside organisations which maintain comprehensive, accurate and up-to-date files to determine creditworthiness.
"70% of the data warehouse project effort goes into tidying up data and clients don't understand this" Stephen Gallagher
"Projects can fail because of analysis paralysis - it's possible to spend too much time trying to define data which becomes too complex to manage with no extra value. Data warehouses are an organic state - they should give a dynamic view of the organisation, allowing the company to become very agile market leaders. Data should evolve and change with the corporation."
"There have been a number of data warehouse failures in the UK because companies haven't prepared the data" Roop Singh
Summing up, Stephen Gallagher, an executive consultant at KPMG Consulting, says:" 70% of the data warehouse project effort goes into tidying up data and clients don't understand this. Many don't believe they have a problem?they only see it when you hold up a mirror to the organisation showing them the errors resident in the data. Often clients get discouraged by the inaccuracies. "The worst thing people can do is think creating a data warehouse is a simple process," Gallagher insists. "Projects must be broken into bite-sized chunks and it should be possible to deliver the first mission-critical system within three months. There is a massive need to get a single view of the business and eliminate disparate databases. Companies need to get smart about what's inside their systems."
Data warehouses could evolve into the brains of an organisation. How clever or stupid they are will depend on the accuracy of the data they contain. +
|
Legal niceties
Data needs to conform to Data Protection requirements to have complete integrity
Data protection legislation controls how information is used, what can be collected, who should use it and for what. Companies are collecting ever more customer information in return for customised service. It is a trade-off and there are fewer chances of opting out. "Any organisation working in Europe has to conform to the same levels, but because America is not at the same level as Europe, some data cannot be shared between the two regions," explains Jennifer Major, data warehousing product manager at SAS Institute.
"Companies need to ensure that, at the international level, information sharing doesn't violate any acts," she adds. "It is necessary to involve corporate data protection experts when building and planning data warehouses. Data can only be used for the purpose for which it was collected, making it tricky for organisations wanting to cross-sell or use intelligence held within one part of the business to help another side, for instance retailers which open banks."
In the UK, companies have to tell data protection standards bodies what information they hold on individuals, what they do with it and which other companies have held that data.
"There are deep discussions at US and EU government level to arrive at a global system for privacy," says Hewlett-Packard's Prince. "Data protection legislation needs to cross national boundaries because the Internet does. Codes of practice need to be enshrined in an official standards body and, in future, I would expect to see legislation to penalise the wrong use of data. There must be legal redress against the wrong use of data. The good use of data means companies can quickly identify alterations of normal individual patterns to protect against fraud."
There is a new Data Protection Act going through the UK Parliament to strengthen the 1994 Act with requirements from the EU directive on data privacy. Compliance is due in 2002. "There are two key changes: all data held anywhere within an organisation (not just computer information) comes under the Act; there are cross-border controls for transferring information to countries which have no data protection legislation," explains IBM's George Thompson. "Details like ticking boxes to prohibit use of information for marketing have to be put somewhere in data warehouses." |
|