Thursday, December 13, 2007

Definition of Accurate Data

Data Quality Definitions

Data quality is defined as follows: data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy those requirements. In other words, data quality depends as much on the intended use as it does on the data itself. To satisfy the intended use, the data must be accurate, timely, relevant, complete, understood, and trusted.

Some examples will help in understanding the notion of data quality in the context of intended use. The sections that follow explore examples of the previously mentioned aspects of data quality.

Case 1: Accuracy

Consider a database that contains names, addresses, phone numbers, and e-mail addresses of physicians in the state of Texas. This database is known to have a number of errors: some records are wrong, some are missing, and some are obsolete. If you compare the database to the true population of physicians, it is expected to be 85% accurate.

If this database is to be used by the state of Texas to notify physicians of a new law regarding assisted suicide, it would certainly be considered of poor quality. In fact, it would be dangerous to use it for that intended purpose.

If this database were to be used by a new surgical device manufacturer to find potential customers, it would be considered high quality. Any such firm would be delighted to have a potential customer database that is 85% accurate. From it, they could conduct a telemarketing campaign to identify real sales leads with a completely acceptable success rate. The same database: for one use it has poor data quality, and for another it has high data quality.

Case 2: Timeliness

Consider a database containing sales information for a division of a company. This database contains three years' worth of data. However, the database is slow to become complete at the end of each month. Some units submit their information immediately, whereas others take several days to send in information. There are also a number of corrections and adjustments that flow in. Thus, for a period of time at the end of the accounting period, the content is incomplete. However, all of the data is correct when complete.

If this database is to be used to compute sales bonuses that are due on the 15th of the following month, it is of poor data quality even though the data in it is always eventually accurate. The data is not timely enough for the intended use.

However, if this database is to be used for historical trend analysis and to make decisions on altering territories, it is of excellent data quality as long as the user knows when all additions and changes are incorporated. Waiting for all of the data to get in is not a problem because its intended use is to make long-term decisions.

Case 3: Relevance

Consider an inventory database that contains part numbers, warehouse locations, quantity on hand, and other information. However, it does not contain source information (where the parts came from). If a part is supplied by multiple suppliers, once the parts are received and put on the shelf there is no indication of which supplier the parts came from. The information in the database is always accurate and current. For normal inventory transactions and decision making, the database is certainly of high quality.

If a supplier reports that one of their shipments contained defective parts, this database is of no help in identifying whether any of those parts are still on hand. The database does not contain a relevant element of information, and without that information it is of poor data quality for the intended use.

Case 4: Completeness

A database contains information on repairs done to capital equipment. However, it is a known fact that sometimes the repairs are done and the information about the repair is just not entered into the database. This is the result of lack of concern on the part of the repair people and a lack of enforcement on the part of their supervisors. It is estimated that the amount of missing information is about 5%.

This database is probably a good-quality database for assessing the general health of capital equipment. Equipment that required a great deal of expense to maintain can be identified from the data. Unless the missing data is disproportionately skewed, the records are usable for all ordinary decisions.

However, trying to use it as a basis for evaluating warranties makes it a low-quality database. A missing transaction could easily make an important piece of equipment appear to satisfy a warranty when in fact it does not.

Case 5: Understood

Consider a database containing orders from customers. A practice for handling complaints and returns is to create an "adjustment" order for backing out the original order and then writing a new order for the corrected information if applicable. This procedure assigns new order numbers to the adjustment and replacement orders.

For the accounting department, this is a high-quality database. All of the numbers come out in the wash. For a business analyst trying to determine trends in growth of orders by region, this is a poor-quality database. If the business analyst assumes that each order number represents a distinct order, his analysis will be all wrong. Someone needs to explain the practice and the methods necessary to unravel the data to get to the real numbers (if that is even possible after the fact).

Case 6: Trusted

A new application is deployed to determine the amount and timing of parts orders for machinery, based on past history and the time in service since the last replacement in the machines the parts are used in. The original application had a programming error that ordered 10 times the amount actually required. The error went undetected until a large order was sent. A great deal of publicity ensued over the incident. The programming error was fixed, and the problem has not recurred.

The database was never wrong; the application was. The large order was actually placed and the database reflected the order as such.

Because of a fear of a repeat of the incident, the maintenance chief has chosen not to use the application or the information within the database. He orders parts based on a small spreadsheet application he built to keep much of the same information, even though he often misses transactions and does not always know when new parts arrive in inventory.

Unless his confidence in the original application is restored, the database is of poor quality, even though it is entirely accurate. It is not serving its intended use due to a lack of believability.

Principle of Unintended Uses

The previously cited examples demonstrate that you cannot separate data from uses. To assess the quality of data, you must first collect a thorough specification of the intended uses and then judge the data as to its suitability for those uses. In a perfect world, database builders would gather all requirements and then craft a database design and applications to match them.

In the real world there is a serious problem when dealing with "unintended uses." These are uses that were not known or defined at the time the databases were designed and implemented.

Unintended uses arise for a large variety of reasons. Some examples follow:

  • The company expands to new markets.

  • The company purchases another company and consolidates applications.

  • External requirements are received, such as a new tax law.

  • The company grows its usage, particularly in decision making.

  • The company migrates to a new, packaged application that has different needs.

This represents the single biggest problem with databases. Unintended uses proliferate at ever-expanding rates. You cannot anticipate all uses for a database when initially building it unless it is a database with unimportant content. In the real world you can expect (and in fact depend on) a large number of unintended uses appearing with surprising regularity. Each of these can cause a good database to become a bad database. Two things are needed: anticipation in database design and flexibility in implementations.

Need for Anticipation in Database Design

Database designers need to be schooled in the principles of data quality. By doing so, they will be able to prevent some data quality problems from occurring when unintended uses appear. At the least, they should be schooled to be diligent in the careful and thorough documentation of the content. This means that metadata repositories should be more prevalent, more used, and more valued by information systems groups.

Anticipation also includes questioning each database design decision in light of what might appear in the future. For example, name and address fields should anticipate the company's growth into markets in other countries where the structure and form of elements may vary. Another example is anticipating sales amounts in multiple national currencies.

Database and application designers should be discouraged from using confusing and complicated data encoding methods. Many bad techniques proliferated in the past that were the result of the limited capacity and slow speed of storage devices. These are no longer excuses for making data structures and encoding schemes overly complicated.

A good database design is one that is resilient in the face of unintended uses. This principle and the techniques to achieve it must be taught to the newer generations of information system designers.

Need for Flexibility in Implementations

We know that changes will be made to our systems. Corporations always change. They change so much that keeping up with the changes is a major headache for any CIO. Unintended uses are one of the reasons for change. When a new use appears, its requirements need to be collected and analyzed against the data it intends to use. This needs to be done up front and thoroughly. If the database is not up to the task of the new use, either the new use needs to be changed or discarded, or the database and its data-generating applications must be upgraded to satisfy the requirements. This analysis concept needs to be incorporated into all new uses of data.

It is amazing how many companies do not think this way. Too often the data is force fit into the new use with poor results. How many times have you heard about a data warehouse project that completed but yielded nonsense results from queries? This generally is the result of force fitting data to a design it just does not match.

Database systems are better able to accept changes if they are designed with flexibility in the first place. Relational-based systems tend to be more flexible than older database technologies. Systems with thorough, complete, and current metadata will be much easier to change than those lacking metadata.

Most information system environments do a very poor job of creating and maintaining metadata repositories. Part of the blame goes to the repository vendors who have built insufficient systems. Part goes to practitioners who fail to take the time to use the metadata systems that are there. Part goes to a lack of education and awareness of how important these things really are.


Data Accuracy Defined

Data accuracy is one of the components of data quality. It refers to whether the data values stored for an object are the correct values. To be correct, a data value must be the right value and must be represented in a consistent and unambiguous form.

For example, my birth date is December 13, 1941. If a personnel database has a BIRTH_DATE data element that expects dates in USA format, a date of 12/13/1941 would be correct. A date of 12/14/1941 would be inaccurate because it is the wrong value. A date of 13/12/1941 would be wrong because it is a European representation instead of a USA representation.

There are two characteristics of accuracy: form and content. Form is important because it eliminates ambiguities about the content. The birth date example is ambiguous because the reviewer would not know whether the date was invalid or just erroneously represented. In the case of a date such as 5 February, 1944, the USA representation is 02/05/1944, whereas the European representation is 05/02/1944. You cannot tell the representation from the value and thus need discipline in creating the date values in order to be accurate. A value is not accurate if the user of the value cannot tell what it is.
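
A minimal sketch of the ambiguity, using Python's standard datetime library: the same character string yields two different dates depending on which representation convention is applied, so the convention must be fixed and enforced for the value to be accurate.

    from datetime import datetime

    raw = "05/02/1944"  # ambiguous on its face: 5 February or 2 May?

    usa_date = datetime.strptime(raw, "%m/%d/%Y")       # USA convention: month/day/year
    european_date = datetime.strptime(raw, "%d/%m/%Y")  # European convention: day/month/year

    print(usa_date.date())       # 1944-05-02  (read as May 2, 1944)
    print(european_date.date())  # 1944-02-05  (read as February 5, 1944)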

Value Representation Consistency

Two values can be both correct and unambiguous and still cause problems. For example, the data values ST Louis and Saint Louis may both refer to the same city. However, the recordings are inconsistent, and thus at least one of them is inaccurate.

Why is consistency a part of accuracy? After all, although the values are different, anyone looking at them would have no trouble interpreting what they mean. The answer is that inconsistent values cannot be accurately aggregated and compared. Since much of database usage involves comparisons and aggregations, inconsistencies create an opportunity for inaccurate usage of the data. Because all databases have a base level of unintended uses through the opportunity for ad hoc queries, data is inaccurate if it is inconsistent in representation.
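
The following sketch, using a hypothetical list of customer records, shows how inconsistent representations silently split what should be a single aggregate:

    from collections import Counter

    # Hypothetical customer records; both spellings refer to the same city.
    customers = [
        {"name": "A. Smith", "city": "ST Louis"},
        {"name": "B. Jones", "city": "Saint Louis"},
        {"name": "C. Brown", "city": "Saint Louis"},
    ]

    # An ad hoc count by city splits one real city into two groups.
    print(Counter(c["city"] for c in customers))
    # Counter({'Saint Louis': 2, 'ST Louis': 1}) -- no single line shows the true total of 3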

Change-Induced Inconsistencies

Inconsistencies can also be caused by system changes that change the way information is recorded or that change the granularity of recording. Following is an example of how change in granularity might introduce inconsistencies: A company has a color field that only records red, blue, and yellow. A new requirement makes them decide to break each of these colors down to multiple shadings and thus institute a scheme of recording up to 30 different colors, all of which are variations of red, blue, and yellow. None of the old records are updated to the new scheme, as only new records use it. This database will have inconsistency of representation of color that crosses a point in time.

These types of changes are often not documented. The point in time of the changeover may also be fuzzy, wherein not everyone switches to the new scheme at the same time. If data mining and business intelligence software is used on this database for periods of time that cross the change, results will be inaccurate.
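
One workable approach, sketched below with purely illustrative shade names, is to roll the finer-grained codes back up to the original three colors before any analysis that crosses the changeover point:

    # Illustrative roll-up table from new shade codes to the original three colors.
    SHADE_TO_BASE = {
        "CRIMSON": "RED", "SCARLET": "RED",
        "NAVY": "BLUE", "SKY": "BLUE",
        "GOLD": "YELLOW", "LEMON": "YELLOW",
    }

    def base_color(recorded_value: str) -> str:
        """Normalize a recorded color to the pre-change granularity for trend analysis."""
        return SHADE_TO_BASE.get(recorded_value, recorded_value)

    print(base_color("CRIMSON"))  # RED
    print(base_color("RED"))      # RED (old records pass through unchanged)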

Valid Values

The definition of a value being valid means simply that the value is in the collection of possible accurate values and is represented in an unambiguous and consistent way. It means that the value has the potential to be accurate. It does not mean that it is accurate. To be accurate, it must also be the correct value. Defining all values that are valid for a data element is useful because it allows invalid values to be easily spotted and rejected from the database. However, we often mistakenly think values are accurate because they are valid.

For example, if a data element is used to store the color of a person's eyes, a value of TRUCK is invalid. A value of BROWN for my eye color would be valid but inaccurate, in that my real eye color is blue.
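
A small sketch of the distinction, with an assumed domain of eye colors: validity can be checked mechanically against the domain, but accuracy cannot.

    # Assumed domain of valid values for an EYE_COLOR data element.
    VALID_EYE_COLORS = {"BLUE", "BROWN", "GREEN", "HAZEL", "GRAY"}

    def is_valid(value: str) -> bool:
        return value.upper() in VALID_EYE_COLORS

    print(is_valid("TRUCK"))  # False -- invalid, easily spotted and rejected
    print(is_valid("BROWN"))  # True  -- valid, yet possibly inaccurate
                              # (the person's eyes could actually be blue)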

Missing Values

A data element that has no value in it may be either accurate or inaccurate. For example, a missing value in the data element COLLEGE_LAST_ATTENDED would be accurate if the person it applied to had never attended college. On the other hand, a BIRTH_DATE value left blank would not be accurate because all of us have birth dates.

A blank for COLLEGE_LAST_ATTENDED may be accurate or inaccurate. If the person it applied to had attended college, it would be inaccurate. This is another case of valid but not accurate.

Missing values are very problematic in a database because of this characteristic. To be accurate in recording information, an optional data element should allow encoding a value for NULL. This is properly done by creating a separate field that indicates whether the value in the primary field is missing or not. In this instance, the value in the primary field would be set to blank if no college had been attended. If the recorder simply did not know the answer to the question, the NULL field would be set to YES, meaning that the blank in the primary field does not represent a real answer.
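
A minimal sketch of that design, with assumed field names: the primary field holds the value, and a separate indicator records whether a blank means "does not apply" or "not known".

    from dataclasses import dataclass

    @dataclass
    class EducationRecord:
        college_last_attended: str  # blank when there is no college to record
        value_unknown: bool         # True means the blank does not represent a real answer

    never_attended = EducationRecord(college_last_attended="", value_unknown=False)
    not_collected = EducationRecord(college_last_attended="", value_unknown=True)

    # Queries can now separate "never attended college" from "answer is missing"
    # instead of treating every blank the same way.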

Unfortunately, few systems are built to allow for this distinction. Even if they were, most data entry people would not get it right all of the time anyway.

Sometimes a system is created that supports a keyword that means empty. In the previous example, it might be "never attended college." This is a bad practice because it can make queries that do COUNT, GROUP BY, and other operations difficult or impossible to formulate correctly. Thus, it leads to inaccurate query results.

A system design that fails to make the distinction between "no value is OK" and "I don't know the correct value" is a common source of poor data quality.

Object-Level Accuracy

The concept of accuracy also applies above the data element level. Data elements are never recorded in isolation. They are value attributes of business objects such as personnel records, orders, invoices, payments, and inventory records. The business objects represent real-world objects or events, and each consists of one or more rows of one or more tables connected through keys. Object-level inaccuracies consist of objects that are missing, have missing parts, or that exist but should not.

For example, if a repair is done to a piece of capital equipment and the repair person just failed to complete and submit a record of the action, there would be a missing row or rows in the database. It is just not there. Any decisions made from the database can be flawed if the missing information would affect the outcome. Missing objects are very difficult to detect. This is an important component of accuracy because a database may have nothing but accurate data but be an inaccurate database because of missing information.

An example of missing elements might be found for an entity consisting of a master record and one or more subrecords. For example, a master record may consist of a record describing a rental object, and the subrecords are instances of renting. If one of the rentals is done without the record being added to the database, the database still looks accurate, but in fact is not.

Missing subrecords are very difficult to detect, whereas missing master records are generally easier to detect. For example, if the rental auto is sold and the master record deleted but not the subrecords, detecting the problem is easy.

An example of objects being present that should not be is a personnel database containing a record for an employee who left the company. The record was just not deleted when it should have been. These are also generally difficult to detect.

Object-Level Inconsistencies

The concept of consistency is also found at the object level. Databases are not static. Changes are being made all of the time. Large databases generally have data flowing into them from many different sources. If you have groups creating data with different criteria for when to add an object (insert) or remove an object (delete), you can end up with object inconsistency. This means that if you want to use the database to get information about object counts or data aggregations, you have the potential to get inaccurate results.

To guard against this, companies need to define birth and death rules for data objects and try their best to enforce them. They also need to define a method of determining when all of the data is consistent over some time period so that the data can be used intelligently. The scenarios discussed in this section can be charted in terms of what constitutes accurate versus inaccurate data.

Distribution of Inaccurate Data

The presence of wrong values will generally not be distributed evenly throughout the database. The reasons for this are as follows:

  • Some data is more important than other data.

  • Some inaccurate data tends to get recognized and fixed when used.

  • How an element of data is used will affect the chances of inaccuracies being recognized.

  • Flaws in data acquisition processes are not equal for all elements.

In every database there are data elements that are more important for an application than others. For example, in an orders database, the order number and customer number are more important than the order date. If the customer number is wrong, it will get recognized very early and get fixed. If the order date is wrong, it may never get recognized and fixed. In an HR (human resources) database, the employee's Social Security number is more important than the last education level achieved. A Social Security number error will get recognized and fixed very early, whereas the educational level achieved will probably never get recognized nor fixed. If a large number of errors occur on a frequent basis in important fields, a major issue erupts and the source of the errors is found and fixed.

Another factor is how a data element is used. A field that is used in computing an amount or for updating inventory levels is more important than one that is merely descriptive and is only printed on reports. Computations and aggregation fields will generally precipitate visual clues to errors, whereas fields not used for these purposes will generally not be recognized for their errors.

The tendency for data elements that are more important to be more accurate is why quality problems rarely surface through the initiating transaction applications. The major problems with fields important to the users of that application have already been recognized, and corrective action has been taken to ensure that they are of sufficient accuracy to satisfy their requirements.

The data inaccuracy problem surfaces when this data is moved and used for decision making. Many of the data elements used only to record secondary information about the transaction now become much more important. For example, trying to correlate promotions to educational levels requires that the "education level achieved" field be very accurate. This new use has a higher demand on this data element than the demands made from the HR application.

This is a major reason data suddenly appears to be awful even though the transaction applications have been running for years with no complaints. The new uses of the data place higher requirements for accuracy on some of the data elements than the transaction applications did.

Unfortunately, another dynamic comes into play regarding the chances of getting improvements made. The only way the data will come up to the level needed by the newer uses is for fundamental changes to occur all the way back to the transaction level. And yet, the farther away you get from the initiating application, the more difficult it is to get changes made. The people who own the data are satisfied with the quality and place low priority on complaints from decision support analysts. This situation screams out for data stewards and data quality controls.

Can Total Accuracy Be Achieved?

The short answer is no. There will always be some amount of data in any database that is inaccurate. There may be no data that is invalid. However, as we have seen, being valid is not the same thing as being accurate.

Data accuracy is much like air quality. You can never hope to get to 100% pure air quality within an area where people live and work. It is just not possible. However, most people can distinguish between good air quality and poor air quality, even though both have some level of imperfections. People value higher-quality air over lower-quality air, and they know the difference.

Data accuracy is the same thing. Improvements in the accuracy of data can change the perception of poor data quality to good data quality in most databases even though inaccuracies persist. It is a rare application that demands 100% accurate data to satisfy its requirements.

A database that has a 5% inaccurate data element rate will probably be very troublesome to most users. The same database at a 0.5% inaccurate rate would probably be very useful and considered high quality.

Another important concept to understand is that data inaccuracies arise for a variety of reasons. Some of these are

  • wrong values entered

  • data entry people who do not care to do the job right

  • confusing and contradictory data entry screens or forms

  • procedures that allow for data to not be entered or not be entered on time

  • procedures or policies that promote entering wrong values

  • poorly defined database systems

If you can identify and correct all of the sources except the first one, you can get very high levels of data accuracy. You are left with only the case where I meant "blue" but entered "black". Data entry technology and best practices exist that can minimize the amount of these types of errors as well.

In almost all cases where poor data quality is reported, no effort has been made to identify root causes of wrong values. Without finding root causes, improvements in the quality are not going to occur. Whenever effort is spent to identify root causes and correct them, improvements follow. The improvements are almost always noticeable and impressive.

All other reasons tend to cause a clustering of data inaccuracies around the faulty process. These are easier to find and correct than the random errors that occur just because people enter data mistakenly. If all we had left were the random errors of people, the errors would be more evenly distributed throughout the database, would be small in number, and would have minimal impact on the uses of data.

So, the long answer is yes. You can get accurate data to a degree that makes it highly useful for all intended requirements.

Finding Inaccurate Values

Some of the inaccurate values can be found, but probably not all of them. In most cases a high percentage of inaccurate values can be found if enough effort is spent searching for them.

In any collection of data there will be some data that is accurate and some that is not. Of the data that is not accurate, some of it is valid and some of it is invalid.


There are two ways you can go about finding inaccurate data: reverification and analysis. If you want to be 100% sure that all values are accurate, you must use reverification. Even then, you cannot achieve certainty for many types of data: the reverification process may not be possible for some data, and because reverification is done by people, mistakes may be made in the reverification process itself.

The maximum potential of analytical techniques also falls short of perfect even if you employ all analytical techniques available.

Reverification

This means that you manually go back to the original source of the information and check every value. It is the only sure way to determine what values are right and what values are wrong. The reason for this is that analytical techniques cannot tell if a valid value is accurate unless there is a secondary source of information to correlate the value against.

Reverification is not always possible. For example, if the data records temperatures as of a point in time, it is impossible to go back in time and measure them again.

Reverification is also susceptible to errors. You would generally think that verifiers would be more careful than original entry people. However, this may not be true. In addition, the source of information that provided wrong information initially may continue to provide wrong information in the reverification. If someone lies about their age initially, they will probably lie about it in the reverification.

You would expect that a reverification would improve the quality of the data by some amount. It would catch more errors than it would create. However, you can never be sure that it catches them all.

Reverification takes a long time. If it were implemented as an operational technique to ensure the quality of the data before use, it would certainly violate the timeliness requirement for almost all uses.

The primary problem with reverification is that it is impractical for most databases. It is enormously time consuming and expensive. Reverification on databases that are continuously changing is extremely difficult if not downright impossible. For example, a web site that takes 10,000 orders per day is not likely to seek reverification of the information on every order. Companies are not willing or able to use this technique widely.

There are isolated cases for which reverification is practical. On small but important databases, it is not only possible but sometimes appropriate. For example, a database on capital equipment expenditures may use this technique. Another case is to use reverification on only part of the objects in a database. For example, an industrial supplier may verify order information on all orders over a certain amount of value or a certain amount of product.

Selective reverification may also be a technique used in monitoring a database. Random selection of transactions or records for reverification can be a good quality assessment tool either to spot developing problems or to instill confidence among users of the data. This improves the trust dimension of the quality of the data.

Analytical Techniques

Analytical techniques involve the use of software in conjunction with the skills of a data or business analyst to search through databases to find the presence of inaccurate data. Analytical techniques can be used against transactions as they are occurring, against databases as they are changing, or against databases on a periodic basis.

There are five categories of analysis that can be applied to data: data element analysis, structural analysis, value correlation, aggregation correlation, and value inspection. These are individually explored in Chapters 8 through 12. Each of these contributes to finding the presence of inaccurate data. However, not all of them can pinpoint the actual wrong values. The sections that follow briefly describe these types of analysis.

Element Analysis

Element analysis involves looking at individual values in isolation to determine if they are valid. To do this you need a definition of what is valid and what is not. For example, a Social Security number must consist of nine numeric digits. If this is your only definition, you will find all values that are blank, contain non-numeric characters, or contain more or fewer than nine digits. However, you can go further in your definition. The government employs a scheme of assigning numbers that allows you to examine the value in more detail to determine whether it is valid or not. Using the larger rule has the potential for finding more inaccurate values.
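
The sketch below illustrates both levels of definition for this example. The tighter checks reflect commonly cited Social Security number assignment constraints (no area of 000, 666, or 900-999; no all-zero group or serial); the exact rules used in practice should come from the element's documented definition.

    import re

    def ssn_problems(value):
        """Return the reasons a Social Security number value is invalid (empty list = valid)."""
        if value is None or value.strip() == "":
            return ["blank"]
        if not re.fullmatch(r"\d{9}", value):
            return ["not exactly nine numeric digits"]
        problems = []
        area, group, serial = value[:3], value[3:5], value[5:]
        if area in ("000", "666") or area.startswith("9"):
            problems.append("area number is never assigned")
        if group == "00" or serial == "0000":
            problems.append("all-zero group or serial number is never assigned")
        return problems

    print(ssn_problems("12345678"))   # ['not exactly nine numeric digits']
    print(ssn_problems("666120345"))  # ['area number is never assigned']
    print(ssn_problems("123456789"))  # [] -- valid, but not necessarily accurate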

Element analysis can include examination based on data type, length, range of values, list of discrete values allowed, patterns allowed, and any other information that is helpful. The more definition you provide, the more potential you have for finding invalid values.

A simplified but often overlooked technique for element analysis is visual inspection. This method involves looking at values to see if they belong to the data element or not. For example, you could not write a programmatic rule to find invalid names when those names contain only valid characters. However, a human could look at the list and find wrong or questionable entries because of the superior human capability to deal with semantics and context. For example, names such as Donald Duck, xwdcgex, Don't know, and Wouldn't tell me would be spotted by the analyst and ruled to be invalid. You could not write a comprehensive enough rule to find these through a program.

Structural Analysis

Structural analysis techniques involve checking columns of values for uniqueness or consecutiveness, checking for orphans on collections of records with connecting parts, and checking for circular relationships. For example, no two employees can have the same employee number. Each employee record must have a corresponding emergency contact record, and each emergency contact record must belong to a valid employee record. An example of circular relationships would be that a part cannot be used in the construction of itself.

Structural analysis techniques are used on collections of records. They can identify clear inaccuracies in the data. They generally isolate the error to a small number of records. They do not identify the offending values. Reverification generally is required to pinpoint the actual offender.
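
A minimal sketch of two such structural checks, run over hypothetical employee and emergency-contact records:

    from collections import Counter

    employees = [{"emp_no": 101}, {"emp_no": 102}, {"emp_no": 102}]  # duplicate key
    contacts = [{"emp_no": 101, "contact": "R. Lee"}, {"emp_no": 205, "contact": "T. Ray"}]

    # Uniqueness check: no two employees may share an employee number.
    duplicates = [k for k, n in Counter(e["emp_no"] for e in employees).items() if n > 1]

    # Orphan check: every emergency contact must belong to an existing employee record.
    employee_keys = {e["emp_no"] for e in employees}
    orphans = [c for c in contacts if c["emp_no"] not in employee_keys]

    print(duplicates)  # [102] -- an inaccuracy exists, though not which record is the offender
    print(orphans)     # [{'emp_no': 205, 'contact': 'T. Ray'}]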

Value Correlation

Value correlation analysis involves checking collections of values against a rule that must hold true over the data. For example, if an EMPLOYEE_TYPE field indicates a part-time employee, the PAY_TYPE field must be hourly. This is a rule that applies to a subset of the records in the database. If it is violated, an inaccuracy exists in the data. However, you cannot tell which value is wrong, EMPLOYEE_TYPE or PAY_TYPE. You just know that one of them is wrong.

Value correlation can be very simple or very complex. It may involve only values in a single row, values in multiple rows of the same relation, or values that cross over rows of multiple tables.

There are generally a large number of rules that can be written for cross-checking values to ensure that the set of values represents a valid combination. When a rule is violated, you cannot tell which value is the offender unless the rule involves only one data element. In addition, rules can be satisfied by completely inaccurate data. In the previous example, if the employee record showed EMPLOYEE_TYPE as part-time and PAY_TYPE as hourly, the rule would be satisfied even though it is possible the EMPLOYEE_TYPE field is inaccurate (i.e., the person is really a full-time employee). In fact, both EMPLOYEE_TYPE and PAY_TYPE can satisfy the rule even though both are inaccurate.
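
A sketch of the part-time/hourly rule applied to hypothetical employee rows; a violation flags the record without revealing which of the two fields is wrong:

    def violates_pay_rule(row):
        """Rule: if EMPLOYEE_TYPE is part-time, PAY_TYPE must be hourly."""
        return row["EMPLOYEE_TYPE"] == "PART_TIME" and row["PAY_TYPE"] != "HOURLY"

    rows = [
        {"EMP_NO": 101, "EMPLOYEE_TYPE": "PART_TIME", "PAY_TYPE": "SALARIED"},  # violation
        {"EMP_NO": 102, "EMPLOYEE_TYPE": "PART_TIME", "PAY_TYPE": "HOURLY"},    # passes, yet either
        {"EMP_NO": 103, "EMPLOYEE_TYPE": "FULL_TIME", "PAY_TYPE": "SALARIED"},  # field could still be wrong
    ]

    print([r["EMP_NO"] for r in rows if violates_pay_rule(r)])  # [101]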

Aggregation Correlation

Aggregation correlation analysis is used to identify the presence of inaccuracies through examining aggregated values over large amounts of data. For example, a rule may say that the count of orders by region for a week should be greater than 100 and less than 1,000. Violations would indicate that either data is missing or orders have the wrong date on them.
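
A sketch of the weekly order-count rule, with assumed counts by region; the 100 and 1,000 bounds come from the rule stated above:

    # Hypothetical order counts for one week, keyed by sales region.
    weekly_counts = {"NORTH": 480, "SOUTH": 72, "EAST": 1250, "WEST": 310}

    LOW, HIGH = 100, 1000  # rule: the weekly count per region must stay between these bounds

    suspect = {region: n for region, n in weekly_counts.items() if not (LOW < n < HIGH)}

    print(suspect)  # {'SOUTH': 72, 'EAST': 1250} -- data may be missing, or orders carry wrong dates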

Value Inspection

Values can often be computed that allow the analyst to inspect them and determine whether they are reasonable or not. Value inspection differs from aggregation correlation in that it applies to cases where it is not possible to create a clear rule that defines the boundary between right and wrong. Small amounts of inaccurate data may not provide a clear indicator. However, clearly unreasonable data will jump out at the analyst.

For example, the distribution of values within a column may indicate that a problem exists. If 60% of the records indicate the same color of product even though you know that the colors should be approximately evenly divided over a set of 10 colors, the data is clearly inaccurate.
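
A sketch of computing such a distribution for inspection, using a hypothetical sample of product records; no hard rule fires here, but the skew is obvious to the analyst:

    from collections import Counter

    colors = ["RED"] * 60 + ["BLUE"] * 12 + ["GREEN"] * 10 + ["YELLOW"] * 18  # hypothetical sample

    counts = Counter(colors)
    total = sum(counts.values())
    for color, n in counts.most_common():
        print(f"{color:8s} {n:4d}  {100 * n / total:5.1f}%")
    # RED at 60% of records, when the colors should be roughly evenly divided,
    # is the kind of skew that jumps out even though no single record is provably wrong.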

Aggregation tests can be applied to distribution of values, counts, sums, averages, and medians. They can also be used to compare aggregations between groups such as between sales regions.

In summary, element analysis is used to find only invalid values. Structural analysis, value correlation, aggregation correlation, and value inspection are used to find the presence of inaccuracies among valid values. They cannot identify the offending values but can determine with certainty that some of the values are wrong.

Looks Like a Duck, Walks Like a Duck

Although it is clear that you cannot find all inaccurate data values in a database, there is clearly value in finding what you can detect. If a database has inaccurate data within it, it is very likely that analytical techniques will surface enough of it to get a good picture of the health of the data. It is highly unlikely that all or most of the inaccurate values can hide out as valid values that successfully correlate through all rules. You will find enough inaccurate data, provided you do a thorough enough job of analysis.

You also do not need to worry about seeing only the tip of the iceberg. In most cases, the inaccuracies disclosed through analysis represent a significant portion of the bad values.

A quality improvement program will probably improve the accuracy of data that was not discoverable through analysis along with the inaccuracies that were discovered. This is a secondary effect. For example, if your analysis indicates that 8% of the records in a specific database have at least one wrong value, you implement changes targeting the areas you uncovered and then reassess to find that the new number is 1%. The actual improvement is probably greater than the 7 percentage points indicated because of the effect of improvements on values not discoverable through analysis. The result is that users may feel that a much bigger improvement has actually taken place.

How Important Is It to Get Close?

You can achieve very high levels of accuracy of data within a database if enough energy and resources are applied. Although data accuracy is only one component of data quality, it is clearly the single most important component. If the data values are just plain wrong, missing, or inconsistent, any attempt to use the data will be tainted. Every quality improvement program must begin by dealing with data accuracy.

Most decision support applications have a tolerance level for inaccurate data. Inaccuracies up to the tolerance level allow the application to provide high-quality decisions. The inaccuracies do not change the outcome from what it would be if the data were 100% accurate, provided the data inaccuracies are not unduly biased.

Above the tolerance level, the data will generate wrong decisions, but this will go unnoticed because the decisions are not obviously bad. This is a dangerous situation because the company is acting in the wrong way on data it believes to be good. It leads to inefficiencies that are not noticed. At some higher level of inaccuracies, the data is no longer believed and has no effect on decisions because it is not used. Figure 2.4 depicts the relationship of usefulness and accuracy as a step function influence on tolerance levels.


Figure 2.4: Step function influence on tolerance levels.

Most business analysts have no idea how to compute the tolerance levels for decisions they make. Because they have no idea how bad the data is, they must believe what they see.

This highlights two important aspects of working to improve the accuracy of data. First, you have a good chance of bringing the data accuracy back into the first zone (the good place to be). You do not have to get to zero errors; you just need to get into the good zone. Second, you provide valuable information to the decision makers on the relative accuracy of the data. If you do not have an assessment program, you either blindly believe the data or mistrust it enough to use it only with caution or not at all.

Because decision-making efficiency is a step function of data accuracy, it follows that small improvements in the accuracy can lead to very large payoffs in value. If the quantity of inaccuracies is putting you in the wrong decision zone, and improvements move you into the zone of right decisions, the difference in value to the corporation can be enormous.

If you have no data quality program, there is probably a huge potential value in instituting one. You have no idea how much value is there because you are blindly using the data you have and have no idea how bad it is. The cost of lowering the percentage of inaccurate data will almost always pay off big for early improvements. As you get closer to zero errors, the cost will prove to be excessive in comparison to gain. However, you can get very close before the crossover occurs.

Closing Remarks

Data accuracy is the most visible and dramatic dimension of data quality. It is the easiest to expose and the easiest to improve, and improvements often require neither system reengineering nor a reorganization of your corporation. Although you cannot get to perfect accuracy in your data, you can improve the accuracy to the point where it consistently provides information that drives correct decisions.

Data accuracy is a complex subject that needs to be fully understood. The concepts of valid versus invalid, inconsistencies in representation, object-level inconsistency, representation of values not known, and missing information are all part of defining accuracy.

There are two methods of determining the accuracy of data: reverification and data analysis. Neither one can guarantee finding all inaccurate values.

Reverification is generally too expensive and slow to be effective. Analytical techniques are easier to use. Analytical techniques require that you understand what the definition of "correct" is for each data element. Just as accurate data is the foundation component of data quality, analytical techniques are the foundation component of any effective data quality assurance program.
