With the emerging ability to store and analyze big data, many organizations are making data quality the responsibility of a single dedicated function. This data governance role works to protect and improve the four qualities of solid data.
Proper data governance will assess the data’s quality, then work to maintain and improve it over time. The first step, data quality assessment, audits the data’s accuracy, completeness, validity, and consistency. Once complete, the audit guides subsequent data quality efforts and sets the benchmark for future assessments.
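The audit step can be sketched as a small script that scores a data set on completeness and validity. This is a minimal illustration, not a standard method; the field names and thresholds below are hypothetical assumptions.

```python
import re

# Hypothetical sample records; the third is incomplete (no email)
# and invalid (implausible age).
records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "not-an-email",    "age": 29},
    {"id": 3, "email": None,              "age": 151},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def audit(rows):
    """Score completeness and validity; the result becomes the benchmark."""
    total = len(rows)
    complete = sum(1 for r in rows if r["email"] is not None)
    valid_email = sum(1 for r in rows if r["email"] and EMAIL_RE.match(r["email"]))
    valid_age = sum(1 for r in rows if 0 <= r["age"] <= 120)  # assumed plausible range
    return {
        "completeness": complete / total,
        "email_validity": valid_email / total,
        "age_validity": valid_age / total,
    }

benchmark = audit(records)
```

Rerunning the same audit after each cleansing pass lets the scores be compared against this benchmark.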
The second step of data governance involves data cleansing and transformation. This means using software tools like Microsoft’s SQL Server or OpenRefine (formerly Google Refine) to validate and standardize the data while removing redundancies. However, software alone cannot address accuracy or completeness issues without cross-referencing the data against an independent source.
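A cleansing pass of this kind typically standardizes formatting and then removes duplicates that the standardization exposes. The sketch below makes illustrative assumptions about the data; as noted above, it can fix form but not verify accuracy against the real world.

```python
# Hypothetical raw records: inconsistent casing and whitespace hide a duplicate.
raw = [
    {"name": " Alice Smith ", "city": "new york"},
    {"name": "alice smith",   "city": "New York"},
    {"name": "Bob Jones",     "city": "chicago"},
]

def standardize(row):
    # Trim whitespace and normalize capitalization in every field.
    return {k: v.strip().title() for k, v in row.items()}

def cleanse(rows):
    # Standardize first, then drop exact duplicates of the normalized form.
    seen, out = set(), []
    for row in map(standardize, rows):
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

clean = cleanse(raw)  # the two "Alice Smith" variants collapse into one record
```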
Over time, data quality will naturally deteriorate: addresses will change, buying habits will fluctuate, and so forth. Data cleansing and transformation exist solely to evaluate existing data and are not suited to maintaining the quality of new data. Eradicating the root causes of bad information typically involves dedicated data quality teams and line managers. These team members understand the data, its uses, and its processes. They use that understanding to produce data standards that filter out bad data through a variety of methods, one of which, the data quality firewall, can be semi-automated.
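The firewall idea can be sketched as a set of standards applied at the point of entry: a record must pass every rule before it reaches the store, and rejects are routed back for correction. The rules here are illustrative assumptions, not a fixed standard.

```python
# A toy data quality firewall: each rule encodes one data standard.
def has_required_fields(r):
    return all(r.get(f) not in (None, "") for f in ("id", "email"))

def plausible_age(r):
    return 0 <= r.get("age", -1) <= 120  # assumed plausible range

RULES = [has_required_fields, plausible_age]

def firewall(rows):
    """Split incoming records into accepted and rejected streams."""
    accepted, rejected = [], []
    for r in rows:
        (accepted if all(rule(r) for rule in RULES) else rejected).append(r)
    return accepted, rejected

ok, bad = firewall([
    {"id": 1, "email": "a@b.com", "age": 40},
    {"id": 2, "email": "",        "age": 40},   # fails required-field rule
    {"id": 3, "email": "c@d.com", "age": 200},  # fails age rule
])
```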
While bad data sources can be eliminated, data quality requires constant monitoring to guard against internal errors, bugs, and outdated information. Many companies turn to third-party continuous monitoring systems. These systems minimize downtime and run externally to the system being watched. This independence prevents a system’s problems from affecting the analysis.
The traditional approaches to improving data quality can be manual or digital. Manual methods require human interaction and, as such, are best suited to small data sets. Large data sets demand cost-prohibitive amounts of manual labor and are more susceptible to human error.
Digital methods typically break down into four categories:
- Native solutions use software specialized for data native to a particular system. They are usually expensive, though efficient, and work only within the confines of the assigned system.
- Task-limited solutions offer more breadth; this software can work with a multitude of systems but has limited functionality (e.g., removing duplicates).
- SQL-based solutions and their kind are not data-specific and function best for initial data assessment. Long-term use of these solutions may reduce flexibility and increase operational costs unless team members are intimately familiar with the software.
- In-house customized solutions are written for a specific purpose, tailored to the needs of the company. The inherent customization may suit some organizations; for others, the cost of development, maintenance, and training will prevent their use.
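For instance, the kind of initial assessment an SQL-based solution performs can be sketched with a few queries. The example below uses SQLite from Python’s standard library; the table, columns, and sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical customer table with one missing email and one duplicate id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "a@b.com"), (2, None), (2, None), (3, "c@d.com")])

# Completeness: how many rows are missing an email?
missing = con.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL").fetchone()[0]

# Consistency: how many ids appear more than once?
dupes = con.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM customers "
    "GROUP BY id HAVING COUNT(*) > 1)").fetchone()[0]
```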
Data quality must be assessed and nurtured if it is to be of any use. While an initial audit will find problems and allow for data cleansing and transformation, most data requires a dedicated team to find and eliminate bad data sources. As big data analytics enters the picture, data governance becomes the only practical way to prevent costly decisions based on the analysis of corrupt information.
About the Author: Jake Gavins works as a custom applications developer and has several years of business intelligence development experience. In his spare time, he enjoys golfing, hiking, and keeping up with the latest in technology.