Measures of data quality
Capturing the user’s perspective on data quality is undoubtedly a valuable initial step, but it may not cover every aspect worth testing. Extensive literature reviews address this gap for us, suggesting dimensions of data quality that are relevant to most use cases. It’s a good idea to review the list with data users, jointly determine which dimensions apply, and design the tests accordingly.
|   |   |   |
|---|---|---|
| Accuracy | Format | Comparability |
| Reliability | Interpretability | Conciseness |
| Timeliness | Content | Freedom from bias |
| Relevance | Efficiency | Informativeness |
| Completeness | Importance | Level of detail |
| Currency | Sufficiency | Quantitativeness |
| Consistency | Usableness | Scope |
| Flexibility | Usefulness | Understandability |
| Precision | Clarity | |
This list may seem long, and you may wonder where to start. Data products, like any information system, can be observed and analyzed from two perspectives: external and internal.
External view
The external view refers to the use of data and its relationship to the organization. The system is often treated as a “black box” whose functionality represents a real-world process. The dimensions that fall into the external view are highly business oriented. Evaluating these dimensions can be subjective, so it is not always easy to build automated tests for them. Let’s look at some well-known dimensions:
- Relevance: How usable and useful the data is for analysis. Consider a marketing campaign aimed at promoting a new product. All data attributes should directly contribute to campaign success, such as customer demographics and purchase data. Data such as city weather or stock market prices are irrelevant in this case. Another example is the level of detail (granularity): if the business wants market data at a daily level but it is delivered at a weekly level, then it is neither relevant nor useful.
- Representation: The extent to which the data is interpretable for data users and the data format is consistent and descriptive. When assessing data quality, the importance of the representation layer is often overlooked. It requires the format of the data to be consistent and user-friendly, and the meaning of the data to be understandable. For example, consider a scenario where the data is expected to arrive as a CSV file with descriptive column names, and values are expected to be in euros rather than cents.
- Timeliness: How fresh is the data for its consumers? For example, a business may need sales transaction data with a maximum delay of 1 hour from the point of sale. This implies that the data pipeline must refresh frequently.
- Accuracy: How well the data conforms to business rules. Data metrics are often governed by complex business rules such as data mappings, rounding modes, and so on. Automated tests of this data logic are recommended, and the more the better.
Of these four dimensions, timeliness and accuracy are the more straightforward to turn into data tests. Timeliness can be checked by comparing the timestamp column against the current timestamp. Accuracy can be tested with custom queries that encode the business rules.
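The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a production framework; the column names (`ts`, `amount_cents`, `amount_eur`) and the euro/cent rule are hypothetical examples standing in for your own business rules.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows from a sales-transactions table.
rows = [
    {"ts": datetime.now(timezone.utc) - timedelta(minutes=20),
     "amount_cents": 1999, "amount_eur": 19.99},
    {"ts": datetime.now(timezone.utc) - timedelta(minutes=50),
     "amount_cents": 500, "amount_eur": 5.00},
]

def timeliness_ok(rows, max_delay=timedelta(hours=1)):
    """Timeliness: the newest record must be at most `max_delay` old."""
    newest = max(r["ts"] for r in rows)
    return datetime.now(timezone.utc) - newest <= max_delay

def accuracy_ok(rows):
    """Accuracy: business rule that the euro amount equals cents / 100."""
    return all(abs(r["amount_eur"] - r["amount_cents"] / 100) < 1e-9
               for r in rows)

print(timeliness_ok(rows), accuracy_ok(rows))  # True True
```

In practice these assertions would run against a warehouse table on a schedule, with the maximum delay agreed with the data consumers.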
Internal view
In contrast, the internal view refers to the operation of the system and remains independent of specific requirements. These dimensions are essential regardless of use case, and they are more technically oriented than the business-oriented dimensions of the external view. That also means the data tests are less user-dependent and can be automated most of the time. Here are some key dimensions:
- Quality of the data source: The quality of the source data significantly affects the overall quality of the final data. A data contract is an excellent initiative for ensuring source data quality. As consumers of source data, we can monitor it in much the same way data stakeholders evaluate data products.
- Completeness: How much of the information is stored in its entirety. As the complexity of the data pipeline increases, so does the probability of information loss at intermediate stages. Consider a financial system that stores customer transaction data. A completeness test ensures that every transaction goes through the entire lifecycle successfully without being skipped or missed. For example, the final account balance should accurately reflect the real-world situation, capturing all transactions without any omissions.
- Uniqueness: This dimension goes hand in hand with the completeness test. While completeness guarantees that nothing is lost, uniqueness ensures that data is not duplicated.
- Consistency: How consistent the data is across internal systems and from day to day. Inconsistency is a common data problem that often stems from data silos or inconsistent metric calculations. Another aspect of consistency arises between days when a steady growth pattern is expected; any deviation should raise a flag for further investigation.
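The completeness, uniqueness, and cross-system consistency checks described above can be sketched as simple set and aggregate comparisons. This is an assumed toy setup: `source` and `target` stand in for two systems (say, an operational database and a warehouse), and the `id`/`amount` fields are illustrative.

```python
# Hypothetical transaction records in two systems that should agree.
source = [{"id": 1, "amount": 100}, {"id": 2, "amount": -40}, {"id": 3, "amount": 25}]
target = [{"id": 1, "amount": 100}, {"id": 2, "amount": -40}, {"id": 3, "amount": 25}]

def complete(source, target):
    """Completeness: every source transaction made it to the target."""
    return {r["id"] for r in source} <= {r["id"] for r in target}

def unique(target):
    """Uniqueness: no transaction was loaded more than once."""
    ids = [r["id"] for r in target]
    return len(ids) == len(set(ids))

def consistent(source, target):
    """Consistency: the same metric (total balance) agrees across systems."""
    return sum(r["amount"] for r in source) == sum(r["amount"] for r in target)

print(complete(source, target), unique(target), consistent(source, target))
```

Because these checks depend only on the data itself rather than on user requirements, they are good candidates for fully automated runs after every pipeline load.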
Note that each dimension can be associated with one or more data tests. It is critical to understand which dimensions apply to a specific table or metric; only then does adding more tests actually help.
So far we have discussed the dimensions of the external and internal views. When designing data tests, it is important to consider both perspectives. By asking the right questions of the right people, we can improve efficiency and reduce miscommunication.