Data Quality, Data Balance and Data Documentation: a framework
To promote the responsible development and use of artificial intelligence (AI), principles of trustworthiness, accountability and fairness should be followed. These principles can be threatened by various ethical challenges. Several approaches can be used to address these challenges. One of them is to work at the data level, since data is one of the key elements of the AI pipeline.
However, ensuring high data quality alone is not sufficient to avoid all ethical concerns. Expanding data quality frameworks to include data balance and data documentation can help address critical ethical aspects of AI systems. The proposed framework therefore introduces two additional quality measures: the assessment of data balance and the quality of data documentation. The former can be useful to identify risks of disproportionate treatment of different groups based on their protected characteristics. The latter emphasises the importance of documenting datasets, making them more transparent and accountable.
By integrating these measures into the development pipeline through appropriate data labels, we aim to empower practitioners to build more responsible systems. We also outline future research directions for automating the evaluation of these metrics.
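To make the idea of a data balance assessment concrete, the following is a minimal sketch of one possible balance score for a protected attribute. The abstract does not state which measures the framework uses. Normalized Shannon entropy is chosen here only as a common illustrative option, and the function name `normalized_entropy` and the sample values are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the group frequencies, scaled to [0, 1].

    1.0 means all observed groups are equally represented;
    values near 0.0 mean one group dominates the dataset.
    Illustrative choice only, not necessarily the framework's metric.
    """
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        # Convention assumed here: with fewer than two observed groups
        # there is nothing to balance, so report 0.0.
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Divide by the maximum entropy for k groups to normalize.
    return entropy / math.log(k)


# Hypothetical values of a protected attribute in two datasets.
balanced = ["group_a", "group_b"] * 5
skewed = ["group_a"] * 9 + ["group_b"]

print(normalized_entropy(balanced))
print(normalized_entropy(skewed))
```

A score of this kind could be attached to a dataset as one entry of a data label, so that a low value flags a risk of disproportionate treatment before a model is trained. Which threshold counts as "too low" is a design decision that this sketch does not settle.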