The importance of large amounts of high-quality data for AI development hardly needs emphasizing: data limitations lead to poor results, which can harm society and individual rights.
Lack of data access, availability, and quality can have far-reaching negative consequences, including racial bias. The ability to fully utilize artificial intelligence (AI) can be hampered by a lack of high-quality, harmonized data.
Speech technologies, trading and investment tools, medical research, law enforcement, environmental forecasting, and self-driving cars are just a few examples of how AI is used in products that improve people’s lives. In every case, the quality of the available data shapes the quality of the final product.
This article will go over some of the key data characteristics that have an impact on AI development.
1. Data quality and governance
Although some types of AI are designed to work with unstructured or low-quality data, data quality is critical for most AI developments to make accurate, valid, and unbiased real-world decisions. There is no single definition of data quality, and different approaches have been taken to defining data quality benchmarks. However, the common denominator for quality data is ‘data that is fit for use’: data that meets specifications, requirements, and expectations. Data quality is commonly defined in terms of the data’s accuracy, completeness, accessibility, consistency, and readability.
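To make this concrete, the sketch below scores a small tabular dataset on a few of these dimensions (completeness, consistency, validity) using pandas. It is only an illustrative sketch: the column names, the example data, and the 0–120 age range are assumptions, not part of any standard.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Compute a few simple, illustrative data-quality indicators."""
    report = {
        # Completeness: share of non-missing cells across the whole table.
        "completeness": 1.0 - df.isna().mean().mean(),
        # Consistency/uniqueness: share of rows that are not exact duplicates.
        "uniqueness": 1.0 - df.duplicated().mean(),
    }
    # Validity: share of values falling inside a plausible range,
    # here a hypothetical 'age' column constrained to 0-120 years.
    if "age" in df.columns:
        report["age_validity"] = df["age"].between(0, 120).mean()
    return report

# Example: a tiny dataset with one missing value and one implausible age.
df = pd.DataFrame({"age": [34, None, 250], "city": ["Oslo", "Lima", "Lima"]})
print(quality_report(df))
```

In practice, such indicators would be tied to the specifications and expectations of the particular use case, which is what ‘fit for use’ ultimately means.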
2. Representativeness
The training data must be representative of the real-world population for an algorithm to perform well across groups. Many issues can arise from the underrepresentation of real-world scenarios, such as bias in facial recognition and gender classification tools.
NIST attempted to quantify the accuracy of facial recognition algorithms across demographic groups defined by sex, age, race, or country of birth, using four datasets from different sources in the United States. According to the findings, false positive rates were elevated for women, indigenous peoples, Asians, and African Americans. False positive rates on images of Asian subjects were significantly lower for algorithms developed in China. This difference in performance based on development location further emphasizes the importance of representativeness in the training data used.
Simply put, if the training and validation data do not reflect the real-world population, AI is likely to perform poorly for underrepresented groups in real-world applications, potentially exacerbating existing inequalities. The other side of this issue is the overrepresentation of sensitive characteristics introduced through data collection and selection.
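One simple way to surface under- or overrepresentation is to compare group frequencies in the training data with reference population shares, for example from a census. The sketch below illustrates the idea; the group labels, counts, and reference shares are invented for the example.

```python
import pandas as pd

# Hypothetical training-set counts versus reference population shares.
train_counts = pd.Series({"group_a": 8000, "group_b": 1500, "group_c": 500})
population_share = pd.Series({"group_a": 0.60, "group_b": 0.25, "group_c": 0.15})

train_share = train_counts / train_counts.sum()
# Representation ratio: < 1 means the group is underrepresented relative
# to the reference population, > 1 means it is overrepresented.
representation = (train_share / population_share).round(2)
print(representation)

# Flag groups falling below, say, 80% of their population share.
print("Underrepresented:", list(representation[representation < 0.8].index))
```

Checks like this do not guarantee fair outcomes, but they make representational gaps visible before a model is trained.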
3. Accuracy
Data accuracy refers to the degree to which a piece of information or a value conveys the true state of its source, that is, whether it is factually correct and unambiguous. The accuracy of a data value is determined by comparing it to a known reference value. The degree to which accuracy can be determined varies, and it can depend on context or on additional information that needs to be verified. Data curation and accuracy evaluation frequently require the input of trained domain experts and data curators. However, verification and accuracy assessment are impossible in many cases, such as with data corpora made up of social media posts. The value of data accuracy varies by application, but in general, accuracy is one of the most important data characteristics influencing AI outcomes.
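Where a trusted reference does exist, accuracy can be estimated by direct comparison. The sketch below compares recorded values against reference values within a tolerance; the column names, example values, and tolerance are illustrative assumptions rather than a standard procedure.

```python
import pandas as pd

# Hypothetical recorded measurements alongside trusted reference values.
df = pd.DataFrame({
    "recorded":  [98.6, 101.2, 97.0, 120.4],
    "reference": [98.6, 100.9, 99.5, 120.4],
})

tolerance = 0.5  # acceptable absolute deviation, chosen for illustration
df["accurate"] = (df["recorded"] - df["reference"]).abs() <= tolerance
print(f"Estimated accuracy: {df['accurate'].mean():.0%}")  # share within tolerance
```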
4. Completeness
The term “data completeness” refers to data that contains no missing values. A complete dataset has no flaws that affect its usability, accuracy, and integrity. Because of the rapid changes in big data, incompleteness is a common characteristic of low-quality data. If data is not acquired in real time or processed promptly, there is a risk of using out-of-date data and, as a result, producing inaccurate results and making incorrect decisions or predictions, which could have serious financial and ethical consequences. An algorithm that makes predictions from temporal data therefore requires timely data, and the training data will likely need to be updated as needed. Completeness can be achieved through data correction or imputation, both of which can be time-consuming.
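The sketch below shows what such a completeness check and a simple imputation step might look like with pandas and scikit-learn; the columns and values are hypothetical, and median imputation is just one of many possible strategies whose effect on accuracy should be validated for the task at hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing measurements.
df = pd.DataFrame({
    "blood_pressure": [120, np.nan, 135, 128],
    "heart_rate": [72, 80, np.nan, 65],
})

# Report per-column completeness before correcting anything.
print("Missing values per column:\n", df.isna().sum())

# Median imputation as a simple, illustrative correction strategy.
imputer = SimpleImputer(strategy="median")
df_complete = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_complete)
```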
5. Accessibility
Accessibility of relevant datasets is an important consideration, as AI development thrives on access to big and varied datasets. Accessibility is shaped by several factors: privacy regulations such as the GDPR or HIPAA (which protect sensitive data and restrict access to it), legal and administrative barriers that affect timeliness, commercial restrictions, ownership, and discoverability.
6. Coverage
Coverage describes how representative a dataset is in terms of geography and demographics. To avoid bias, a dataset’s coverage must be high. In geospatial AI, which develops ‘digital geospatial information representing space/time-varying phenomena,’ coverage includes both geographic extent and temporal coverage, or ‘frequent revisit times.’ In a healthcare setting, good data coverage may be difficult to achieve in low- and middle-income countries, where healthcare or representative data may be unavailable, and in high-income countries, where access to healthcare may be unequal due to costs.
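Coverage can be audited in much the same way as representativeness: tally how many observations fall into each region (or demographic cell) and flag regions that are empty or sparse. The region names and threshold below are assumptions made for the example.

```python
import pandas as pd

# Hypothetical dataset with one region label per record.
records = pd.DataFrame(
    {"region": ["north", "north", "south", "south", "south", "east"]}
)
expected_regions = ["north", "south", "east", "west"]

counts = records["region"].value_counts().reindex(expected_regions, fill_value=0)
min_records = 2  # illustrative threshold for 'adequate' coverage
print(counts)
print("Under-covered regions:", list(counts[counts < min_records].index))
```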
This post has briefly described the data characteristics that can influence the outcome of AI technology. Data Quality Assessments (DQA) and Data Gap Analysis are two processes that can help assess data and mitigate quality issues. Using good Data Governance to ensure data quality throughout the AI development process is critical in avoiding what has been dubbed “garbage in, garbage out,” where poor data leads to poor results.