The Devil Is in the Data: Machine Learning and Data Governance

January 19, 2021 Michael Easterly

Machine-learning models have become increasingly popular among financial institutions, with applications that include fraud detection, loan underwriting, portfolio management, and securities trading. They are beginning to replace traditional explanatory models, which examine the underlying drivers of modeled outcomes. For many business problems, managers care less about those underlying causal relationships than about reliable predictions or estimates.

Machine learning

Machine-learning models put a premium on predictive accuracy without the constraint of explaining how the algorithm derived its output. Using complex, multistep algorithms, they can capture complicated relationships more efficiently than traditional statistical techniques and often produce more accurate predictions.

These advantages, however, come at a cost. The relationships that machine-learning models capture are often too complex for humans to understand, and the models require recalibration more frequently than traditional econometric models. As a result, users have a limited understanding of how their machine-learning models reach conclusions. This becomes especially problematic if a model adapts too closely to “noise,” or idiosyncratic patterns in the data that do not reflect underlying relationships.
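
To make this risk concrete, here is a minimal sketch, assuming scikit-learn and numpy (tools not named in this post), of how a gap between training and holdout performance can reveal a model that has fit the noise:

```python
# Minimal sketch: an unconstrained model memorizes noise in the training
# sample, which shows up as a large train/holdout performance gap.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Only the first feature carries signal; the rest is pure noise.
y = 2.0 * X[:, 0] + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(f"Train R^2: {model.score(X_train, y_train):.2f}")  # ~1.00
print(f"Test R^2:  {model.score(X_test, y_test):.2f}")    # far lower
```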

The opacity of these models increases the importance of ensuring that the input data are accurate and consistent. Failure to maintain clean data can lead to model outputs that are biased or nonsensical. “Garbage in, garbage out” applies to both traditional econometric and machine-learning models, but the former are easier to monitor through output analytics. Because machine-learning models lack simple, intuitive relationships, catching errors through output reasonability checks is challenging. The table below describes some data-integrity issues that Summit has encountered in our work and their potential consequences.

Data-Integrity Issues and Consequences

| Issue | Example | Consequence |
| --- | --- | --- |
| Data-entry errors | Staff entered data in the opposite direction from what was instructed. | The model learns inaccurate relationships and uses them to make predictions. |
| Unclear unit of analysis | Medical data contained multiple records for the same individual, reflecting repeat visits to providers. | Modelers use inappropriate techniques or misinterpret outputs. |
| Automated software determines variable types or formats | State codes stored as characters (‘01’, ‘02’, etc.) were read by the software as numeric. | The model treats the values as quantities rather than categories. |
| Record keepers censored certain values | To preserve privacy, data collectors recorded incomes greater than $100,000 as exactly $100,000. | The censored variable biases model outputs. |
| Data transformations created missing values | Zeros became missing values when a variable was transformed logarithmically. | Useful data are lost because they are commingled with genuinely missing values. |
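
To make the third and fifth rows concrete, here is a minimal sketch, assuming pandas and numpy (the column names are hypothetical), of how silent type inference and a logarithmic transform can corrupt a dataset:

```python
# Minimal sketch of two issues from the table above; column names are
# hypothetical and pandas/numpy are assumed.
import io
import numpy as np
import pandas as pd

raw = "state_code,income\n01,0\n02,52000\n10,87000\n"

# Automated type inference reads the codes '01' and '02' as the numbers
# 1 and 2, silently discarding their categorical meaning.
df = pd.read_csv(io.StringIO(raw))
print(df["state_code"].tolist())  # [1, 2, 10]

# Safer: declare the intended type explicitly.
df = pd.read_csv(io.StringIO(raw), dtype={"state_code": str})
print(df["state_code"].tolist())  # ['01', '02', '10']

# A log transform turns the zero income into -inf, which downstream
# steps often convert to missing values and silently drop.
with np.errstate(divide="ignore"):
    df["log_income"] = np.log(df["income"])
print(df["log_income"].iloc[0])  # -inf
```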

As the table above indicates, seemingly minor and sometimes difficult-to-detect data issues can have significant effects on model performance. Consequently, Summit implements robust data-governance practices in its modeling engagements. Specifically, we take the following steps to mitigate the risk of corrupt data:

  1. Construct a data dictionary. Summit ensures that all data elements are clearly defined and documented. Creating and maintaining a data dictionary with data element names, variable names, definitions, and formats, as well as the origin of the data and the values they can take, allows all users to understand the data accurately and consistently, even after the team members who constructed the data have left the organization.
  2. Conduct preliminary descriptive statistics and data visualizations. Summit uses these techniques to identify unusual patterns in variables and flag them for evaluation before they influence model predictions. This step may seem obvious, but modelers skip it surprisingly often, with adverse results. (A sketch combining this step with the data dictionary appears after this list.)
  3. Document all tests and transformations. Human memory is fallible, so Summit documents the processes by which the data are merged and linked. We provide clear, annotated programming code and log files that describe the business rules used for data cleaning and record any technical issues that arose (see the logging sketch below). Summit describes these documents in a user guide that also references the data dictionary.
  4. Institute an ongoing performance-metrics regimen. Summit conducts periodic tests to ensure that model outputs meet expectations and predictions remain accurate (see the monitoring sketch below). When a model fails these tests, we reexamine both the model and the data for unusual trends and relationships.
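
To illustrate steps 1 and 2, here is a minimal sketch assuming pandas; the dictionary format, column names, and valid ranges are illustrative assumptions, not Summit's actual tooling:

```python
# Minimal sketch of a machine-readable data dictionary (step 1) and a
# preliminary screening pass (step 2). Assumes pandas; the dictionary
# format, column names, and ranges are hypothetical.
import pandas as pd

DATA_DICTIONARY = {
    "income": {
        "definition": "Annual household income, US dollars",
        "dtype": "float64",
        "valid_range": (0, 10_000_000),
        "source": "loan application form, field 12",
    },
    "state_code": {
        "definition": "Two-digit state code, stored as text",
        "dtype": "object",
        "valid_values": {f"{i:02d}" for i in range(1, 57)},
        "source": "address verification vendor",
    },
}

def screen(df: pd.DataFrame) -> None:
    """Flag values that fall outside the data dictionary's expectations."""
    print(df.describe(include="all"))  # eyeball distributions first
    for col, spec in DATA_DICTIONARY.items():
        if str(df[col].dtype) != spec["dtype"]:
            print(f"{col}: expected {spec['dtype']}, got {df[col].dtype}")
        if "valid_range" in spec:
            lo, hi = spec["valid_range"]
            bad = df[(df[col] < lo) | (df[col] > hi)]
            print(f"{col}: {len(bad)} value(s) out of range")
        if "valid_values" in spec:
            bad = df[~df[col].isin(spec["valid_values"])]
            print(f"{col}: {len(bad)} unexpected code(s)")
```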
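
For step 3, here is a sketch of documenting a cleaning rule in annotated code and a log file, using Python's standard logging module; the business rule shown is hypothetical:

```python
# Minimal sketch of step 3: annotate cleaning rules in code and record
# what they did in a log file. The business rule shown is hypothetical.
import logging
import pandas as pd

logging.basicConfig(filename="cleaning.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def flag_censored_incomes(df: pd.DataFrame) -> pd.DataFrame:
    # Business rule: incomes were top-coded at $100,000 by the data
    # collector, so flag them rather than treating them as exact values.
    capped = df["income"] >= 100_000
    df = df.assign(income_topcoded=capped)
    logging.info("Flagged %d top-coded income record(s) of %d",
                 capped.sum(), len(df))
    return df
```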
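
And for step 4, a sketch of a recurring performance check against a pre-set threshold; the metric, threshold, and function names are illustrative assumptions:

```python
# Minimal sketch of step 4: a periodic check that flags a model for
# review when its holdout error degrades. Assumes scikit-learn; the
# threshold is an illustrative assumption.
from sklearn.metrics import mean_absolute_error

ALERT_THRESHOLD = 5_000.0  # maximum acceptable mean absolute error

def check_model(y_actual, y_predicted) -> bool:
    """Return True if the model still meets expectations."""
    mae = mean_absolute_error(y_actual, y_predicted)
    if mae > ALERT_THRESHOLD:
        print(f"MAE {mae:,.0f} exceeds {ALERT_THRESHOLD:,.0f}; "
              "reexamine the model and the input data.")
        return False
    return True
```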

In summary, advances in statistical theory and data-processing power have enabled the increased use of machine-learning models. For certain business problems, explaining the “how” and “why” behind the results takes a back seat to predictive accuracy. Summit believes that machine-learning techniques can serve as powerful tools when applied appropriately. Nevertheless, we suggest that modelers exercise extra caution with regard to data integrity. Implementing a systematic data-governance approach can mitigate this risk.
