Designing your own tests

Before we jump into the API and all the premade tests that have been written to make your life easier, let’s talk about a process for designing your own machine learning tests. This matters because machine learning tests are not like other software engineering tests. Software engineering tests are deterministic, like software engineering code ought to be. However, when you write tests for your data or your machine learning model, you need to account for the probabilistic nature of the code you are writing. The goal, therefore, is much fuzzier. But the process defined below should help you out.

It’s About Proving Or Disproving Assumptions

There is a standard set of steps to any machine learning project:

1. Exploratory Analysis
2. Data Cleaning
3. Model Evaluation
4. Productionalizing The Model
5. Monitoring The Model

Machine learning tests are really about monitoring, but the big open question is, what do you monitor?

Monitoring what you did in steps 1-3 above gives you at least a baseline. There will likely be other things to account for and monitor once you go into production, but what you’ve found in evaluation will likely be helpful later, so it should inform your first set of tests.

Data Monitoring Tests

Specifically, we can monitor the data by:

checking to see if any descriptive statistics you found have changed substantially
checking to see if current data is correlated with previous data per column
checking to see if columns that were correlated or uncorrelated in past data remain that way
checking to see if the number of clusters in the data has changed in a meaningful way
checking to see whether the number of missing values stays consistent between new and old data
checking to see whether certain monotonicity requirements between columns remain consistent
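As a rough illustration of the first and fifth checks in this list, here is a minimal sketch using only the Python standard library. The function names, the thresholds (`max_shift_in_std`, `tolerance`) and the sample values are all assumptions made for this example. They are not part of any library API, and sensible thresholds depend on your data.

```python
import math
import statistics


def mean_has_drifted(reference, current, max_shift_in_std=0.5):
    """Return True if the mean of `current` moved more than
    `max_shift_in_std` reference standard deviations (hypothetical threshold)."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    shift = abs(statistics.fmean(current) - ref_mean)
    return shift > max_shift_in_std * ref_std


def missing_rate(values):
    """Fraction of entries that are None or NaN."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def missing_rate_is_consistent(reference, current, tolerance=0.05):
    """Return True if the missing-value rates differ by at most `tolerance`."""
    return abs(missing_rate(reference) - missing_rate(current)) <= tolerance


# Made-up column values standing in for "old" and "new" data.
old_column = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
new_column = [12.0, 13.0, 12.5, 11.5, 12.0, 13.5]
print(mean_has_drifted(old_column, new_column))
```

The same pattern extends to other descriptive statistics (median, variance, quantiles): compute the statistic on the reference data once, then assert that new data stays within an agreed band around it.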

It is imperative to monitor the data because your model is merely a function of your data. If your data is bad or changes in some important way, your model will be useless. There may also be other measures you used to evaluate the data, and those may become important features of whatever model you build later on. Making sure your data continues to follow the trends you found previously is therefore important. Otherwise, your model might be wrong and you’d never know it.
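One kind of trend worth pinning down is the relationship between two columns. The sketch below, again standard library only, checks whether a Pearson correlation measured on old data still holds on new data, and whether one column never decreases as another increases. The function names and the `tolerance` default are hypothetical and chosen purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_is_stable(ref_x, ref_y, new_x, new_y, tolerance=0.2):
    """Return True if the correlation changed by at most `tolerance`."""
    return abs(pearson(ref_x, ref_y) - pearson(new_x, new_y)) <= tolerance


def is_monotonic_increasing(xs, ys):
    """Return True if ys never decrease when rows are ordered by xs."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))
```

Note that `pearson` divides by zero if either column is constant, so a real test would need to decide how to treat that case.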

Model Monitoring Tests

Additionally, we can monitor the model itself:

checking to see if the model meets all metric requirements as specified by the business use-case
checking to see if the model does better than some other test model on all measures of interest
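Both model checks can be sketched in a few lines. The example below assumes a binary classifier, uses accuracy and recall as stand-ins for whatever metrics your use-case specifies, and compares against a trivial baseline that always predicts the negative class. All names, labels and minimum scores here are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


def meets_requirements(scores, minimums):
    """Return True if every required metric reaches its minimum."""
    return all(scores[name] >= floor for name, floor in minimums.items())


def beats_baseline(scores, baseline_scores):
    """Return True if the model is strictly better on every shared metric."""
    return all(scores[name] > baseline_scores[name] for name in baseline_scores)


# Made-up labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
model_pred = [1, 0, 1, 1, 0, 0, 0, 0]
baseline_pred = [0] * len(y_true)

model_scores = {
    "accuracy": accuracy(y_true, model_pred),
    "recall": recall(y_true, model_pred),
}
baseline_scores = {
    "accuracy": accuracy(y_true, baseline_pred),
    "recall": recall(y_true, baseline_pred),
}
print(meets_requirements(model_scores, {"accuracy": 0.8, "recall": 0.7}))
print(beats_baseline(model_scores, baseline_scores))
```

Keeping the minimums in a plain dictionary makes it easy to review them with whoever owns the business requirement.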

System Monitoring Tests

Finally, there are also traditional tests one should run:

making sure the serialized model exists where expected
making sure the data exists where expected
making sure data can flow into the system, to the model and through it
making sure the new data matches the types you expect
making sure the model produces the types you expect
making sure new models can be deployed to the model pipeline
making sure the model can perform well under load
making sure the data can flow through fast enough to reach the model at ingress and egress
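A few of these checks are simple enough to sketch directly. The example below assumes the model is serialized with `pickle` and that incoming rows arrive as dictionaries. Both are assumptions made for illustration, as are the function names, the column names and the stand-in "model". Only unpickle files you trust.

```python
import pickle
import tempfile
from pathlib import Path


def model_artifact_loads(path):
    """Return True if a file exists at `path` and can be unpickled."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        pickle.load(handle)
    return True


def rows_match_schema(rows, schema):
    """Return True if every row has exactly the expected columns and types."""
    return all(
        set(row) == set(schema)
        and all(isinstance(row[col], typ) for col, typ in schema.items())
        for row in rows
    )


def predictions_have_type(predict, rows, expected_type):
    """Return True if `predict` returns `expected_type` for every row."""
    return all(isinstance(predict(row), expected_type) for row in rows)


# A dictionary stands in for a real serialized model here.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    model_path.write_bytes(pickle.dumps({"weights": [0.4, 0.6]}))
    print(model_artifact_loads(model_path))
    print(model_artifact_loads(Path(tmp) / "missing.pkl"))
```

Load and throughput checks are deliberately left out of this sketch, since meaningful timing thresholds depend on the deployment environment.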

These three classes of machine learning system evaluation form a minimal reference set for monitoring such a system. There are likely more tests you’ll need to write, but to restate the process in clear terms:

Look at what you wrote when you did exploratory analysis and data cleaning, and turn those findings into tests to make sure your data stays that way for as long as it’s supposed to.
Look at how your model performed on test and training data, and turn those evaluation measures into tests to make sure your model performs as well in production.
Make sure everything actually goes from point A (the start of your system) to point B (the end of your system).

Fairness Monitoring Tests

There is a fourth, less clearly defined class of tests that concerns the ethical nature of the algorithm you are building. These tests are unfortunately poorly defined at the present moment and very context specific, so all that can be offered is an example of what one might do:

Suppose you worked for a bank and were writing a piece of software that determined who gets a loan. In a fair system, folks of all races, genders, and ages would get loans at a similar rate, or at least would not be rejected because of race, gender, age or other such factors.

If, when accounting for some protected variable like race, gender, or age, your algorithm behaves differently than it does when not accounting for that variable, then your algorithm may be biased.
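One very simple way to look for this kind of oddity is to compare approval rates across groups. The sketch below computes the gap between the highest and lowest group approval rate. This is only one narrow notion of fairness, chosen here as an assumption for illustration. The function names, the group labels and the decisions are all made up, and a small gap does not by itself show that a system is fair.

```python
from collections import defaultdict


def approval_rates(decisions, groups):
    """Approval rate per group, where each decision is 1 (approved) or 0."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        approved[group] += int(decision)
    return {group: approved[group] / totals[group] for group in totals}


def approval_rate_gap(decisions, groups):
    """Difference between the highest and lowest group approval rate."""
    rates = approval_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Made-up loan decisions for two hypothetical groups "a" and "b".
decisions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(approval_rates(decisions, groups))
print(approval_rate_gap(decisions, groups))
```

A test built on this would assert that the gap stays below a limit agreed on for the specific context, which is a policy decision rather than a statistical one.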

However, this field of research is far from complete. There are some notions of how to test for this, but at the present moment they appear to need further research and analysis. Still, where possible, one should include such a set of tests to help ensure your algorithm is fair, unbiased and treats all individuals equally.