We see a lot of machine learning projects that have failed to get results or are on the verge of going off the rails. Often, our tools and structured approach can help, but sometimes even they cannot. Machine learning projects are far more likely to succeed when a few key factors are in place. Here are three ways to set a project up for success.
1. Understand Ground Truth
Machine learning isn't a magic wand, and it doesn't work by telepathy. An algorithm for any machine learning project needs data and examples of what it is trying to detect. It also needs examples of what it is not trying to detect so that it can tell the difference. This is particularly true of "supervised learning" algorithms, which must train on a sufficient number of labeled examples to generate results. The same applies to "unsupervised learning" algorithms, which attempt to discover hidden relationships in data without being told ahead of time what to look for. If relationships of interest don't exist in the data, no algorithm will find them.
2. Curate the Data
Data should be clean and well-curated. To get the best results, you need to be able to trust the quality of the data. Misclassifications in training data can be particularly damaging in supervised learning situations. Some algorithms (like ours) can compensate for occasional misclassifications in training data, but pervasive problems can be hard to overcome.
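The effect of mislabeled training examples can be sketched with a toy experiment. The block below is only an illustration on synthetic data, not a description of any particular product's algorithm: the one-dimensional data, the 10% flip rate, and the function names are all assumptions made for the example. It compares a nearest-neighbor rule that trusts every single label (k=1) with one that takes a majority vote over 15 neighbors.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)

def clean_sample():
    """Synthetic example: the true label is 1 exactly when x is positive."""
    x = rng.uniform(-1.0, 1.0)
    return (x, 1 if x > 0 else 0)

# The test set keeps its correct labels.
test_set = [clean_sample() for _ in range(500)]

# Training set with roughly 10% of labels flipped (assumed rate, for illustration).
FLIP_RATE = 0.10
train_set = []
for _ in range(500):
    x, y = clean_sample()
    if rng.random() < FLIP_RATE:
        y = 1 - y
    train_set.append((x, y))

brittle_accuracy = accuracy(train_set, test_set, k=1)   # trusts every label
robust_accuracy = accuracy(train_set, test_set, k=15)   # outvotes stray labels
print(f"k=1  accuracy: {brittle_accuracy:.2f}")
print(f"k=15 accuracy: {robust_accuracy:.2f}")
```

In this sketch the voting rule absorbs the occasional bad label, while the k=1 rule repeats every labeling mistake it happens to sit next to. Raising FLIP_RATE toward 0.5 erodes the voting rule as well, which is the "pervasive problems" case.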
3. Don't Overtrain
Overtraining occurs when a machine learning model can predict training examples with very high accuracy but cannot generalize to new data. This leads to poor performance in the field. Usually, it is a result of too little data or data that is too homogeneous, so that the data does not reflect the natural variation and confounding factors that will be present in deployment. Overtraining can also result from poor tuning of the model. It can be particularly harmful because it leads to false optimism and premature deployment, and then to a visible failure that could easily have been avoided. Our AI engineers oversee and check customers' model configurations to prevent this unnecessary pitfall.
Case Study: Machine Health Monitoring System
(Names and details have been changed to protect the inexperienced.)
We recently had a client trying to build a machine health monitoring system for a refrigerant compressor. These compressors were installed in a system subject to rare leaks. The client wanted to detect in advance whether the refrigerant in the lines had dropped to a level that put the compressor at risk, before the loss caused damage, overheating, or a shutdown through some other mechanism. They were trying to do this with vibration data from a small device, containing a multi-axis accelerometer, mounted on the unit.
Limited Set of Data and Overtraining
Ideally, this client would have collected a variety of data with the same accelerometer under known conditions, including many examples of the compressor running across a range of normal load conditions and many examples of it running with low refrigerant across a similar variety of loads. They could then have used our algorithms and tools with confidence that the data contained a broad representation of the operating states of interest. The data would also have included normal variation as load and uncontrolled environmental factors changed, a range of different background noises, and enough samples that sensor and measurement noise were well represented. But all they had was 10 seconds of data from a normal compressor and 10 seconds from a compressor with low refrigerant, all collected in the lab.
Human Monitoring vs. Machine Learning Algorithm Monitoring
The limited data collected might be enough for an engineer to begin to understand the differences between the two states. A human engineer working in the lab might use domain knowledge about field conditions to work out how to detect those differences in general. But a machine learning algorithm knows only what it sees. It would find a perfect separation between the training examples and report 100% classification accuracy, but that result would never generalize to the real world.
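A rough sketch of why the perfect lab result misleads is shown below. Every number in it is invented for illustration and none comes from the client's data; the single "vibration RMS" feature, the midpoint threshold, and the assumption that heavier load raises vibration in both states are all hypothetical simplifications.

```python
# All values are made up for illustration; they are not measurements.
lab_normal = [1.00, 1.02, 0.98, 1.01]   # vibration RMS, one load setting
lab_low = [1.30, 1.28, 1.33, 1.31]      # same load, low refrigerant

def mean(values):
    return sum(values) / len(values)

# "Training": place the threshold halfway between the two class means.
threshold = (mean(lab_normal) + mean(lab_low)) / 2

def predict_low_refrigerant(rms):
    """Flag low refrigerant whenever the RMS exceeds the learned threshold."""
    return rms > threshold

def accuracy(normal, low):
    """Fraction of recordings from both classes that are classified correctly."""
    correct = sum(not predict_low_refrigerant(v) for v in normal)
    correct += sum(predict_low_refrigerant(v) for v in low)
    return correct / (len(normal) + len(low))

# Hypothetical field data: heavier loads raise vibration in BOTH states,
# a source of variation the lab recordings never contained.
field_normal = [1.05, 1.18, 1.24, 1.29]
field_low = [1.27, 1.36, 1.45, 1.52]

lab_accuracy = accuracy(lab_normal, lab_low)
field_accuracy = accuracy(field_normal, field_low)
print(f"lab accuracy:   {lab_accuracy:.3f}")
print(f"field accuracy: {field_accuracy:.3f}")
```

With these invented numbers the threshold separates the lab recordings perfectly, yet it classifies only 5 of the 8 field recordings correctly, because healthy compressors under heavy load cross a threshold that was learned at a single load.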
Solution for This Machine Learning Model
The most reliable approach is to include examples in the data of a full range of conditions for all the operational variations possible. It should include both normal and abnormal conditions. This allows the algorithms to learn by example and tune themselves to the most robust decision criteria.
Reality AI tools automatically do this by using a variety of methods for feature discovery and model selection. To help detect and avoid overtraining, our tools also test models with K-fold validation. This process repeatedly retrains the model, each time holding out a different portion of the training data for testing. This simulates how the model will behave in the field when it operates on new observations it was not trained on. K-fold accuracy is rarely as high as training separation accuracy, but it is a better indicator of likely real-world performance, at least to the degree that the training data is representative of the real world.
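The mechanics of K-fold validation can be sketched in a few lines. This is a generic, minimal illustration and not the Reality AI implementation: the synthetic data, the toy nearest-neighbor "model", and all function names are assumptions chosen to keep the example self-contained.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def predict_1nn(train, x):
    """Toy model: copy the label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def accuracy(train, test):
    """Fraction of test points the toy model labels correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

def k_fold_accuracy(samples, k, rng):
    """Average accuracy over k rounds, each tested on a held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(samples), k, rng):
        held = set(held_out)
        # The model is rebuilt each round from the folds that were not held out.
        train = [s for i, s in enumerate(samples) if i not in held]
        test = [samples[i] for i in held_out]
        scores.append(accuracy(train, test))
    return sum(scores) / len(scores)

rng = random.Random(0)
samples = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    # The label follows x only loosely, so memorizing it does not generalize.
    y = 1 if x + rng.gauss(0.0, 1.0) > 0 else 0
    samples.append((x, y))

training_accuracy = accuracy(samples, samples)
cross_validated_accuracy = k_fold_accuracy(samples, k=5, rng=rng)
print(f"training accuracy: {training_accuracy:.2f}")
print(f"5-fold accuracy:   {cross_validated_accuracy:.2f}")
```

Because the toy model memorizes its training points, it scores perfectly on the data it has already seen, while the held-out folds expose how much of that score was memorization. The K-fold figure is still only as trustworthy as the data is representative: if every fold comes from the same narrow lab conditions, they will all share the same blind spots.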