No one should have confidence in a model without looking at the data…


Tom cat has yet to catch Jerry mouse; instead, he has achieved high accuracy against the training data. Unseen data brings a nasty surprise – the model achieves only a very low validation accuracy. Is this the fault of the choice of classifier/predictor, or an issue with the data?

My own sinking

I got caught up in the Titanic ML challenge. I prepared my data and chose my prediction techniques, then applied them to unseen testing datasets – see this Kaggle notebook. The results were unimpressive: around 70% accuracy on the provided testing datasets, with approximately 240 passengers mispredicted. I measured the fitness of these models against training data with known observations, including survival status; correct predictions of surviving the accident fed into accuracy, loss and other metrics.
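
For readers who want to reproduce this kind of check, here is a minimal sketch, assuming a fitted scikit-learn classifier called model and a held-out frame valid with a known Survived column (both names are illustrative, not taken from the notebook):

    from sklearn.metrics import accuracy_score, log_loss

    # `valid` and `model` are assumed to exist; names are hypothetical.
    X_valid = valid.drop(columns=["Survived"])  # features only
    y_valid = valid["Survived"]                 # known survival status

    y_pred = model.predict(X_valid)        # hard class labels
    y_prob = model.predict_proba(X_valid)  # class probabilities, for log loss

    print("accuracy:", accuracy_score(y_valid, y_pred))
    print("log loss:", log_loss(y_valid, y_prob))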

The Titanic ML challenge is a toy problem, yet I failed to obtain decent results by applying the general rules. Other published attempts have predicted survival or death perfectly, while some of my models could have picked up rather different patterns in the testing datasets.

Nonetheless, I rediscovered that first-class passengers and women were more likely to survive the accident. I identified (1) gender (31%), (2) age (25%), (3) fare (19%), and (4) class (13%) as candidate explanatory variables, with the percentages indicating their relative importance.
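
Percentages like these can be read off a tree ensemble's impurity-based feature importances. A hedged sketch, assuming scikit-learn and illustrative X_train/y_train variables rather than the notebook's actual code:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # X_train/y_train are assumed to hold the prepared features and labels.
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Impurity-based importances sum to 1.0; scale them to percentages.
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    print((importances * 100).round(1).sort_values(ascending=False))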

Some mispredictions reveal that some single male passengers survived, while some women travelling with a spouse or children did not. The misclassified passengers are mostly women, and they appear to be older than those classified correctly.
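
One way to explore such misclassifications, sketched under the same assumptions as above (a fitted model and an X_valid/y_valid split; the column names are the usual Kaggle ones and may differ from the notebook):

    # Split the validation passengers by whether the model got them right.
    correct_mask = model.predict(X_valid) == y_valid
    mis = X_valid[~correct_mask]
    ok = X_valid[correct_mask]

    # Compare age and gender profiles of the two groups.
    print("median age (misclassified):", mis["Age"].median())
    print("median age (correct):", ok["Age"].median())
    print(mis["Sex"].value_counts(normalize=True))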

Across several classes, the misclassified passengers show lower expected fares and a more compact fare spread. Further analysis suggests a weak negative linear relationship between fare and passenger class, an observation that may add complexity to learning to differentiate the classes.
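
That relationship can be quantified with a simple correlation; a sketch, assuming a train DataFrame with the usual Fare and Pclass columns:

    # Pearson correlation between fare paid and passenger class (1st-3rd).
    corr = train["Fare"].corr(train["Pclass"], method="pearson")
    print(f"fare/class correlation: {corr:.2f}")
    # A modestly negative value matches the weak inverse relationship noted
    # above: better classes (lower Pclass numbers) tend to pay higher fares.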

Overall, I am somewhat satisfied with the outcome. Some interesting patterns within the data were discovered or rediscovered, and I applied many advanced statistical methodologies and compared their results, strengths and limitations against the data. I also identified data features that may guide further exploration. As a next step, I would combine the testing and training datasets and impute missing values with missForest or MICE techniques. That may reduce the gap between the two datasets and improve training and prediction – although such models may only generalise to the Titanic datasets, which would be disappointing.
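
A sketch of that imputation idea using scikit-learn's experimental IterativeImputer, which follows the MICE approach and, with a tree-based estimator, approximates missForest (train and test are assumed DataFrames of numeric features; this is an illustration, not the notebook's code):

    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import ExtraTreesRegressor

    # `train` and `test` are assumed numeric DataFrames with shared columns.
    combined = pd.concat([train, test], axis=0)  # pool both datasets

    # Each feature with gaps is modelled from the others, iteratively.
    imputer = IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=100, random_state=42),
        max_iter=10,
        random_state=42,
    )
    imputed = pd.DataFrame(imputer.fit_transform(combined),
                           columns=combined.columns)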

Cleaner data, cleaner air, better predictions

I am happier to report more success with air pollution data captured across Berlin. The notebook works with more substantial datasets, including observations repeated within seconds at the same locations. That gives a model-fitting process many more opportunities to identify patterns and features for the various classes of air quality measurement.

The data engineering applied statistical normalisation and related techniques to prepare the data for ensemble and tree-based ML algorithms. Prediction accuracy on the training and testing datasets is 94% and 93%, respectively. Some misclassification occurs between the two low-pollution classes and some of the high-pollution classes, so there is room for improvement. Nonetheless, the model-fitting processes distinguish suitably between good and worse air quality. With some refinements to the Berlin models, we could adapt the process to other cities where similar data is captured. However, validation by air pollution specialists should tell us whether such models are suitable for a given application.
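
The overall shape of such a process, sketched with placeholder X/y variables rather than the notebook's actual features and air-quality labels:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # X/y are assumed placeholders for the engineered features and labels.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Normalisation feeds a tree-based ensemble inside one pipeline.
    pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier())
    pipe.fit(X_train, y_train)

    print("train accuracy:", pipe.score(X_train, y_train))
    print("test accuracy:", pipe.score(X_test, y_test))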

To conclude


Tom cat – all of us – may need to start by understanding the data, with its features and imperfections. Instead of applying predictive, classification and other techniques blindly, we must match the data against advanced statistical methodologies/ML and validate and explain their strengths and limitations. Otherwise, we may not augment our intelligence but reduce it instead.
