Julian Miller series: Optimisation and learning


Previous posts discussed human and machine learning. We could consider machine learning as the activity of designing algorithms that extract useful information from data. Nowadays, that design often amounts to parametrising a model or providing data relevant to finding useful information. However, our definition suggests several other key elements, i.e., algorithms, data, and useful information. Algorithms were discussed in one of my earlier posts. Without data there would be little to compute, but useful information is worth considering in its own right.

How can we measure usefulness?

I suppose it is a challenging idea for humans and other animals. If something is useful, then we will have explored, tried, and tested it in practical ways against a specific purpose. Let’s take the purpose of brewing tea or coffee. Both require some tool to heat water to a suitable temperature. Many of us have probably experimented with tea makers, kettles, pans, electric cafetières, stovetop pots, or even machines that use pods. Through those experiences, and by tasting the beverages, we established the usefulness of some of the tools.

So, to define the usefulness of anything, we need to compare it against other options. The act of comparing relies on criteria and goals, and it provides a measure of suitability against a goal. This comparison is the key element of defining usefulness.

How can machines measure the usefulness of information?

That is quite a good question, but not an easy one to answer. Information is the outcome of processing some data. Supervised machine learning relies on labelled data to measure the quality of the useful information found. One approach computes the distance between the predicted values and the labels. This is the option preferred by deep learning and random forests.
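
As a minimal sketch of that idea (the function name and the toy numbers are mine, not taken from any particular library), the distance between predictions and labels can be measured with a mean squared error:

    import numpy as np

    def mean_squared_error(predicted, labelled):
        # Average of the squared differences between predictions and labels.
        predicted = np.asarray(predicted, dtype=float)
        labelled = np.asarray(labelled, dtype=float)
        return float(np.mean((predicted - labelled) ** 2))

    # Toy example: the smaller the value, the closer the predictions are to the labels.
    print(mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))

The smaller the result, the more faithfully the model has reproduced the labelled data.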

Cartesian Genetic Programming relies on known minima (i.e., solutions considered good against some well-defined goals). Often a learning objective, or fitness function, is explicitly defined and encoded for a specific goal.
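
To make that concrete, here is a hedged sketch of such a fitness function; the candidate programs and test cases are hypothetical, not taken from any CGP implementation:

    def fitness(candidate_program, test_cases):
        # test_cases is a list of (inputs, expected_output) pairs.
        # Lower fitness means the candidate is closer to the well-defined goal.
        total_error = 0.0
        for inputs, expected in test_cases:
            output = candidate_program(*inputs)  # run the candidate solution
            total_error += abs(output - expected)
        return total_error

    # Toy goal: the function x + y. A perfect candidate scores 0.
    cases = [((1, 2), 3), ((2, 5), 7), ((0, 4), 4)]
    print(fitness(lambda x, y: x + y, cases))  # 0.0
    print(fitness(lambda x, y: x * y, cases))  # larger, i.e. worse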

The k-means clustering algorithm uses Euclidean distances to establish the suitability of clusters and their centres. Strictly speaking, k-means is unsupervised, whereas the previous techniques are supervised, but all of them apply statistical methodologies or probabilistic models. And all of these methods rely on human input to guide and establish the idea of useful information, i.e., the outcome of an algorithm.
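
A rough sketch of the quantity being minimised (toy points and function name of my own, not from any library): the total distance between the points of a cluster and the cluster’s centre.

    import numpy as np

    def within_cluster_distance(points, centre):
        # Sum of Euclidean distances from each point to the cluster centre.
        points = np.asarray(points, dtype=float)
        centre = np.asarray(centre, dtype=float)
        return float(np.sum(np.linalg.norm(points - centre, axis=1)))

    # Toy cluster: using the mean of the points as the centre keeps the total distance small.
    cluster = [[1.0, 1.0], [1.5, 2.0], [0.5, 1.5]]
    print(within_cluster_distance(cluster, np.mean(cluster, axis=0)))
    print(within_cluster_distance(cluster, [5.0, 5.0]))  # a poor centre scores much worse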

So any machine learning engineer must be aware that human bias brought into the labelled datasets used for training is likely to distort predictions on unseen data. It is worth statistically analysing those datasets before any model-fitting exercise to identify human bias added to the data.
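
A simple, hedged example of such a check (the column names and values are purely illustrative) is to compare how the labels are distributed overall and within each group before fitting anything:

    import pandas as pd

    # Hypothetical labelled dataset.
    data = pd.DataFrame({
        "group": ["A", "A", "A", "B", "B", "B"],
        "label": [1, 1, 1, 1, 0, 0],
    })

    # Overall label balance, then label balance per group: a large gap between
    # groups can hint at bias introduced during labelling or data collection.
    print(data["label"].value_counts(normalize=True))
    print(data.groupby("group")["label"].mean())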

Let’s optimise the level of useful information

Let’s imagine we have a set of alternative options. We also have a set of criteria that defines the requirements to be met. These criteria will help us identify whether an option is suitable.

The human approach may be pragmatic. We could consider the alternative options and measure them in terms of our feelings; positive feelings would suggest an alternative is fit for purpose.

Mathematical optimisation uses a set of criteria. Positive features of a solution are rewarded with low values, while features perceived as negative are penalised with large values. The solution that scores the lowest possible value is the best solution. We refer to it as the global optimum.
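
A minimal sketch of that scoring idea, with a made-up cost function and a handful of candidate options (all names and numbers are invented for illustration):

    def cost(option):
        # Hypothetical criteria: a high price raises the score (bad),
        # while high quality lowers it (good).
        return option["price"] - 2.0 * option["quality"]

    options = [
        {"name": "kettle", "price": 20.0, "quality": 9.0},
        {"name": "pan", "price": 10.0, "quality": 3.0},
        {"name": "pod machine", "price": 90.0, "quality": 9.0},
    ]

    # The lowest score wins; it is only a global optimum with respect to
    # the options and criteria we have actually written down.
    best = min(options, key=cost)
    print(best["name"], cost(best))  # kettle, 2.0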

Statistical methodologies such as k-means minimise the distance between all the points in a cluster: the shorter the total Euclidean distance, the better the solution. Deep learning and random forests reduce the error relative to known labels. For example, if many customers purchase books from a certain author together with some other items, we can start to identify purchase patterns between items. That useful information can help us improve our offers as well as our marketing. Why do you think most supermarkets have reward points and personalised offers? However, this is only possible with extremely large datasets that capture observations from the real world. Otherwise, these techniques find it challenging to classify and predict outcomes. We discussed some of these issues in this post.
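
As a toy sketch of the purchase-pattern idea (invented baskets and simple co-occurrence counts, not a production recommender):

    from collections import Counter
    from itertools import combinations

    # Hypothetical shopping baskets.
    baskets = [
        {"author_x_novel", "coffee pods", "mug"},
        {"author_x_novel", "coffee pods"},
        {"author_x_novel", "tea", "mug"},
        {"bread", "milk"},
    ]

    # Count how often pairs of items appear in the same basket.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The most frequent pairs hint at patterns worth targeting with offers.
    print(pair_counts.most_common(3))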

CGP and other forms of GP minimise the design of algorithms or neural networks against a set of statistical measures. Traditionally that measure has been the mean, chosen with the No Free Lunch theorem in mind. Personally, I prefer performance-based criteria combined with the coefficient of variation. The training dataset includes instances of a problem domain; the testing dataset contains unseen instances. So the outcome can be the solutions to an optimisation problem, or it can be the algorithm that found those solutions. CGP and GP can augment intelligence by providing evidence of methodologies that positively affect the problem-solving process. If they are human readable, then we can reuse them. Useful information becomes two elements: the solver and the solution.
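
A brief sketch of the kind of measure I have in mind (the cost values are invented): score a solver across several training instances with both the mean and the coefficient of variation, so that consistency counts as well as raw performance.

    import numpy as np

    def solver_score(costs_per_instance):
        # costs_per_instance: the cost reached by one solver on each training instance.
        costs = np.asarray(costs_per_instance, dtype=float)
        mean_cost = costs.mean()
        # Coefficient of variation: how consistent the solver is across instances.
        coefficient_of_variation = costs.std() / mean_cost
        return mean_cost, coefficient_of_variation

    # Two hypothetical solvers: similar mean cost, very different consistency.
    print(solver_score([10.0, 11.0, 9.5, 10.5]))
    print(solver_score([2.0, 25.0, 1.0, 13.0]))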

The descent towards a minimum

Gradients measure the rate of change between two states. For example, when we boil some water and measure its temperature at two moments, the rate of change between the two measurements indicates the steepness of the transformation.
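
In code, that rate of change is just a finite difference (the temperature readings below are invented):

    def rate_of_change(value_before, value_after, time_elapsed):
        # Change in the measured quantity divided by the time between measurements.
        return (value_after - value_before) / time_elapsed

    # Water heated from 20 to 80 degrees Celsius in 120 seconds.
    print(rate_of_change(20.0, 80.0, 120.0))  # 0.5 degrees per second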

Machine learning algorithms and some model-fitting statistical methodologies rely on gradients too. The rate of change before and after a transformation, whether that is (1) adjusting the weights of a neural network, (2) updating the clusters in k-means, or (3) mutating the order of operations of a solver (an algorithm), can indicate whether the change reduced the objective. Ideally, we are looking for a negative rate.
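
A hedged sketch of that accept-if-it-improves logic, with a made-up cost function and a random mutation; the same pattern applies whether the change is to weights, clusters, or a mutated solver:

    import random

    def cost(x):
        # Made-up objective: a simple bowl with its minimum at x = 3.
        return (x - 3.0) ** 2

    def try_mutation(x, step=0.5):
        # Propose a small random change and keep it only if the cost went down,
        # i.e. only if the rate of change of the cost is negative.
        candidate = x + random.uniform(-step, step)
        return candidate if cost(candidate) < cost(x) else x

    x = 0.0
    for _ in range(200):
        x = try_mutation(x)
    print(x, cost(x))  # x drifts towards 3, the minimum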

So, in optimisation we hope each step can lower the current minimum and find a new one, i.e., a better solution. With a neural network we hope to reduce the error between the predicted and labelled data. In k-means, we hope to reduce the total Euclidean distance within each cluster after each iteration. In CGP, we hope to find improved solvers at each iteration. However, some of the solvers, weights, or clusters may not be optimal, and may sometimes be worse. That is fine as long as we continue to search until we find a suitable optimum or near-optimum.
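
For completeness, a minimal gradient-descent sketch on the same kind of made-up bowl-shaped cost (step size and iteration count chosen arbitrarily), in the spirit of the reference below:

    def cost(x):
        return (x - 3.0) ** 2

    def gradient(x):
        # Derivative of the cost above.
        return 2.0 * (x - 3.0)

    x = 0.0
    learning_rate = 0.1
    for _ in range(100):
        x = x - learning_rate * gradient(x)  # step against the gradient
    print(x, cost(x))  # converges towards the minimum at x = 3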

https://en.wikipedia.org/wiki/Gradient_descent

To conclude

Learning relies on optimisation. We optimise against labelled datasets using potential measures of useful information. The latter can take the form of clusters, probabilities, solvers, and solutions to a problem. Without a set of criteria, letting machines learn becomes a challenging task; they need a set of constraints to find solutions. Otherwise, they may obliterate the world.

