Turning data into formulas can result in simple but powerful models
Nov 17, 2020
The goal of a regression model is very simple: take as input one or more numbers and output another number. There are many ways to do that, from simple to extremely complex.
The simplest case is that of linear regression: the output is a linear combination of the input variables, with coefficients chosen to minimize some training error. In many contexts, a simple model like this will be enough, but it will fail in cases where nonlinear relationships between the variables are relevant. In the strongly nonlinear world that we live in, this happens very often.
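As a minimal sketch of this limitation, the snippet below fits a straight line by least squares to two synthetic datasets, one truly linear and one given by a sine. The data, the helper name `fit_linear`, and the choice of mean squared error as the training error are assumptions made for illustration only.

```python
import numpy as np

# Synthetic inputs (an assumption for illustration).
x = np.linspace(-3.0, 3.0, 200)
y_linear = 2.0 * x + 1.0        # a truly linear relationship
y_nonlinear = np.sin(2.0 * x)   # a strongly nonlinear relationship

def fit_linear(x, y):
    """Least-squares fit of y ~ a*x + b; returns the coefficients and the MSE."""
    X = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the squared error
    mse = float(np.mean((X @ coef - y) ** 2))
    return coef, mse

coef_lin, mse_lin = fit_linear(x, y_linear)
coef_non, mse_non = fit_linear(x, y_nonlinear)

print("linear data:    coefficients", coef_lin, "MSE", mse_lin)
print("nonlinear data: coefficients", coef_non, "MSE", mse_non)
```

The fit to the linear data is essentially exact, while no choice of slope and intercept can bring the error on the sine data close to zero.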
On the other side of the spectrum of model complexity are black-box regressors like neural networks, which transform the input data through a series of implicit calculations before giving a result. Those models are very popular nowadays due to the promise that they will one day result in a general “artificial intelligence”, and due to their striking success in difficult problems like computer vision.
Here we want to discuss a middle ground between those two extremes that seems not to have received the attention it deserves so far: symbolic regression.
A generalization of linear or polynomial regression is to search over the space of all possible mathematical formulas for the ones that best predict the output variable from the input variables, building the formulas from a set of base functions like addition, trigonometric functions, and exponentials. This is the basic idea of symbolic regression.
In a symbolic regression optimization, it is important to discard a large formula if a smaller one with the same accuracy is encountered. This is necessary to avoid obviously redundant solutions like f(x) = x + 1 - 1 + 0 + 0 + 0, and also to avoid settling for a huge polynomial that merely reproduces the training data with 100% accuracy.
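The idea can be sketched with a toy exhaustive search. This is not how Eureqa or any other tool works: it simply enumerates every formula up to a given number of nodes, smallest first, and replaces the current best only when a candidate is strictly more accurate, so a larger formula with the same accuracy is never kept. The base functions, the constants, the target function, and all names here are assumptions chosen for illustration.

```python
import itertools
import math

# Base functions (an illustrative choice, not a standard set).
UNARY = {"sin": math.sin, "cos": math.cos}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def expressions(size):
    """Yield (text, function) pairs for every formula with exactly `size` nodes."""
    if size == 1:
        yield "x", lambda x: x
        yield "1", lambda x: 1.0
        yield "2", lambda x: 2.0
        return
    # A unary function applied to a formula with one node fewer.
    for name, f in UNARY.items():
        for text, g in expressions(size - 1):
            yield f"{name}({text})", (lambda x, f=f, g=g: f(g(x)))
    # A binary operator joining two smaller formulas.
    for name, f in BINARY.items():
        for left_size in range(1, size - 1):
            lefts = list(expressions(left_size))
            rights = list(expressions(size - 1 - left_size))
            for (lt, lf), (rt, rf) in itertools.product(lefts, rights):
                yield (f"({lt} {name} {rt})",
                       (lambda x, f=f, lf=lf, rf=rf: f(lf(x), rf(x))))

def mse(f, xs, ys):
    """Mean squared error of formula f on the data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def search(xs, ys, max_size=6, tolerance=1e-12):
    """Return (error, size, text) of the best formula, preferring smaller ones."""
    best = None
    for size in range(1, max_size + 1):      # smallest formulas first
        for text, f in expressions(size):
            err = mse(f, xs, ys)
            # Keep a candidate only if it is strictly more accurate.
            if best is None or err < best[0] - tolerance:
                best = (err, size, text)
    return best

# Synthetic data from a hidden formula (an assumption for illustration).
xs = [i * 0.5 - 3.0 for i in range(13)]
ys = [x * x + math.sin(x) for x in xs]

error, size, formula = search(xs, ys)
print(formula, "size:", size, "error:", error)
```

Exhaustive enumeration grows exponentially with formula size, which is why practical tools rely on heuristic searches such as genetic algorithms or simulated annealing instead.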
This method was popularized in 2009 with the introduction of a desktop application called Eureqa [1], which used a genetic algorithm to search for relevant formulas. The software gained notoriety with the promise that it could eventually be used to derive new laws of physics from empirical data, a promise that was never quite fulfilled. In 2017 Nutonian, the company behind Eureqa, was acquired by DataRobot, and Eureqa left the market [2].
Recently, new symbolic regression tools have been developed, such as TuringBot [3], a desktop program for symbolic regression based on simulated annealing. The promise of deriving physical laws from data with symbolic regression has also been revived by a project called AI Feynman, led by the physicist Max Tegmark [4].
Despite the efforts to promote symbolic regression over the years, the truth is that this method has never gained mainstream popularity. In an academic context, research on hot topics like neural networks is much more tractable, given that efficient and well-understood algorithms, such as gradient descent with backpropagation, are known for training those models. Symbolic regression is just messier and often depends on ad hoc heuristics to work efficiently.
But this should not be a reason to disregard the method. Even though symbolic models are hard to generate, they have some very desirable characteristics. For starters, a symbolic model is explicit, making it explainable and offering insight into the data. It is also simple, given that the optimization actively tries to keep the formulas as short as possible, which may reduce the chances of overfitting the data. From a technical point of view, a symbolic model is very portable and can be easily implemented in any programming language, without the need for complex data structures.
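To illustrate the portability point, here is a sketch of what deploying a symbolic model can look like. The formula is hypothetical, made up for this example rather than obtained from any real dataset or tool, and the function name `predict` is likewise an assumption.

```python
import math

def predict(x1, x2):
    """Evaluate a hypothetical symbolic model (formula invented for illustration)."""
    return 0.5 * x1 * x1 + math.sin(x2)

print(predict(2.0, 0.0))
```

The whole model is a single arithmetic expression, so porting it to another language amounts to rewriting one line, with no weights file or model-loading code involved.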
Perhaps Eureqa’s glib promise of uncovering laws of physics with symbolic regression will never be fulfilled, but it could well be the case that many machine learning models deployed today are more complex than necessary, going to great lengths to do something that could be equivalently done by a simple mathematical formula. This is particularly true for problems in a small number of dimensions. Symbolic regression is unlikely to be useful for problems like image classification, which would require enormous formulas with millions of input variables. A shift to explicit symbolic models could bring to light many hidden patterns in the sea of datasets that we have at our disposal today.
[1] Schmidt M., Lipson H. (2009) “Distilling Free-Form Natural Laws from Experimental Data”, Science, Vol. 324, no. 5923, pp. 81–85.
[2] DataRobot Acquires Nutonian (2017)
[3] TuringBot: Symbolic Regression Software (2020)
[4] Udrescu S.-M., Tegmark M. (2020) “AI Feynman: A physics-inspired method for symbolic regression”, Science Advances, Vol. 6, no. 16, eaay2631.