Symbolic Regression: The Forgotten Machine Learning Method

Turning data into formulas can result in simple but powerful models


Published in Towards Data Science · 4 min read · Nov 17, 2020


The goal of a regression model is very simple: take as input one or more numbers and output another number. There are many ways to do that, from simple to extremely complex.

The simplest case is that of linear regression: the output is a linear combination of the input variables, with coefficients chosen to minimize some training error. In many contexts, a simple model like this will be enough, but it will fail in cases where nonlinear relationships between the variables are relevant. In the strongly nonlinear world that we live in, this happens very often.
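As a concrete illustration (not part of the original article), here is a minimal least-squares fit of a straight line in Python; the use of NumPy and the synthetic data are assumptions of convenience:

```python
import numpy as np

# Synthetic data: y = 1.0 + 2.5*x plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.5 * x + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of y = a + b*x: build the design matrix [1, x]
# and solve for the coefficients that minimize the squared training error.
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"fitted model: y = {a:.2f} + {b:.2f}*x")
```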

On the other side of the spectrum of model complexity are black-box regressors like neural networks, which transform the input data through a series of implicit calculations before giving a result. Those models are very popular nowadays due to the promise that they will one day result in a general “artificial intelligence”, and due to their striking success in difficult problems like computer vision.

Here we want to discuss a middle ground between those two extremes that seems to not have received the attention that it deserves so far: symbolic regression.

A generalization of linear and polynomial regression is to search the space of all possible mathematical formulas for the ones that best predict the output variable from the input variables, starting from a set of base functions like addition, trigonometric functions, and exponentials. This is the basic idea of symbolic regression.
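To make the idea concrete, here is a toy sketch (mine, not taken from any symbolic regression library): formulas are represented as expression trees over a small set of base functions, evaluated recursively, and a crude random search keeps the candidate with the lowest error. Real tools replace the random search with something smarter, such as genetic programming or simulated annealing.

```python
import math
import random

# Base functions from which candidate formulas are built.
UNARY = {"sin": math.sin, "exp": math.exp}
BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def random_formula(depth=3):
    """Grow a random expression tree over the variable x and constants."""
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else round(random.uniform(-2, 2), 2)
    if random.random() < 0.5:
        return (random.choice(list(UNARY)), random_formula(depth - 1))
    return (random.choice(list(BINARY)),
            random_formula(depth - 1), random_formula(depth - 1))

def evaluate(node, x):
    """Recursively evaluate an expression tree at the point x."""
    if node == "x":
        return x
    if isinstance(node, float):
        return node
    if node[0] in UNARY:
        return UNARY[node[0]](evaluate(node[1], x))
    return BINARY[node[0]](evaluate(node[1], x), evaluate(node[2], x))

# Data generated from a hidden target relationship, y = sin(x) + 0.5*x.
xs = [i / 10 for i in range(1, 50)]
ys = [math.sin(x) + 0.5 * x for x in xs]

def mse(tree):
    """Mean squared error of a candidate formula on the data."""
    try:
        return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    except (OverflowError, ValueError):
        return float("inf")  # numerically ill-behaved formulas are rejected

best = min((random_formula() for _ in range(10000)), key=mse)
print(best, mse(best))
```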

In a symbolic regression optimization, it is important to discard a large formula when a smaller one with the same accuracy is encountered. This is necessary to avoid obviously redundant solutions like f(x) = x + 1 − 1 + 0 + 0 + 0, and also to avoid settling for a huge polynomial that fits the training data perfectly but generalizes poorly.
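Continuing the toy sketch above, one simple way to encode this preference is to penalize a formula's error by its size, so a larger tree only wins if its extra accuracy pays for the extra nodes. The penalty weight here is an arbitrary illustrative choice; real tools often maintain a whole Pareto front of accuracy versus size instead.

```python
def size(node):
    """Number of nodes in an expression tree, a crude complexity measure."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(size(child) for child in node[1:])

def score(tree, penalty=0.01):
    """Lower is better: error plus a per-node complexity penalty."""
    return mse(tree) + penalty * size(tree)
```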

This method was popularized in 2009 with the introduction of a desktop software called Eureqa [1], which used a genetic algorithm to search for relevant formulas. The software attracted wide attention with the promise that it could eventually be used to derive new laws of physics from empirical data, a promise that was never quite fulfilled. In 2017 Nutonian, the company behind Eureqa, was acquired by DataRobot, and the software left the market [2].

Recently, new symbolic regression tools have been developed, such as TuringBot [3], a desktop software for symbolic regression based on simulated annealing. The promise of deriving physical laws from data with symbolic regression has also been revived by a project called AI Feynman, led by the physicist Max Tegmark [4].
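To illustrate what simulated annealing means in this setting, here is a generic sketch continuing the code above; TuringBot's internals are not public, so this is only the textbook acceptance rule, not that tool's actual algorithm. A random mutation of the current formula is always accepted if it improves the penalized score, and sometimes accepted even when it does not, with a probability that shrinks as the temperature drops; this lets the search escape local optima early on and settle down later.

```python
def mutate(tree):
    """Replace a random subtree with a freshly grown one (illustrative move)."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_formula(depth=2)
    i = random.randrange(1, len(tree))  # pick a child branch to descend into
    return tree[:i] + (mutate(tree[i]),) + tree[i + 1:]

def anneal(steps=20000, t0=1.0):
    current = random_formula()
    current_score = score(current)
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9
        cand = mutate(current)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Accept improvements always; accept regressions with probability
        # exp(-delta / T), which decays as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_score = cand, cand_score
    return current

print(anneal())
```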

Despite the efforts to promote symbolic regression over the years, the truth is that this method has never gained mainstream popularity. In an academic context, research on hot topics like neural networks is much more tractable, since efficient and well-understood training algorithms, such as gradient descent, are available for those models. Symbolic regression is messier: the space of formulas is discrete and combinatorial, and searching it efficiently depends on ad hoc heuristics.

But this should not be a reason to disregard the method. Even though symbolic models are hard to generate, they have some very desirable characteristics. For starters, a symbolic model is explicit, making it explainable and offering insight into the data. It is also simple, since the optimization actively tries to keep formulas as short as possible, which reduces the chances of overfitting the data. From a technical point of view, a symbolic model is very portable and can be implemented in any programming language, without the need for complex data structures, as the sketch below illustrates.
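For instance, if a run produced the formula sin(x) + 0.5·x (a hypothetical output, used here only for illustration), deploying it is a one-line function in any language:

```python
import math

def model(x):
    # The discovered formula is itself the model: no framework,
    # weights file, or special runtime is needed to deploy it.
    return math.sin(x) + 0.5 * x
```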

Perhaps Eureqa’s glib promise of uncovering laws of physics with symbolic regression will never be fulfilled, but it could well be the case that many machine learning models deployed today are more complex than necessary, going to great lengths to do something that a simple mathematical formula could do equally well. This is particularly true for problems in a small number of dimensions; symbolic regression is unlikely to be useful for problems like image classification, which would require enormous formulas over millions of input variables. A shift to explicit symbolic models could bring to light many hidden patterns in the sea of datasets that we have at our disposal today.

[1] Schmidt M., Lipson H. (2009) “Distilling Free-Form Natural Laws from Experimental Data”, Science, Vol. 324, no. 5923, pp. 81–85.

[2] DataRobot Acquires Nutonian (2017)

[3] TuringBot: Symbolic Regression Software (2020)

[4] Udrescu S.-M., Tegmark M. (2020) “AI Feynman: A physics-inspired method for symbolic regression”, Science Advances, Vol. 6, no. 16, eaay2631
