Gabriele Gianini & Luigi Guzzo
Università degli Studi di Milano
Artificial Intelligence (AI) covers a wide range of techniques that aim to emulate human intelligence through computer algorithms. Symbolic AI was the dominant paradigm between the 1950s and the 1980s: it focused on high-level symbolic (human-readable) representations, predominantly using deductive reasoning to carry out specific inferential tasks. A more recently developed branch of AI – data-driven and mostly sub-symbolic – is Machine Learning (ML). ML predominantly uses inductive methods: the algorithms learn directly from the data how to solve a given inferential task. In practice, this means using training examples to obtain an efficient task-oriented representation of the relevant objects, and using that as a basis for prediction, which can come in the form of numerical values (regression) or discrete nominal labels (classification). More complex tasks can consist of learning to assign “structured labels” (e.g., the spatial configuration of a molecule, the syntactic graph of a sentence, an algebraic equation, a textual description for an input picture).
The ability to learn an efficient representation (representation learning) is often the key to the success of a task. For instance, in the task of telling apples from pears – based on labelled colour photos of the two kinds of fruit – an algorithm could find that it is efficient to focus on the minimum-size enclosing spheroid, learning that prolate spheroids are most often associated with pears. This would disregard the abundant information on colour, learning a purely geometric representation in which the parameter space is low-dimensional.
ML tasks are typically divided into supervised and unsupervised. In the former, objects have labels that one aims to predict, and the model is trained using examples where the true label is known. Regression and classification are typical examples. This is not the case with unsupervised tasks, where the algorithm finds a way to organize the objects so as to optimize some criterion. A notable example is clustering, in which objects are organized in groups according to some attribute. In VIPERS, for example, we used Self-Organising Maps (SOMs) to group galaxies into different redshift groups according to their colours (Cagliari+ 2022). Another example of an unsupervised technique is Principal Component Analysis (PCA): this is a form of representation learning, which finds an orthogonal basis to represent the input data and allows one to rank the components so as to select the dominant ones. For example, we used PCA to express VIPERS spectra as linear combinations of a “basis” of a few (<5) “eigen-spectra” (Marchetti+ 2017).
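As an illustration of the PCA idea described above, the following is a minimal sketch in Python with NumPy. It does not reproduce the VIPERS analysis: the "spectra" are synthetic random mixtures of three hidden templates, and all names (`templates`, `eigen_spectra`, `weights`) are hypothetical choices made for this example. The principal components are obtained here from a singular value decomposition of the centred data, which is one common way to compute them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a set of spectra (NOT real survey data): 200 "spectra"
# with 50 wavelength bins, each a random mix of 3 hidden templates plus noise.
n_spectra, n_bins, n_templates = 200, 50, 3
templates = rng.normal(size=(n_templates, n_bins))
coeffs = rng.normal(size=(n_spectra, n_templates))
spectra = coeffs @ templates + 0.01 * rng.normal(size=(n_spectra, n_bins))

# Centre the data, then obtain the principal components from the SVD.
mean_spectrum = spectra.mean(axis=0)
centred = spectra - mean_spectrum
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

# The rows of Vt form an orthonormal basis ("eigen-spectra"), ranked by the
# fraction of the total variance that each component explains.
explained = S**2 / np.sum(S**2)

# Keep only the k dominant components and rebuild each spectrum from them.
k = 3
eigen_spectra = Vt[:k]
weights = centred @ eigen_spectra.T            # shape (n_spectra, k)
reconstructed = mean_spectrum + weights @ eigen_spectra

rel_error = np.linalg.norm(spectra - reconstructed) / np.linalg.norm(spectra)
print("variance explained by the first", k, "components:", explained[:k].sum())
print("relative reconstruction error:", rel_error)
```

Each spectrum is thus summarised by only `k` numbers (its row in `weights`), which is the low-dimensional representation that PCA learns.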
When considering which algorithm is best suited to our purposes, it is important to realise that a given task, say “dichotomic classification”, can be accomplished using several different ML algorithms. This is often unclear to the non-expert, who can be confused by the wide variety of algorithms developed over roughly sixty years of growth in this field. It is customary to distinguish between Shallow ML algorithms and Deep Learning algorithms, the latter using many parameters in the representation (of the order of thousands or millions). Training the latter models is possible thanks to special training techniques developed in the last fifteen years (Bengio+ 2016).
The spectrum of techniques ranges from the simple Decision Tree algorithms to the now renowned Deep Learning (DL) Neural Networks. In a Decision Tree applied, say, to a dichotomic classification task, one scans the different attributes of the objects at hand and splits the set of examples based on the attribute values that maximise the separation of the two labels; repeating this process recursively on the subsets thus obtained, one builds a decision graph. DL Neural Networks belong to an extensively studied and remarkably effective family of algorithms, the neural or connectionist algorithms, a.k.a. Artificial Neural Networks (ANNs). These create complex architectures by suitably connecting elementary processing blocks, the artificial neurons: the optimal parameters of a whole ANN are learned by searching the parameter space for an array that minimizes a loss function (in a supervised task, a measure of the distance between the values predicted by the algorithm and the actual values of the training examples). The most common way to achieve parameter optimization is what is known as Gradient Descent search, in which the surface of the loss function in parameter space is explored in search of the minimum. Among the simplest versions of ANNs are the layered architectures known as (Feed-Forward) Multilayer Perceptrons (MLPs). These consist of layers of neurons, in which the input to a neuron in each layer is a linear combination of the outputs from all neurons in the previous layer. The parameters to be optimised are the weights of the linear combination, and the Gradient Descent of the loss function is performed through a Back-Propagation Algorithm, which propagates backward, layer by layer, a tentative parameter correction computed from the prediction error.
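The Gradient Descent and Back-Propagation steps described above can be sketched with a very small MLP written directly in NumPy. This is only an illustrative sketch under stated assumptions: the task (the XOR toy problem), the layer sizes, the tanh and sigmoid activations, the cross-entropy loss, the learning rate and the number of steps are all choices made for this example, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dichotomic classification task (XOR): the two labels are not linearly
# separable, so at least one hidden layer is needed.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# One hidden layer of 8 tanh neurons and one sigmoid output neuron.
W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

def forward(inputs):
    """Each neuron receives a linear combination of the previous layer's outputs."""
    h = np.tanh(inputs @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return h, p

def cross_entropy(p, target):
    eps = 1e-12
    return -np.mean(target * np.log(p + eps) + (1 - target) * np.log(1 - p + eps))

learning_rate = 0.5
initial_loss = cross_entropy(forward(X)[1], y)

for step in range(5000):
    h, p = forward(X)
    # Backward pass: the prediction error is propagated layer by layer.
    d_out = (p - y) / len(X)                  # gradient at the output pre-activation
    grad_W2 = h.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * (1.0 - h**2)     # through the tanh derivative
    grad_W1 = X.T @ d_hid
    grad_b1 = d_hid.sum(axis=0)
    # Gradient Descent update: move the weights against the gradient.
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

final_loss = cross_entropy(forward(X)[1], y)
print("loss before training:", initial_loss, "after training:", final_loss)
```

In practice one would rely on an automatic-differentiation library rather than hand-written gradients; the explicit version is shown only to make the backward propagation of the error visible.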
A successful technique used in Deep Learning is based on the separation of the representation learning phase from the prediction learning phase (based on the learned representation) and relies on special ANNs called AutoEncoders (AEs). These are multilayer ANNs characterized by a special training procedure and a symmetric input-encoding-output architecture. They learn by self-supervision, i.e., they try to replicate the input example at the output: each input record is used as its own structured label. The minimal architecture of an AE consists of an input layer, a hidden (encoding) layer, and an output layer. After the training, the parameters are frozen, and the last layer(s), following the central (encoding) one, are discarded. What is left is a function that maps the input into a different representation, often endowed with desirable properties.
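A minimal sketch of the AE procedure just described is given below, assuming the simplest possible case: a linear encoder and a linear decoder without biases, trained by plain gradient descent on the mean squared reconstruction error. The synthetic data, the layer sizes and the names `W_enc`, `W_dec` and `encode` are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative only): 300 records with 6 features that lie
# close to a 2-dimensional subspace.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

n_in, n_code = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))   # input layer -> encoding layer
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))   # encoding layer -> output layer

def reconstruction_loss(data):
    return np.mean((data @ W_enc @ W_dec - data) ** 2)

initial_loss = reconstruction_loss(X)
learning_rate = 0.02
for step in range(3000):
    code = X @ W_enc
    error = code @ W_dec - X                 # self-supervision: the input is its own label
    grad_out = 2.0 * error / error.size
    grad_W_dec = code.T @ grad_out
    grad_W_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_W_dec
    W_enc -= learning_rate * grad_W_enc
final_loss = reconstruction_loss(X)

# After training, the parameters are frozen and the decoder is discarded:
# what remains is a map from the input to the learned representation.
def encode(data):
    return data @ W_enc

codes = encode(X)
print("reconstruction loss before:", initial_loss, "after:", final_loss)
print("shape of the encoded data:", codes.shape)
```

The array `codes` could then serve as the input of a separate prediction model, which is the separation between representation learning and prediction learning mentioned above.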
In recent years, Deep Learning neural models have achieved performance comparable or superior to that of humans at some decision and prediction tasks. Most often, however, the representations learned by neural models and the reasoning path leading to a prediction are not human-readable or interpretable, as opposed to what happens with Symbolic AI (hence the “black-box” feeling that the non-expert often experiences when approaching ML applications).
This has led in recent years to the rise of the research field of eXplainable Artificial Intelligence (XAI), which has developed methods of several kinds to address the different forms that interpretability can take. Among them are methods for the design of transparent boxes (models endowed with decomposability and algorithmic transparency), and methods for the post-hoc explanation of black boxes. Among those models that are transparent (i.e., “white boxes”) by design – and that can either be learned directly from the data, or be used as surrogate models for post-hoc explanation – are the following: Decision Rules, Decision Trees, Bayesian Models including Bayesian Belief Networks, Linear Models, and Case-Based Reasoning.
A rising discipline related to XAI is Neuro-Symbolic AI (a.k.a. Neuro-Symbolic Computing): it aims to integrate the two most fundamental cognitive abilities, namely the ability to learn from experience (developed within ML) and the ability to reason from what has been learned (developed in mathematics, logic and early AI). It proceeds through the combination of ANNs (including DL and embedding-based methods) with symbolic methods relying on symbolic reasoning and on explicit symbolic knowledge representations. The latter allow expert or world knowledge to be represented formally, and can take the form of algebraic equations, differential equations, spatial invariances, logic rules, probabilistic relations, simulation results, ontologies, knowledge bases, and graphs, to name the most prominent.
A key point regarding all ML algorithms concerns their generalization capabilities: it is obvious that by building models with a sufficiently high number of parameters it is possible to fit any data set; however, if the number of parameters is too high, the learned model might simply capture the peculiarities specific to the training dataset (as if learning by heart), and later perform very poorly when new test examples (not seen during the training) are provided for prediction. This phenomenon is called overfitting and it causes high variance in the test phase prediction (the opposite phenomenon, underfitting, is due to too small a number of parameters, which makes the model “too stiff” and typically introduces a bias in the test predictions). To avoid overfitting, and find a suitable bias-variance trade-off, one must find appropriate values for the hyperparameters of the model (the number of parameters and the structure, e.g., the number of layers and the number of neurons per layer). For this purpose it is customary to let the Training phase be followed by a Validation phase: the labelled data are split into a subset used for the training and a subset used for validation, and the two phases are repeated, changing the hyperparameters, until the bias-variance trade-off is found to be satisfactory. The performance of the algorithm is finally gauged using a separate test dataset, in the Test phase.
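The Training, Validation and Test phases described above can be sketched on a toy regression problem. In this hedged example the model is a polynomial fitted with NumPy's `polyfit`, and the only hyperparameter is the polynomial degree (a proxy for the number of parameters); the data, the split sizes and the range of degrees scanned are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic labelled data (illustrative only): a smooth curve plus noise.
x = rng.uniform(-1.0, 1.0, size=120)
y = np.sin(3.0 * x) + 0.2 * rng.normal(size=x.size)

# Split the labelled data into training, validation and test subsets.
idx = rng.permutation(x.size)
train, valid, test = idx[:60], idx[60:90], idx[90:]

def mse(coeffs, subset):
    """Mean squared error of the polynomial on the given subset of examples."""
    return np.mean((np.polyval(coeffs, x[subset]) - y[subset]) ** 2)

# Training + Validation: repeat the fit for each value of the hyperparameter.
degrees = range(0, 10)
results = {}
for degree in degrees:
    coeffs = np.polyfit(x[train], y[train], degree)      # Training phase
    results[degree] = (coeffs, mse(coeffs, train), mse(coeffs, valid))

# Keep the hyperparameter value with the lowest validation error.
best_degree = min(results, key=lambda d: results[d][2])
best_coeffs = results[best_degree][0]

# Test phase: gauge the performance on data never used above.
test_error = mse(best_coeffs, test)

for degree in degrees:
    print(degree, "train MSE:", results[degree][1], "validation MSE:", results[degree][2])
print("selected degree:", best_degree, "test MSE:", test_error)
```

The training error can only decrease as the degree grows, whereas the validation error eventually stops improving; this is why the choice of the hyperparameter is based on the validation subset and the final performance is quoted on the test subset.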
Some classes of Deep Learning ANNs counter the explosion of the number of parameters by adopting simplifying assumptions that exploit the structure of the problem at hand. Convolutional Neural Networks (CNNs), particularly successful in image processing, connect the input of each neuron to the output of a small number of neurons of the previous layer and try to learn an array of weights (a filter) shared across a whole layer. CNNs reflect the assumption of translation invariance of some spatial patterns: an eye has a given shape irrespective of the position in which it happens to be captured in a photo. Once the pattern of the eye has been learned, it is useful across the whole image processing network: whichever neuron meets the pattern will fire (the convolution of the eye and the filter will be high and pass a given threshold) and detect the presence of the eye. Other ANNs, such as the Recurrent Neural Networks (RNNs), exploit the dependence of the near future of a sequence on the recent past and build a predictive model from a long input sequence: the algorithm scans the sequence by means of a sliding window, assuming time translation invariance to learn the weights.
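The weight-sharing idea behind CNNs can be illustrated by sliding a single small filter over an image. The sketch below is not a trained network: the 3x3 pattern, the filter matched to it, the threshold and the helper names (`cross_correlate2d`, `image_with_pattern`) are hypothetical, chosen only to show that the same filter responds to the same pattern wherever it appears.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide one shared kernel over the image (fully overlapping positions only)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A made-up 3x3 pattern (a small cross), standing in for "the eye".
pattern = np.array([[0., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 0.]])
kernel = pattern - pattern.mean()    # zero-mean filter matched to the pattern

def image_with_pattern(row, col, size=12):
    """Empty image with the pattern placed at the given top-left corner."""
    image = np.zeros((size, size))
    image[row:row + 3, col:col + 3] = pattern
    return image

threshold = 2.0                      # assumed firing threshold
detections = []
for row, col in [(1, 2), (7, 5)]:
    response = cross_correlate2d(image_with_pattern(row, col), kernel)
    n_fired = int(np.sum(response > threshold))
    peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
    detections.append((peak, n_fired))
    print("pattern placed at", (row, col), "-> strongest response at", peak)
```

Only nine weights are used, regardless of the image size, and the position of the strongest response follows the position of the pattern; a CNN learns such filters from the data instead of having them specified by hand.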
Among the notable ANN architectures are the Graph Neural Networks (GNNs), which can take annotated graphs as input and output predictions in several forms, including structured forms such as further graphs. This is relevant to cosmological investigations, since the sparse nature of galaxy data suggests graphs as their best representation in ML models: for instance, features from neighbouring nodes (galaxies) can be used to infer missing features of the central node (e.g., the redshift of a galaxy) through a node-level regression/classification.
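The way a GNN uses the features of neighbouring nodes can be sketched with a single message-passing layer based on mean aggregation, one of several possible aggregation schemes. Everything in the example below is an assumption made for illustration: the four-node graph, the two features per node, the untrained random weights and the function name `message_passing_layer` do not come from any specific GNN library or survey pipeline.

```python
import numpy as np

def message_passing_layer(features, adjacency, W_self, W_neigh):
    """One mean-aggregation layer: each node combines its own features with
    the average of its neighbours' features, followed by a ReLU."""
    degree = adjacency.sum(axis=1, keepdims=True)
    neigh_mean = adjacency @ features / np.maximum(degree, 1.0)
    return np.maximum(features @ W_self + neigh_mean @ W_neigh, 0.0)

# Hypothetical toy graph of 4 galaxies (nodes); a 1 marks a pair of neighbours.
adjacency = np.array([[0., 1., 1., 0.],
                      [1., 0., 1., 0.],
                      [1., 1., 0., 1.],
                      [0., 0., 1., 0.]])

# Two made-up features per node (e.g., two colours), for illustration only.
features = np.array([[1.0, 0.2],
                     [0.8, 0.4],
                     [0.6, 0.6],
                     [0.4, 0.8]])

rng = np.random.default_rng(3)
W_self = rng.normal(scale=0.5, size=(2, 3))     # untrained, illustrative weights
W_neigh = rng.normal(scale=0.5, size=(2, 3))
hidden = message_passing_layer(features, adjacency, W_self, W_neigh)

# Node-level regression head: one number per node (e.g., a redshift estimate
# once the weights have been trained on nodes with known labels).
w_out = rng.normal(scale=0.5, size=3)
prediction = hidden @ w_out
print("hidden representation shape:", hidden.shape)
print("one prediction per node:", prediction)
```

Stacking several such layers lets information travel beyond the immediate neighbours, and training the weights against nodes with known labels turns the layer into a node-level regressor or classifier.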
The goal of this introductory page is to show the less expert reader how specially tailored architectures are typically identified, based on some symmetries of a problem or on other assumptions about the structure of the data. This has enabled breakthroughs for specific problems, as in the case, e.g., of image processing with CNNs. Applications to astrophysics and cosmology are no exception, yet so far most implementations have tried to adapt the data to the algorithms, using simple off-the-shelf ANN solutions. Instead, novel algorithms that can best adapt to the specific features of astronomical data should be investigated and customised. This is particularly true for cosmological applications in which the final goal is to perform cosmological inference directly from the galaxy field, i.e., to extract the values of the fundamental cosmological parameters by analysing the catalogues produced by large redshift surveys. For example, Graph Neural Networks (GNNs) are a class of algorithms that appear to reflect remarkably well the properties of galaxies, accounting for both their 3D distribution and their multi-parameter physical properties. The development of these new techniques is of particular interest in view of the new generation of galaxy surveys of the 2020s. These include in particular the ESA Euclid satellite, due to launch in 2023, for which the University of Milano is responsible for coordinating the cosmological science.