# Seminars

## Generative Adversarial Networks as a Novel Approach for Tectonic Fault and Fracture Extraction in High-Resolution Satellite and Airborne Optical Images

• Monday, June 01, 2020. Online.
• Bahram Jafrasteh (Inria)
• We develop a novel method based on Deep Convolutional Networks (DCNs) to automate the identification and mapping of fracture and fault traces in optical images. The method employs two DCNs in a two-player game: a first network, called the Generator, learns to segment images so that they resemble the ground truth; a second network, called the Discriminator, measures the differences between the ground truth image and each segmented image and sends its score as feedback to the Generator; based on these scores, the Generator progressively improves its segmentation. As we condition both networks on the ground truth images, the method is called a Conditional Generative Adversarial Network (CGAN). We propose a new loss function for both the Generator and the Discriminator networks to improve their accuracy. Using two criteria and a manually annotated optical image, we compare the generalization performance of the proposed method to that of a classical DCN architecture, U-net. The comparison demonstrates the suitability of the proposed CGAN architecture. Further work is, however, needed to improve its efficiency.

## Explainability in machine learning models

• Monday, May 11, 2020. Online.
• Dr. Alvaro Barbero Jimenez (IIC)
• During the last few years machine learning models have proven to be successful tools in a wide variety of industrial applications, allowing the development of more accurate automated decision systems. Nevertheless, traditional or heavily regulated sectors such as law, banking or insurance are still reluctant to adopt this technology, given its lack of simple explanations supporting its decisions. Hence, solutions created for these sectors are usually restricted to simple model families such as linear models or small decision trees. Even in applications where these constraints do not exist, blatant failures in the decisions made by the model can be hard to debug and solve. Some recent studies also argue that machine learning models might amplify undesirable biases present in the training data, and that it is difficult to pinpoint which component of the model (if any) is responsible for this effect. In this talk I will present a number of practical techniques to obtain explanations from any machine learning model, regardless of its complexity. These techniques range from measuring the global importance of each input feature in the model to analyzing the impact of each feature on the prediction for a particular instance. I will introduce the concept of Shapley values, from the field of game theory, and show how they can be effectively used to produce useful model explanations. We will also see how these techniques are implemented in the scikit-learn and SHAP libraries.
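
To make the idea of Shapley values concrete, the following is a minimal sketch that computes them exactly by averaging each feature's marginal contribution over all feature orderings. The toy `model`, the `instance`, the all-zeros `baseline` and the rule "features outside the coalition take their baseline value" are assumptions made for illustration; this is not the SHAP library's API, which relies on approximations and background data instead.

```python
from itertools import permutations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n_features
    for order in permutations(range(n_features)):
        coalition = set()
        previous = value(frozenset(coalition))
        for j in order:
            coalition.add(j)
            current = value(frozenset(coalition))
            phi[j] += current - previous  # marginal contribution of feature j
            previous = current
    n_orders = factorial(n_features)
    return [p / n_orders for p in phi]


# Hypothetical toy model with an interaction between features 0 and 2.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


instance = [1.0, 2.0, 1.0]   # instance whose prediction we want to explain
baseline = [0.0, 0.0, 0.0]   # assumed reference input


def value(coalition):
    # Features in the coalition keep the instance's value, the rest use the baseline.
    x = [instance[j] if j in coalition else baseline[j] for j in range(len(instance))]
    return model(x)


phi = shapley_values(value, n_features=3)
print(phi)
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The enumeration grows factorially with the number of features, so it is only practical for very small examples.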

## Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences

• Monday, April 06, 2020. Online.
• Dr. Alberto Suarez (UAM)
• This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.
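
The equivalence between kernel ridge regression and the Gaussian process posterior mean can be written out explicitly. The notation below is assumed for illustration and does not appear in the abstract: training inputs $x_1, \dots, x_n$ with target vector $y$, kernel $k$ with reproducing kernel Hilbert space $\mathcal{H}_k$, kernel matrix $K$ with entries $K_{ij} = k(x_i, x_j)$, the vector $k_X(x) = (k(x_1, x), \dots, k(x_n, x))^T$, a regularization constant $\lambda > 0$ and a Gaussian noise variance $\sigma^2$.

$$\hat{f} = \arg\min_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2 + \lambda \|f\|_{\mathcal{H}_k}^2 \quad \Longrightarrow \quad \hat{f}(x) = k_X(x)^T \left(K + n \lambda I\right)^{-1} y$$

$$f \sim \mathcal{GP}(0, k), \quad y_i = f(x_i) + \xi_i, \quad \xi_i \sim \mathcal{N}(0, \sigma^2) \quad \Longrightarrow \quad m(x) = k_X(x)^T \left(K + \sigma^2 I\right)^{-1} y$$

Under these assumptions the two predictors coincide whenever $\sigma^2 = n \lambda$.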

## Enhancing Bayesian Optimization with Combinatorial-valued variables

• Monday, March 09, 2020. Edif. B, B-351, EPS-UAM.
• Eduardo C. Garrido Merchan (UAM)
• Combinatorial variables can take so many values that standard Bayesian Optimization becomes infeasible. Handling this kind of variable could enhance the hyperparameter tuning of machine learning methods such as ensembles, neural architecture search, and the combination of kernels of SVMs in the multi-task setting or of Gaussian Processes. Ideally, we would like a methodology that can deal with combinatorial-valued variables along with other real, integer and categorical-valued variables. Naive approaches to combinatorial-valued variables in Bayesian Optimization generate 2^D and 2*D dimensions, where D is the number of basic values of the combinatorial-valued variable (for example, for a combinatorial variable with values {void, A, B, C, AB, AC, BC, ABC}, D=3, namely A, B and C). In this work, we provide a simple methodology that only generates D dimensions, hence reducing the complexity of optimizing a combinatorial-valued variable to that of a categorical-valued variable. As the work is in the implementation phase, I will also describe in the talk the machine learning applications on which I want to test the methodology.

## Sequential Training of Neural Networks with Gradient Boosting

• Monday, February 24, 2020. Edif. B, B-351, EPS-UAM.
• Dr. Gonzalo Martinez (UAM)
• This talk presents a novel technique based on gradient boosting to train a shallow neural network (NN). Gradient boosting is an additive expansion algorithm in which a series of models are trained sequentially to approximate a given function. A neural network with one hidden layer can also be seen as an additive model, where the scalar product of the responses of the hidden layer and its weights provides the final output of the network. Instead of training the network as a whole, the proposed algorithm trains the network sequentially in T steps. First, the bias term of the network is initialized with a constant approximation that minimizes the average loss on the data. Then, at each step, a portion of the network, composed of K neurons, is trained to approximate the pseudo-residuals on the training data computed from the previous iteration. Finally, the T partial models and the bias are integrated into a single NN with TK neurons in the hidden layer. We show that the proposed algorithm is more robust to overfitting than a standard neural network with respect to the number of neurons of the last hidden layer. Furthermore, we show that the design of the proposed method makes it possible to reduce the number of neurons used without a significant reduction in generalization ability. This allows the model to be adapted to different classification speed requirements on the fly. Extensive experiments in classification and regression tasks, as well as in combination with a deep convolutional neural network, show better generalization performance than a standard neural network.
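
The sequential procedure described in this abstract can be sketched in formulas. The notation is assumed for illustration and is not taken from the talk: a scalar output, a loss $\ell$, $N$ training pairs $(x_i, y_i)$, an activation $\sigma$, and hidden-unit parameters $v_{t,k}$, $c_{t,k}$ with output weights $w_{t,k}$. Any step size or shrinkage factor that the actual method may use is omitted.

$$F_0(x) = b = \arg\min_{c} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, c)$$

$$r_i^{(t)} = -\left. \frac{\partial \ell(y_i, F)}{\partial F} \right|_{F = F_{t-1}(x_i)}, \qquad t = 1, \dots, T$$

$$h_t(x) = \sum_{k=1}^{K} w_{t,k}\, \sigma\!\left(v_{t,k}^T x + c_{t,k}\right) \ \text{fitted so that } h_t(x_i) \approx r_i^{(t)}, \qquad F_t(x) = F_{t-1}(x) + h_t(x)$$

$$F_T(x) = b + \sum_{t=1}^{T} \sum_{k=1}^{K} w_{t,k}\, \sigma\!\left(v_{t,k}^T x + c_{t,k}\right)$$

The last expression is a single network with $TK$ hidden neurons, and truncating the outer sum at some $t < T$ gives a smaller network, which is one way to read the remark about adapting the model to different speed requirements.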

## Comparing weak and unsupervised Anomaly Detection at the LHC

• Monday, January 28, 2020. Edif. B, B-351, EPS-UAM.
• Pablo Martin (IFT-UAM)
• The Large Hadron Collider (LHC) has the potential to address many of the most fundamental questions in physics. The vast amount of data collected by this experiment requires powerful techniques to extract the relevant information. One of the most important problems to address at the LHC is how to classify entire events with complex topologies as signal or background correctly and efficiently. Given that the collected data are complex and high-dimensional and potential new physics may have subtle signatures, the significant improvements offered by deep learning are critical to fully exploit the rich datasets from LHC experiments. In particular, several studies have proved that weak-supervision and unsupervised techniques could be used to find new physics signals in an agnostic, universal way. In this talk, I will present two of these techniques and will compare how they perform on anomaly detection.

## Overview of classification methods for functional data

• Monday, January 13, 2020. Edif. B, B-351, EPS-UAM.
• Carlos Ramos (UAM)
• Functional data analysis is a branch of Statistics where the objects of interest are functions instead of vectors. The problem of supervised functional classification consists in assigning functions to previously defined groups, using the information obtained from preassigned training data. In this context, some classification methods from multivariate analysis can still be applied, while others must be adapted, and sometimes it is possible to take advantage of the additional structure of functional observations to develop methods specialized for this type of data. In this talk I will present the main families of classification methods for functional data and discuss the fundamental ideas behind them.

## Multi-target regression via input space expansion: treating targets as inputs

• Monday, November 11, 2019. Edif. B, B-351, EPS-UAM.
• Saman Emami (UAM)
• In many practical applications of supervised learning the task involves the prediction of multiple target variables from a common set of input variables. When the prediction targets are binary the task is called multi-label classification, while when the targets are continuous the task is called multi-target regression. In both tasks, target variables often exhibit statistical dependencies, and exploiting them in order to improve predictive accuracy is a core challenge. A family of multi-label classification methods addresses this challenge by building a separate model for each target on an expanded input space where other targets are treated as additional input variables. Despite the success of these methods in the multi-label classification domain, their applicability and effectiveness in multi-target regression have not been studied until now. In this paper, we introduce two new methods for multi-target regression, called Stacked Single-Target and Ensemble of Regressor Chains, by adapting two popular multi-label classification methods of this family. Furthermore, we highlight an inherent problem of these methods (a discrepancy of the values of the additional input variables between training and prediction) and develop extensions that use out-of-sample estimates of the target variables during training in order to tackle this problem. The results of an extensive experimental evaluation carried out on a large and diverse collection of datasets show that, when the discrepancy is appropriately mitigated, the proposed methods attain consistent improvements over the independent regressions baseline. Moreover, two versions of Ensemble of Regressor Chains perform significantly better than four state-of-the-art methods including regularization-based multi-task learning methods and a multi-objective random forest approach.

## A Gaussian Process Model for Multi-class Classification with Noisy Inputs

• Monday, November 4, 2019. Edif. B, B-351, EPS-UAM.
• Carlos Villacampa Calvo (UAM)
• Multi-class classification problems arise in a huge variety of fields, from industry to science. Particularly in the case of the latter, it is common to have datasets whose inputs are the result of experimental measurements, which unavoidably come with associated uncertainties. While inference tasks on data with noisy attributes have long been considered in the context of regression, to our knowledge there is no analogous study in the literature concerning classification problems. Motivated by a concrete classification problem and a dataset coming from astrophysics, this work addresses that gap. We incorporate the uncertainties associated with the input dimensions in the construction of a Gaussian Process (GP) model for solving multi-class classification problems.

## A Convex Formulation of SVM-based Multi-Task Learning

• Monday, October 28, 2019. Edif. B, B-351, EPS-UAM.
• Carlos Ruiz Pastor (UAM)
• Multi-task learning (MTL) is a powerful framework that makes it possible to exploit the similarities between several machine learning tasks in order to improve on the solutions obtained by independent task-specific models. Support Vector Machines (SVMs) are well suited for this, and Cai et al. have proposed additive MTL SVMs, where the final model corresponds to the sum of a common model shared between all tasks and a task-specific model. In this work we propose a different formulation of this additive approach, in which the final model is a convex combination of the common and task-specific ones. The convex mixing hyper-parameter $\lambda$ takes values between $0$ and $1$, where a value of $1$ is mathematically equivalent to a common model for all the tasks, whereas a value of $0$ corresponds to independent task-specific models. We will show that for $\lambda$ values between $0$ and $1$, this convex approach is equivalent to the additive one of Cai et al. when the other SVM parameters are properly selected. On the other hand, the predictions of the proposed convex model are also convex combinations of the common and specific predictions, making this formulation easier to interpret. Finally, the hyper-parameters of this convex formulation are easier to select, since $\lambda$ is constrained to the $[0, 1]$ region, in contrast with the unbounded range in the additive MTL SVMs.
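
As a reading aid, the convex combination of predictions described in this abstract can be written as follows, where $g$ denotes the common model, $h_r$ the model specific to task $r$, and $f_r$ the final predictor for that task. These symbols are assumed here for illustration and are not taken from the talk.

$$f_r(x) = \lambda\, g(x) + (1 - \lambda)\, h_r(x), \qquad \lambda \in [0, 1]$$

Setting $\lambda = 1$ leaves only the common model $g$, and setting $\lambda = 0$ leaves only the task-specific models $h_r$, which matches the two limiting cases mentioned in the abstract.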

## Least Lp Support Vector Machines

• Monday, October 21, 2019. Edif. B, B-351, EPS-UAM.
• Dr. Carlos María Alaíz Gudín (UAM)
• This work aims to extend the Least Squares Support Vector Machine model, also known as Kernel Ridge Regression, to use a general Lp loss that, as shown experimentally, will determine the distribution of the errors. The resultant dual problem contains a penalization using the dual Lq norm. As limit cases, when p tends to infinity the resultant problem is a Lasso-like kernel model that will be sparse over the coefficients corresponding to the patterns (dual variables), instead of those corresponding to the input features (primal variables), and that can be used for signal compression. On the other hand, when p tends to one, the standard Support Vector Regression model with the insensitivity equal to zero is recovered.

## Deep Gaussian Processes with Importance-Weighted Variational Inference

• Monday, October 7, 2019. Edif. B, B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• Deep Gaussian processes (DGPs) can model complex marginal densities as well as complex mappings. Non-Gaussian marginals are essential for modelling real-world data, and can be generated from the DGP by incorporating uncorrelated variables into the model. Previous work on DGP models has introduced noise additively and used variational inference with a combination of sparse Gaussian processes and mean-field Gaussians for the approximate posterior. Additive noise attenuates the signal, and the Gaussian form of the variational distribution may lead to an inaccurate posterior. We instead incorporate noisy variables as latent covariates, and propose a novel importance-weighted objective, which leverages analytic results and provides a mechanism to trade off computation for improved accuracy. Our results demonstrate that the importance-weighted objective works well in practice and consistently outperforms classical variational inference, especially for deeper models.

## No-Regret Bayesian Optimization with Unknown Hyperparameters

• Monday, September 30, 2019. Edif. B, B-351, EPS-UAM.
• Eduardo C. Garrido Merchán (UAM)
• Bayesian optimization (BO) based on Gaussian process models is a powerful paradigm to optimize black-box functions that are expensive to evaluate. While several BO algorithms provably converge to the global optimum of the unknown function, they assume that the hyperparameters of the kernel are known in advance. This is not the case in practice and misspecification often causes these algorithms to converge to poor local optima. In this paper, we present the first BO algorithm that is provably no-regret and converges to the optimum without knowledge of the hyperparameters. During optimization we slowly adapt the hyperparameters of stationary kernels and thereby expand the associated function class over time, so that the BO algorithm considers more complex function candidates. Based on the theoretical insights, we propose several practical algorithms that achieve the empirical sample efficiency of BO with online hyperparameter estimation, but retain theoretical convergence guarantees. We evaluate our method on several benchmark problems.

## Diffusion Variational Autoencoders

• Monday, July 08, 2019. Edif. B, B-351, EPS-UAM.
• Dr. Angela Fernández Pascual (UAM)
• In this seminar we will discuss the paper by H. Li et al., "Diffusion Variational Autoencoders". Variational Autoencoders (VAEs) have become one of the most popular deep learning approaches to unsupervised learning and data generation. However, traditional VAEs suffer from the constraint that the latent space must follow a simple prior (e.g. normal, uniform), independent of the initial data distribution. This work proposes a variational autoencoder that maps manifold-valued data to its diffusion map coordinates in the latent space, resamples in a neighborhood around a given point in the latent space, and learns a decoder that maps the newly resampled points back to the manifold. The framework is built on SpectralNet (Shaham et al., 2018; we will also review this work), and is capable of learning this data-dependent latent space without computing the eigenfunctions of the Laplacian explicitly. The paper proves that the diffusion variational autoencoder framework is capable of learning a locally bi-Lipschitz map between the manifold and the latent space, and that the proposed resampling method around a point in the latent space maps points back to the manifold around the original point.

## Introduction to Quantum Programming

• Monday, July 01, 2019, 12:00 h. B-351, EPS-UAM.
• Dr. Alberto Suárez (UAM)
• In this seminar we will give a brief introduction to programming quantum computers based on qubits. An n-qubit computer operates by manipulating the joint state of its qubits by means of quantum logic gates. These gates are arranged in quantum logic circuits that transform the initial state of the n-qubit system into a final state. The goal is to design the circuit so that the solution to the problem at hand can be read off from this final state. Unlike classical bits, quantum bits can be in superposition and entangled states. The properties of these states are such that, in some cases, they make it possible to solve problems on quantum computers with an advantage over classical ones. This advantage will be illustrated with the implementation of the Deutsch-Jozsa algorithm for determining whether a function of n classical bits is balanced or constant.
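
The Deutsch-Jozsa algorithm mentioned in this abstract can be illustrated with a small classical simulation of the state vector. This is a minimal sketch under stated assumptions: it uses the phase-oracle form of the algorithm (the ancilla qubit is folded into a sign flip), the function names `hadamard_all` and `deutsch_jozsa` are hypothetical, and `f` is assumed to return 0 or 1 and to be either constant or balanced.

```python
from math import sqrt


def hadamard_all(state):
    """Apply a Hadamard gate to every qubit (fast Walsh-Hadamard transform)."""
    state = list(state)
    size = len(state)
    half = 1
    while half < size:
        for start in range(0, size, 2 * half):
            for i in range(start, start + half):
                a, b = state[i], state[i + half]
                state[i] = (a + b) / sqrt(2)
                state[i + half] = (a - b) / sqrt(2)
        half *= 2
    return state


def deutsch_jozsa(f, n_qubits):
    """Classify f: {0, ..., 2^n - 1} -> {0, 1} as 'constant' or 'balanced'."""
    size = 2 ** n_qubits
    state = [0.0] * size
    state[0] = 1.0                       # start in |00...0>
    state = hadamard_all(state)          # uniform superposition over all inputs
    state = [(-1) ** f(x) * amp for x, amp in enumerate(state)]  # phase oracle
    state = hadamard_all(state)          # interfere the amplitudes
    p_all_zeros = state[0] ** 2          # probability of measuring |00...0>
    return "constant" if p_all_zeros > 0.5 else "balanced"


print(deutsch_jozsa(lambda x: 1, 3))       # a constant function
print(deutsch_jozsa(lambda x: x & 1, 3))   # a balanced function (lowest bit)
```

After the second layer of Hadamard gates, the amplitude of the all-zeros state equals the average of the signs $(-1)^{f(x)}$, so it is $\pm 1$ for a constant function and $0$ for a balanced one. The simulation evaluates `f` on every input, so it only illustrates the circuit; the quantum advantage lies in a real device querying the oracle once.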

## Policy Gradient Methods for Deep Reinforcement Learning

• Monday, June 10, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Dr. Alvaro Barbero (IIC)
• In the past 4 years new results in the field of Reinforcement Learning have extended this classic branch of Machine Learning into new application fields. These include highly non-linear and complex scenarios such as solving the game of Go, controlling robotic hands, or strategic planning based on visual information. Many of these successes stem from the development of learning algorithms that allow a deep neural network to learn a control policy directly from raw observations, without the traditional requirement of a model of the problem being solved. In this talk we will give an overview of a family of such algorithms known as Policy Gradients, which fit nicely with recent developments in Deep Learning architectures and frameworks.

## Optimal Step-Size for FISTA

• Monday, May 27, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Dr. Carlos María Alaíz (UAM)
• FISTA is a proximal method that aims to minimize the sum of a smooth term and a non-smooth one, which makes it especially suited for the optimization problems that arise in the Regularized Learning framework. A major issue of FISTA is that, in its classical formulation, it requires knowing the Lipschitz constant of the gradient of the smooth term, or estimating it using a backtracking strategy, in order to define the step-size. In this work, a new approach is proposed to estimate the largest step-size that satisfies a fundamental inequality of the standard convergence proof of the algorithm. Several experimental results show how this new criterion can outperform the previous approaches.
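
For context, the classical formulation referred to in this abstract can be sketched as follows. This is a minimal illustration of FISTA with the fixed step-size 1/L, where L is the Lipschitz constant of the gradient of the smooth term; it is not the step-size estimation proposed in the talk. The toy problem (a separable quadratic plus an L1 penalty) and the names `fista`, `a`, `c` and `lam` are assumptions chosen so that the Lipschitz constant and the exact solution are known.

```python
from math import sqrt


def soft_threshold(v, tau):
    """Proximal operator of tau * |v| (soft-thresholding)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def fista(a, c, lam, n_iter=500):
    """Minimize 0.5 * sum_i a[i] * (x[i] - c[i])**2 + lam * sum_i |x[i]|."""
    lipschitz = max(a)          # Lipschitz constant of the smooth term's gradient
    step = 1.0 / lipschitz      # classical fixed step-size
    x_prev = [0.0] * len(a)
    y = list(x_prev)
    t = 1.0
    for _ in range(n_iter):
        grad = [ai * (yi - ci) for ai, yi, ci in zip(a, y, c)]
        # Proximal gradient step taken from the extrapolated point y.
        x = [soft_threshold(yi - step * gi, step * lam) for yi, gi in zip(y, grad)]
        # Momentum update of the extrapolation sequence.
        t_next = (1.0 + sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_next
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev


a = [1.0, 4.0, 10.0]     # curvatures of the smooth term (hypothetical values)
c = [3.0, -0.5, 0.05]    # unregularized minimizer (hypothetical values)
lam = 1.0
x_hat = fista(a, c, lam)
print(x_hat)
```

For this separable problem the exact minimizer is the soft-thresholding of each `c[i]` with threshold `lam / a[i]`, which gives a simple way to check the iterates. In realistic problems L is rarely available in closed form, which is the difficulty that motivates backtracking and the criterion discussed in the talk.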

## Flexible Kernel Selection in Multitask Support Vector Regression

• Monday, May 13, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Carlos Ruiz (UAM)
• Multitask Learning (MTL) aims to solve several related problems at the same time, exploiting the similarities between them. In particular, Support Vector Machines (SVMs) can be used for MTL, providing a model that is potentially more flexible than a classical SVM trained over the data of all the tasks, and which can use more information than a set of independent models trained over each one of the tasks. Nevertheless, a major drawback of these SVMs is the large number of hyperparameters if no simplifying assumptions are made, which prevents the use of standard selection methods such as grid or random search. In particular, both the common kernel and the kernels associated with each one of the tasks have to be selected. In this paper, we propose an approach to choose these kernels based on Gaussian Processes (GPs), whose Bayesian perspective allows one to deal naturally with several parameters. In particular, a GP is trained for each task, and the resultant kernel parameters are transferred to the SVM-based MTL model. Several experiments on real-world datasets show empirically the usefulness of this approach and the advantages of the GP-based kernel selection method.

## An Automatic Method for the Identification of Cystine Crystals in Urine Sediment

• Monday, May 6, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Vicenzo Abichequer (UAM)
• The analysis of urine sediments (urinalysis) is a common procedure in diagnostic laboratories, and allows the identification of important diseases. This procedure is usually done by a professional, who performs a visual inspection of microscope slides containing urine, aiming to identify crystals, bacteria and other relevant elements, which makes it a laborious task. The automation of this task is of great value to medicine and related areas, raising the quality and reliability of diagnosis and reducing the time spent on these tasks. This paper describes a new method for automating the analysis of urine sediments in digital images, with the main goal of finding cystine crystals. Real images obtained from a microscope and from some public image databases were used to test the developed algorithm, which achieved satisfactory results, with a sensitivity of 73.72% and a specificity of 93.08%.

## Scikit-datasets workshop

• Monday, April 8, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• David Díaz Vico (UAM)
• The setup of a Machine Learning experiment most often involves the tedious process of searching, downloading and preprocessing data. Fortunately, several well-known data repositories are available, but their directory structure and file formats are not normalized, and several tools are usually required to download, decompress and reformat the data. From the point of view of the Machine Learning practitioner, this is mostly a necessary but uninteresting process. Scikit-datasets (https://github.com/daviddiazvico/scikit-datasets) is a Python library developed by David Díaz-Vico and Carlos Ramos, designed to be fully compatible with the popular Scikit-learn library, that aims to free the user from the task of gathering data so that all the effort can be focused on the Machine Learning core of the experiment.

## Classification without labels: Learning from mixed samples in high energy physics

• Monday, April 1, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Dr. Bryan Zaldivar (IFT-UAM)
• Modern machine learning techniques can be used to construct powerful models for difficult collider physics problems. In many applications, however, these models are trained on imperfect simulations due to a lack of truth-level information in the data, which risks the model learning artifacts of the simulation. In this paper, we introduce the paradigm of classification without labels (CWoLa) in which a classifier is trained to distinguish statistical mixtures of classes, which are common in collider physics. Crucially, neither individual labels nor class proportions are required, yet we prove that the optimal classifier in the CWoLa paradigm is also the optimal classifier in the traditional fully-supervised case where all label information is available. After demonstrating the power of this method in an analytical toy example, we consider a realistic benchmark for collider physics: distinguishing quark- versus gluon-initiated jets using mixed quark/gluon training samples. More generally, CWoLa can be applied to any classification problem where labels or class proportions are unknown or simulations are unreliable, but statistical mixtures of the classes are available.
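
The reason why a classifier trained on mixed samples can be optimal for the underlying classes can be sketched in a few lines. The notation is assumed for illustration: $p_S$ and $p_B$ are the densities of the two classes, and the mixed samples $M_1$ and $M_2$ contain unknown fractions $f_1 > f_2$ of the first class.

$$p_{M_j}(x) = f_j\, p_S(x) + (1 - f_j)\, p_B(x), \qquad j = 1, 2$$

$$L_{M_1/M_2}(x) = \frac{p_{M_1}(x)}{p_{M_2}(x)} = \frac{f_1\, L_{S/B}(x) + (1 - f_1)}{f_2\, L_{S/B}(x) + (1 - f_2)}, \qquad L_{S/B}(x) = \frac{p_S(x)}{p_B(x)}$$

$$\frac{\partial L_{M_1/M_2}}{\partial L_{S/B}} = \frac{f_1 - f_2}{\left(f_2\, L_{S/B} + 1 - f_2\right)^2} > 0$$

Since the mixed-sample likelihood ratio is a strictly increasing function of the class likelihood ratio, thresholding one is equivalent to thresholding the other, and by the Neyman-Pearson lemma both define the same family of optimal classifiers. The values of $f_1$ and $f_2$ are not needed for this argument, only the fact that they differ.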

## Flexible Variational Approximations for Deep Gaussian Processes

• Monday, March 25, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Carlos Villacampa Calvo (UAM)
• Deep Gaussian processes (DGPs) are a multi-layer hierarchical generalization of Gaussian processes (GPs) that have been shown to provide better generalization performance and better uncertainty estimation of the target values than standard GPs. However, exact inference is infeasible in DGPs, even for regression problems, since the predictive distribution in the second layer and above is no longer Gaussian due to the randomness of the inputs coming from the previous layer. Therefore, in practical applications, one needs to use approximate inference techniques, such as Variational Inference (VI). While VI allows us to efficiently train DGPs, the approximate posterior distribution is often a parametric distribution, typically a multivariate Gaussian, which may result in approximation bias. In this talk, we propose to relax the Gaussianity assumption of the approximate posterior distribution by introducing extra layers of latent variables in both the DGP model and the posterior approximation.

## Diffusion Autoencoders

• Thursday, March 25, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Extending work by Mishne et al. (Diffusion Nets), we propose Deep Diffusion Autoencoders (DDA) that learn an encoder-decoder map using a composite loss function that simultaneously minimizes the reconstruction error at the output layer and the distance to a Diffusion Map embedding in the bottleneck layer. These DDA are thus able to reconstruct new patterns from points in the embedding space in a way that preserves geometry of the sample and, as a consequence, our experiments show that they may provide a powerful tool for data augmentation.

## Unbiased Implicit Variational Inference

• Thursday, March 07, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• We develop unbiased implicit variational inference (UIVI), a method that expands the applicability of variational inference by defining an expressive variational family. UIVI considers an implicit variational distribution obtained in a hierarchical manner using a simple re-parameterizable distribution whose variational parameters are defined by arbitrarily flexible deep neural networks. Unlike previous works, UIVI directly optimizes the evidence lower bound (ELBO) rather than an approximation to the ELBO. We demonstrate UIVI on several models, including Bayesian multinomial logistic regression and variational autoencoders, and show that UIVI achieves both tighter ELBO and better predictive performance than existing approaches at a similar computational cost.

## Deep Neural Decision Forests

• Thursday, February 28, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Dr. Gonzalo Martínez (UAM)
• We present a novel approach to enrich classification trees with the representation learning ability of deep (neural) networks within an end-to-end trainable architecture. We combine these two worlds via a stochastic and differentiable decision tree model, which steers the formation of latent representations within the hidden layers of a deep network. The proposed model differs from conventional deep networks in that a decision forest provides the final predictions, and it differs from conventional decision forests by introducing a principled, joint and global optimization of split and leaf node parameters. Our approach compares favorably to other state-of-the-art deep models on a large-scale image classification task like ImageNet.

## Mutually Exclusive Spike-and-Slab Priors for Structured Feature Selection in Multi-class Classification Problems

• Thursday, February 21, 2019, 12:00 h. Edif. B, B-351, EPS-UAM.
• Alejandro Catalina (UAM)
• Many multi-class classification problems are characterized by a small number of observed instances $N$ and a large number of explanatory variables $P$. Under this setting, even a simple linear model is too complex and can lead to severe over-fitting. A strong regularization that has been proven to give better generalization results is to assume sparsity in the coefficients of the linear model, that is, that many coefficient values are equal to zero. This also has the advantage of improving interpretability, since only a reduced number of explanatory attributes will contribute to the predictions made by the model. A very successful approach to introduce the sparsity assumption under a Bayesian setting is to assume a spike-and-slab prior for the model coefficients. Consider $K$ different classes in the problem. Under this setting, there are $K$ different hyperplanes to be estimated, i.e., one hyperplane $w_k$ for each class $k = 1, \dots, K$. Assume the labeling rule is $y_i = \arg\max_k \, x_i^T w_k + \epsilon_{ki}$, with $\epsilon_{ki}$ some Gaussian noise. Let $W \in \mathbb{R}^{K \times P}$ be a matrix summarizing all the hyperplanes. Under a spike-and-slab prior, each entry of $W$, $w_{jk}$, is assumed to be generated either from a Gaussian distribution with zero mean and fixed variance or from a point probability mass at zero. Nevertheless, there are several different sparsity patterns that can be introduced in the matrix $W$. For example, one can allow zeros to appear arbitrarily at any entry. Another approach is to assume that the entries of each row are either all equal to zero or all different from zero. This is equivalent to assuming that each feature is either relevant for all $K$ class labels or completely irrelevant. In this work, we explore a third sparsity pattern in $W$: if a feature is relevant, it is only relevant to discriminate one particular class label from all the others. This is equivalent to assuming that either all entries in a row of $W$ are equal to zero, or only one entry in that row (i.e., the one corresponding to the class label) is different from zero. This has the advantage over the previous sparsity patterns of improving the interpretability of the model. With this goal, we describe a novel spike-and-slab prior distribution for the model coefficients $W$. Exact Bayesian inference in a model under such a prior is intractable. Therefore, one has to use approximate methods in practice. For this, we describe an efficient algorithm based on expectation propagation. We plan to evaluate the performance of such a prior (and approximate inference method) in terms of prediction error and interpretability in several multi-class problems, involving real and synthetic data. A comparison with Markov Chain Monte Carlo methods will also be carried out to assess the bias in the approximation obtained by expectation propagation.
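
One common way to write the spike-and-slab prior described in this abstract is through binary indicator variables. The symbols $z_{jk}$ (indicator that coefficient $w_{jk}$ is active), $\sigma^2$ (the fixed slab variance) and $\delta$ (a point mass at zero) are assumed here for illustration and do not appear in the abstract.

$$p(w_{jk} \mid z_{jk}) = z_{jk}\, \mathcal{N}(w_{jk} \mid 0, \sigma^2) + (1 - z_{jk})\, \delta(w_{jk}), \qquad z_{jk} \in \{0, 1\}$$

In this notation, the three sparsity patterns correspond to different constraints on the indicators of feature $j$: no constraint at all (zeros at arbitrary entries), $z_{j1} = \dots = z_{jK}$ (a feature is relevant for all classes or for none), and $\sum_{k=1}^{K} z_{jk} \le 1$ (a relevant feature discriminates a single class), the last one being the mutually exclusive pattern explored in the talk.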

## Towards Calibrated Probabilistic Classifiers for Optimal Interpretation and Decision-Making.

• Monday, January 21, 2019, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Daniel Ramos (UAM)
• In this talk, we will first address the problem of optimal decision-making involving binary (two-class) classifiers. We will start by exemplifying these problems in the area of speech processing, in particular in the task of speaker recognition. Due to the diversity of techniques implemented for these tasks, their outputs, typically expressed as some kind of score, are often meaningless unless compared to reference values or to a threshold, and are not easily comparable among systems in terms of interpretability, interoperability or combination. Fortunately, calibration has been proposed as a way of mapping classifier outputs to a common probabilistic framework, where consistent comparison and interpretation are possible. We will describe the concept of calibration, citing some techniques to transform the scores of any binary classifier into probabilistic values. We will state the problem and its elements in an information-theoretical way to provide additional insights. Some examples of the advantages of using this probabilistic framework will also be shown in the context of NIST Speaker Recognition Evaluations. Finally, we will enumerate current lines of research related to other areas and tasks where these ideas can be applied or extended, such as forensic chemistry (e.g., comparison of physico-chemical profiles of glass fragments under severe data scarcity) and multiclass calibration (e.g., calibrated outputs of computer vision DNN models).
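
As an illustration of turning raw scores into probabilities, the sketch below fits a logistic (Platt-style) mapping from score to probability. This is one common calibration technique and not necessarily one covered in the talk. The synthetic scores, the function names and the plain gradient-descent fit are all assumptions made for the example.

```python
import numpy as np

def fit_logistic_calibration(scores, labels, lr=0.1, n_iter=2000):
    """Fit p(y=1 | s) = sigmoid(a * s + b) by gradient descent on the log-loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        a -= lr * np.mean((p - y) * s)   # gradient of the mean log-loss w.r.t. a
        b -= lr * np.mean(p - y)         # gradient of the mean log-loss w.r.t. b
    return a, b

def calibrate(scores, a, b):
    """Map raw scores to probabilities with the fitted logistic transform."""
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores, dtype=float) + b)))

rng = np.random.default_rng(0)
# Synthetic, uncalibrated scores: class 1 tends to score higher than class 0.
scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
a, b = fit_logistic_calibration(scores, labels)
probs = calibrate(scores, a, b)
```

Once every system reports calibrated probabilities of this kind, a threshold chosen from the costs and priors of the application has the same meaning for all of them.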

## Neural Ordinary Differential Equations.

• Thursday, January 17, 2019, 12:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Ramos Carreño (UAM)
• We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a blackbox differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
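
The core idea of parameterizing the derivative of the hidden state can be sketched in a few lines. This is a minimal forward pass only, under stated simplifications: a fixed-step Euler integrator stands in for the black-box adaptive solver mentioned in the abstract, the small tanh network and its random weights are hypothetical, and no training (adjoint or otherwise) is shown.

```python
import numpy as np

def odeint_euler(f, h0, t0, t1, n_steps):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler."""
    h = np.asarray(h0, dtype=float)
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        h = h + dt * f(h, t)
        t += dt
    return h

def make_dynamics(W1, b1, W2, b2):
    """Small MLP that plays the role of the derivative of the hidden state."""
    def f(h, t):
        return W2 @ np.tanh(W1 @ h + b1) + b2
    return f

rng = np.random.default_rng(0)
d, m = 3, 8                                   # hidden-state and MLP widths (arbitrary)
f = make_dynamics(rng.normal(size=(m, d)), np.zeros(m),
                  0.1 * rng.normal(size=(d, m)), np.zeros(d))
h0 = rng.normal(size=d)                       # input to the continuous-depth block
h1 = odeint_euler(f, h0, 0.0, 1.0, n_steps=100)   # output of the block
```

The number of Euler steps plays the role that depth plays in a residual network. Changing `n_steps` trades precision for speed without changing the parameters.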

## What Do Non-Linear SVMs See?

• Thursday, December 13, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Carlos María Alaíz (UAM)
• A proper visualization of a learning machine can be crucial to understanding its behaviour in a given problem, making it possible to identify and solve the difficulties that may arise and thus improve its performance. In this talk I will present some interesting results that we came across when looking for a proper representation of a non-linear SVM model. Although the SVM is an intuitive model with a clear geometrical interpretation, the implicit embedding induced by a non-linear kernel (such as the RBF kernel) prevents us from visualizing it directly. In this work we introduce a very simple approach that can give an idea of the space where the SVM is solving the problem. Several experiments illustrate the proposed method.

## Semi-Supervised Data Visualization

• Thursday, November 29, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dra. Ángela Fernández Pascual (UAM)
• The objective of this work is to design a semi-supervised approach to data visualization. The proposed method aims to provide a preliminary idea of the structure of the data while taking into account partial information about the classes of the patterns. To this end, it uses different manifold learning techniques in which the kernel is modified to include the available labels. Although this work is still in progress, the preliminary results show that the method improves the visualization thanks to the incorporated supervised information.

## Analyzing gamma-rays of the Galactic Center with Deep Learning

• Thursday, November 15, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Bryan Zaldivar (IFT-UAM)
• In this talk I will discuss a paper dealing with the well-known "Galactic Center Excess", an unexpectedly large emission of photons in the gamma-ray spectrum from the vicinity of the Milky Way's center. The astrophysics community is trying to elucidate whether this emission is the long-sought smoking gun of a dark matter signal. However, it may just be caused by pulsars. The authors train convolutional neural networks in order to determine how plausible one explanation is versus the other.

## Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models.

• Thursday, October 18, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Eduardo C. Garrido-Merchán (UAM)
• Summarising a high-dimensional data set with a low-dimensional embedding is a standard approach for exploring its structure. In this seminar I will describe a model that performs this task, the Gaussian process latent variable model (GP-LVM). I will provide an overview of some existing techniques for discovering such embeddings and introduce a probabilistic interpretation of principal component analysis (PCA), the dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be non-linearised through Gaussian processes, which yields the GP-LVM.

## Sacred and Sacredboard

• Thursday, May 31, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Ramos Carreño (UAM)
• In this seminar, I will talk about Sacred, a tool to help you configure, organize, log and reproduce experiments in Python. I will introduce how an example experiment can be set up in Sacred, explaining the options that Sacred provides to configure, program and execute experiments, and store the results of an experiment. I will also introduce Sacredboard, a dashboard to show and filter Sacred experiments.

## Deep Fisher Discriminant Analysis

• Thursday, May 17, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• David Díaz Vico (UAM)
• The linear nature of Fisher Discriminant Analysis and the usual eigen-analysis approach to its solution have limited the application of its elegant underlying idea. In this work we take advantage of some recent, partially equivalent formulations based on standard least squares regression to develop a simple Deep Neural Network (DNN) extension of Fisher analysis that greatly improves its ability to cluster sample projections around their class means while keeping these apart. This is shown by the much better accuracies and g scores that class mean classifiers achieve when applied to the features provided by simple DNN architectures, compared with those obtained from linear Fisher features.

## Vote-boosting ensembles

• Thursday, May 10, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Maryam Sabzevari (UAM)
• Vote-boosting is a sequential ensemble learning method in which the individual classifiers are built on different weighted versions of the training data. To build a new classifier, the weight of each training instance is determined in terms of the degree of disagreement among the current ensemble predictions for that instance. For low class-label noise levels, especially when simple base learners are used, emphasis should be placed on instances for which the disagreement rate is high. When more flexible classifiers are used and as the noise level increases, the emphasis on these uncertain instances should be reduced. In fact, at sufficiently high levels of class-label noise, the focus should be on instances on which the ensemble classifiers agree. The optimal type of emphasis can be automatically determined using cross-validation. An extensive empirical analysis using the beta distribution as emphasis function illustrates that vote-boosting is an effective method to generate ensembles that are both accurate and robust.
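
The role of the beta distribution as an emphasis function can be sketched as follows. The details are assumptions made for illustration, not the authors' exact procedure: the problem is taken to be binary, disagreement is summarized by the fraction of current ensemble members voting for one of the classes, and a symmetric beta density (`a == b`) is evaluated at that fraction to obtain normalized instance weights.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def vote_boosting_weights(vote_fractions, a, b, eps=1e-6):
    """Normalized instance weights from the fraction of votes for one class."""
    clipped = [min(max(v, eps), 1 - eps) for v in vote_fractions]  # avoid log(0)
    raw = [beta_pdf(v, a, b) for v in clipped]
    total = sum(raw)
    return [r / total for r in raw]

fractions = [0.5, 0.9, 0.1]   # hypothetical vote fractions for three instances
emphasize_disagreement = vote_boosting_weights(fractions, a=2.0, b=2.0)
emphasize_agreement = vote_boosting_weights(fractions, a=0.5, b=0.5)
```

With `a = b > 1` the density peaks at 0.5, so instances on which the ensemble is split receive more weight. With `a = b < 1` the density peaks near 0 and 1, shifting the emphasis to instances on which the ensemble agrees.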

## A Unifying View of Sparse Approximate Gaussian Process Regression

• Thursday, April 19, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Gonzalo Hernández Muñoz (UAM)
• We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs.

## Modified Frank Wolfe Algorithm for Enhanced Sparsity in Support Vector Machine Classifiers

• Thursday, April 19, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Carlos María Alaíz (UAM)
• This work proposes a new algorithm for training a re-weighted L2 Support Vector Machine (SVM), inspired by the re-weighted Lasso algorithm of Candès et al. and by the equivalence between Lasso and SVM shown recently by Jaggi. In particular, the margin required for each training vector is set independently, defining a new weighted SVM model. These weights are selected to be binary, and they are automatically adapted during the training of the model, resulting in a variation of the Frank-Wolfe optimization algorithm with essentially the same computational complexity as the original algorithm. As shown experimentally, this algorithm is computationally cheaper to apply since it requires fewer iterations to converge, and it produces models that have a sparser representation in terms of support vectors and are more stable with respect to the selection of the regularization hyper-parameter.

## Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions and its potential extension to Constrained-Multiobjective Scenarios

• Thursday, April 12, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Eduardo C. Garrido Merchán (UAM)
• This seminar will dive into the Parallel Predictive Entropy Search (PPES) method, a novel algorithm for Bayesian optimization of expensive black-box objective functions. At each iteration, PPES aims to select a batch of points which will maximize the information gain about the global maximizer of the objective. Well known strategies exist for suggesting a single evaluation point based on previous observations, while far fewer are known for selecting batches of points to evaluate in parallel. The few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch. Synthetic and real world applications, including problems in machine learning, rocket science and robotics show the benefits of using this method. I will close the seminar with a preliminary exposition of an extension of PPES into a Constrained-Multiobjective scenario.

## Variational Inference for Gaussian Process Models with Linear Complexity

• Thursday, March 22, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Villacampa (UAM)
• Large-scale Gaussian process inference has long faced practical challenges due to time and space complexity that is superlinear in dataset size. While sparse variational Gaussian process models are capable of learning from large-scale data, standard strategies for sparsifying the model can prevent the approximation of complex functions. In this work, the authors propose a variational Gaussian process model that decouples the representation of mean and covariance functions in reproducing kernel Hilbert space, achieving a generalization of previous models, as well as a gain in terms of time and space complexity, regardless of the choice of kernels, likelihoods, and inducing points. This strategy makes the adoption of large-scale expressive Gaussian process models possible. Several experiments on regression tasks confirm that this decoupled approach greatly outperforms previous sparse variational Gaussian process inference procedures.

## Faster hyper-parameter search via conjugate SMO

• Thursday, March 8, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Alberto Torres (UAM)
• Conjugate Gradient Descent is a classic acceleration technique that is able to improve the convergence of Gradient Descent by adding a momentum term. In this work we review the classic optimization theory and explore adding the same momentum term to the SMO algorithm, which is the state-of-the-art solver for both non-linear SVC and SVR. Experiments comparing standard SMO and Conjugate SMO are carried out, both in terms of iterations and execution time. We also try to gain insight into the types of problems for which the conjugate version obtains a meaningful advantage. Finally, we explore a hyper-parametrization setting, where we care not only about solving a single model but also about searching for the best C, gamma and epsilon values in a grid.
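
For reference, the momentum-like term in the classic method looks as follows when minimizing a quadratic. This sketch shows textbook linear conjugate gradient only, as an assumption about what the "classic optimization theory" part refers to. It is not the Conjugate SMO algorithm from the talk, and the test problem is synthetic.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Minimize 0.5 * x^T A x - b^T x for a symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x            # residual, i.e. the negative gradient
    d = r.copy()             # first direction is plain steepest descent
    rs = r @ r
    for _ in range(max_iter or len(b)):
        if np.sqrt(rs) < tol:
            break
        Ad = A @ d
        alpha = rs / (d @ Ad)          # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d      # new gradient plus a multiple of the old direction
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M.T @ M + np.eye(5)      # symmetric positive definite by construction
b = rng.normal(size=5)
x = conjugate_gradient(A, b, max_iter=50)
```

The only difference from gradient descent with exact line search is the last update of `d`, which reuses the previous direction instead of discarding it.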

## Accelerating Composite Minimisation

• Thursday, March 1, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Alejandro Catalina Feliú (UAM)
• In this seminar we review the Forward-Backward splitting algorithm for solving composite minimisation problems. A widely known acceleration of this algorithm adds a momentum sequence, proposed by Nesterov, resulting in the FISTA algorithm. Finally, we will review several recent acceleration techniques for this algorithm and apply them to some synthetic and real-world problems, showing how they can be used to improve upon the performance of a state-of-the-art Block Coordinate Algorithm for the Sparse Group Lasso model.
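
A minimal sketch of FISTA is given below for the plain Lasso, chosen here only because its proximal operator (soft-thresholding) is simple. The Sparse Group Lasso from the talk and the more recent acceleration schemes are not reproduced, and the synthetic data and function names are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (the backward step)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(X, y, w, lam):
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

def fista_lasso(X, y, lam, n_iter=200):
    """Forward-backward splitting with Nesterov's momentum sequence."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the smooth gradient
    w = np.zeros(X.shape[1])
    z = w.copy()                       # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)                           # forward (gradient) step
        w_next = soft_threshold(z - grad / L, lam / L)     # backward (proximal) step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = w_next + ((t - 1.0) / t_next) * (w_next - w)   # momentum
        w, t = w_next, t_next
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=40)
w_hat = fista_lasso(X, y, lam=1.0)
```

Setting `z = w_next` in place of the momentum line recovers the unaccelerated Forward-Backward iteration, which makes the two easy to compare on the same problem.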

## A practical tutorial of TensorFlow

• Thursday, February 15, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• María Cortés (UAM)
• TensorFlow is an open-source software library for dataflow programming used for machine learning applications such as neural networks or Gaussian processes. This talk consists of a short tutorial on this useful tool.

## Is XGBoost the best possible classifier? A comparative analysis

• Thursday, February 1, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Gonzalo Martínez Muñoz (UAM)
• XGBoost is a scalable ensemble technique based on gradient boosting that has proven to be a reliable and efficient solver of machine learning challenges. This work presents a practical analysis of how this novel technique works in terms of training speed, generalization performance and parameter setup. In addition, a comprehensive comparison between XGBoost, Random Forests and Gradient Boosting classifiers has been performed using default and tuned parameters for all three models. This comparison may indicate that XGBoost is not necessarily the best.

## An empirical analysis of heterogeneous ensemble as ensembles of homogeneous ensembles

• Thursday, January 11, 2018, 12:00 h. Edif.B , B-351, EPS-UAM.
• Maryam Sabzevari (UAM)
• In ensemble methods, the outputs of a collection of diverse classifiers are combined in the expectation that the final ensemble prediction will be more accurate than the individual ones. Heterogeneous ensembles consist of predictors of different types, which are likely to have different biases. If these biases are complementary, in the sense that the different types of classifiers err on different instances, the combination of their decisions can be beneficial. In this work, we propose to investigate whether the increased diversity of heterogeneous ensembles with respect to homogeneous ones can be exploited to obtain more accurate prediction systems. To this end, we consider ensembles of M homogeneous ensembles of different types. Heterogeneous ensembles built in this manner can be seen as a convex combination of the basis homogeneous ensembles. Therefore, depending on the weight of the homogeneous ensembles, a particular heterogeneous combination can be represented by a point in a regular simplex in M dimensions. The optimal combination of homogeneous ensembles can be determined using cross-validation or, if bootstrap samples are used to build the individual classifiers, out-of-bag data. An empirical analysis of such combinations of bootstrapped ensembles composed of neural networks, SVMs, and random trees (i.e. a standard random forest) illustrates the gains that can be achieved by this method.
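
The simplex view of a heterogeneous ensemble can be sketched by enumerating candidate weight vectors and forming the corresponding convex combination of outputs. The grid resolution, the function names and the example class-probability vectors are hypothetical, and the selection of the best point by cross-validation or out-of-bag error is not shown.

```python
from itertools import product

def simplex_grid(M, resolution):
    """All weight vectors with M non-negative entries that are multiples of
    1/resolution and sum to 1 (candidate points in the simplex)."""
    points = []
    for counts in product(range(resolution + 1), repeat=M - 1):
        rest = resolution - sum(counts)
        if rest >= 0:
            points.append(tuple(c / resolution for c in counts + (rest,)))
    return points

def combine(probabilities, weights):
    """Convex combination of the class-probability vectors of M ensembles."""
    n_classes = len(probabilities[0])
    return [sum(w * p[c] for w, p in zip(weights, probabilities))
            for c in range(n_classes)]

grid = simplex_grid(M=3, resolution=4)
# Hypothetical outputs of three homogeneous ensembles for a single instance.
probs = [[0.7, 0.3], [0.4, 0.6], [0.55, 0.45]]
mixed = combine(probs, grid[5])
```

The vertices of the simplex correspond to the pure homogeneous ensembles, so a search over the grid always includes them as special cases.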

## EPS-UAM doctoral course: Ordinal classification. Dr. Pedro A. Gutiérrez (UCO), January 22-25, 2018

• Doctoral course: Ordinal classification
• Instructor: Dr. Pedro Antonio Gutiérrez. Escuela Politécnica Superior, Universidad de Córdoba.
• Dates: January 22 to 24, 2018, 11:00-13:00, and January 25, 2018, 10:00-12:00.
• Venue: Laboratory 14, Building A, EPS-UAM.
• Summary: In machine learning, it is common to find classification problems in which the class labels can be ordered according to a given scale. For example, Parkinson's disease can be diagnosed from functional images according to the degree of affection visible in the image: healthy patient, mild, moderate and severe affection. This ordinal nature of the data can and should be exploited to obtain classifiers that are more robust and whose predictions are as close as possible (on the ordinal scale) to the true label. There are many application areas in which ordinal classification problems are especially relevant, such as medicine, economics or sociology. This course aims to introduce the fundamental characteristics of ordinal classification problems, together with the main techniques that have been applied in machine learning to address them specifically. The contents include: basic notions of machine learning (nominal classification); definition of ordinal classification and its characteristics; evaluation metrics for ordinal classification; a taxonomy of algorithms for ordinal classification; and preprocessing methods for ordinal classification.

## Research seminar: Semi-supervised learning in imbalanced classification settings: building a donor-recipient matching model for liver transplantation

• Speaker: Dr. Pedro Antonio Gutiérrez (UCO).
• Date and venue: Thursday, January 25, 2018, 12:00. Sala de Grados A, EPS-UAM.
• Summary: Liver transplantation is a promising and widely accepted treatment for patients with end-stage liver disease. However, this treatment is limited by the shortage of donors, which causes many deaths on the waiting list. Our work proposes a new donor-recipient matching system that uses machine learning to predict graft survival after transplantation, using a database of transplants performed at King's College Hospital, London. From a methodological point of view, the main novelty of the system is that we address the imbalance of the problem, in terms of the number of patterns per class, by considering semi-supervised learning and analyzing its potential to obtain more robust and fairer models. To this end, we propose two different sources of unlabeled data (recent transplants whose outcome is not yet known and virtual donor-recipient pairings), together with two methods for using these data in building the model (a semi-supervised algorithm and a label propagation scheme). We show how the virtual pairs and the label propagation method are able to alleviate the imbalance problem, which constitutes a novel way of addressing this kind of problem. The results obtained show that the joint use of real and synthetic information helps to improve and stabilize the performance of the model and leads to fairer decisions. Finally, we propose using the developed model together with a severity criterion associated with the recipient, in order to reach a compromise between the severity of the patient's condition and the prognosis of the transplant.
• Speaker CV: Pedro Antonio Gutiérrez Peña received his PhD in Computer Science from the University of Granada, his degree in Computer Engineering from the University of Seville, and his Master's in Soft Computing and Intelligent Systems also from the University of Granada. He is currently an Associate Professor (Profesor Titular) in the Department of Computer Science and Numerical Analysis at the University of Córdoba, having previously worked at the Instituto de Agricultura Sostenible (CSIC). He belongs to the AYRNA research group (Aprendizaje y Redes Neuronales Artificiales, Learning and Artificial Neural Networks). His research focuses on machine learning, covering the design of artificial neural networks using bio-inspired techniques, the development and evaluation of models for ordinal classification, and the application of all these techniques to real problems in renewable energy or biomedicine.

## Optimal designs for longitudinal and functional data

• Thursday, December 14, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Ramos Carreño (UAM)
• We propose novel optimal designs for longitudinal data for the common situation where the resources for longitudinal data collection are limited, by determining the optimal locations in time where measurements should be taken. As for all optimal designs, some prior information is needed to implement the optimal designs proposed. We demonstrate that this prior information may come from a pilot longitudinal study that has irregularly measured and noisy measurements, where for each subject one has available a small random number of repeated measurements that are randomly located on the domain. A second possibility of interest is that a pilot study consists of densely measured functional data and one intends to take only a few measurements at strategically placed locations in the domain for the future collection of similar data. We construct optimal designs by targeting two criteria: optimal designs to recover the unknown underlying smooth random trajectory for each subject from a few optimally placed measurements such that squared prediction errors are minimized; optimal designs that minimize prediction errors for functional linear regression with functional or longitudinal predictors and scalar responses, again from a few optimally placed measurements. The optimal designs proposed address the need for sparse data collection when planning longitudinal studies, by taking advantage of the close connections between longitudinal and functional data analysis. We demonstrate in simulations that the designs perform considerably better than randomly chosen design points and include a motivating data example from the Baltimore Longitudinal Study of Aging. The designs are shown to have an asymptotic optimality property.

## Near-perfect classification of Gaussian processes

• Thursday, November 30, 2017, 12:00 h. Edif.B , B-428, EPS-UAM.
• Dr. Alberto Suárez González (UAM)
• In machine learning problems with functional data, the instances available for induction are characterized by a function of a continuous parameter (e.g. time or space). These problems exhibit significant qualitative differences with their multivariate counterparts. One of the most striking properties is the possibility of achieving zero error in many functional classification problems. In this talk, we provide a complete characterization of near-perfect classification for data that are characterized by trajectories sampled from Gaussian processes. The characterization is based on the analysis of Bayes' rule for this functional problem. This optimal classifier is the singular limit of the quadratic discriminant, which is Bayes' rule for the discrete-time approximation of the continuous-time process. As a side result of this analysis, we provide a general rule for the equivalence of any two Gaussian processes that generalizes Varberg and Shepp's results for Brownian processes.

## Doubly Stochastic Variational Inference for Deep Gaussian Processes

• Thursday, November 23, 2017, 12:00 h. Edif.B , B-428, EPS-UAM.
• Gonzalo Hernández Muñoz (UAM)
• Gaussian processes (GPs) are a good choice for function approximation as they are flexible, robust to overfitting, and provide well-calibrated predictive uncertainty. Deep Gaussian processes (DGPs) are multi-layer generalizations of GPs, but inference in these models has proved challenging. Existing approaches to inference in DGP models assume approximate posteriors that force independence between the layers and do not work well in practice. We present a doubly stochastic variational inference algorithm that does not force independence between layers. With our method of inference, we demonstrate that a DGP model can be used effectively on data ranging in size from hundreds to a billion points. We provide strong empirical evidence that our inference scheme for DGPs works well in practice in both classification and regression.

## A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation

• Thursday, November 16, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Villacampa Calvo (UAM)
• Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarize the actual data. In this paper, the authors develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. Crucially, they demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.

## Dealing with Integer-valued Variables in Bayesian Optimization with Gaussian Processes

• Thursday, November 2, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Eduardo César Garrido Merchán (UAM)
• In this talk, I will explain how to deal with integer-valued variables in Bayesian Optimization with Gaussian Processes. Bayesian optimization (BO) methods are useful for optimizing functions that are expensive to evaluate, lack an analytical expression and whose evaluations can be contaminated by noise. These methods rely on a probabilistic model of the objective function, typically a Gaussian process (GP), upon which an acquisition function is built. This function guides the optimization process and measures the expected utility of performing an evaluation of the objective at a new point. GPs assume continuous input variables. When this is not the case, such as when some of the input variables take integer values, one has to introduce extra approximations. A common approach is to round the suggested variable value to the closest integer before evaluating the objective. We show that this can lead to problems in the optimization process and describe a more principled approach to account for input variables that are integer-valued. Both synthetic and real experiments show the utility of the approach, which significantly improves the results of standard BO methods on problems involving integer-valued variables. Further work done on this topic will also be presented, describing how to deal with categorical-valued variables and comparing our approach with the popular BO tool SMAC, which works with Random Forests.

## Convex Formulation for Kernel PCA and its Use in Semi-Supervised Learning

• Thursday, October 26, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Carlos María Alaíz (UAM)
• In this paper, Kernel PCA is reinterpreted as the solution to a convex optimization problem. Actually, there is a constrained convex problem for each principal component, so that the constraints guarantee that the principal component is indeed a solution, and not a mere saddle point. Although these insights do not imply any algorithmic improvement, they can be used to further understand the method, formulate possible extensions and properly address them. As an example, a new convex optimization problem for semi-supervised classification is proposed, which seems particularly well-suited whenever the number of known labels is small. Our formulation resembles a Least Squares SVM problem with a regularization parameter multiplied by a negative sign, combined with a variational principle for Kernel PCA. Our primal optimization principle for semi-supervised learning is solved in terms of the Lagrange multipliers. Numerical experiments in several classification tasks illustrate the performance of the proposed model in problems with only a few labeled data.

## An Introduction to Variational Inference with Normalizing Flows

• Thursday, October 5, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• The choice of approximate posterior distribution is one of the core problems in variational inference. Most applications of variational inference employ simple families of posterior approximations in order to allow for efficient inference, focusing on mean-field or other simple structured approximations. This restriction has a significant impact on the quality of inferences made using variational methods. In this talk, I will introduce a recent approach for specifying flexible, arbitrarily complex and scalable approximate posterior distributions. The approximations are distributions constructed through a normalizing flow, whereby a simple initial density is transformed into a more complex one by applying a sequence of invertible transformations until a desired level of complexity is attained.
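
The mechanics of a normalizing flow can be sketched with one simple family of invertible maps, the planar flow, together with the change-of-variables bookkeeping for the log-density. The choice of the planar flow, the parameter values and the function names are assumptions made for illustration, and the talk may cover other transformations. The constraint that keeps each layer invertible is not enforced here. Instead the example uses `u` parallel to `w`, which satisfies it.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b); return f(z) and log|det df/dz|."""
    a = np.tanh(w @ z + b)
    psi = (1.0 - a ** 2) * w                      # gradient of tanh(w.z + b) w.r.t. z
    log_det = np.log(np.abs(1.0 + u @ psi))       # determinant of I + u psi^T
    return z + u * a, log_det

def flow_log_density(z0, layers):
    """Push a standard-normal sample through the layers and track log q(z_K)."""
    log_q = -0.5 * (z0 @ z0) - 0.5 * len(z0) * np.log(2.0 * np.pi)
    z = z0
    for u, w, b in layers:
        z, log_det = planar_flow(z, u, w, b)
        log_q -= log_det                          # change-of-variables correction
    return z, log_q

rng = np.random.default_rng(0)
layers = []
for _ in range(4):
    w = rng.normal(size=2)
    layers.append((0.5 * w, w, 0.1))              # u parallel to w keeps the map invertible
z0 = rng.normal(size=2)                           # sample from the simple initial density
zK, log_qK = flow_log_density(z0, layers)
```

Stacking more layers yields a more flexible density while the log-density stays cheap to evaluate, since each layer contributes only one scalar log-determinant.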

## Auto-adaptive Laplacian Pyramids with Local Resolution

• Thursday, July 6, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dra. Ángela Fernández Pascual (UAM)
• Auto-adaptive Laplacian Pyramids (ALP) is a method for function smoothing and interpolation. ALP is an extension of the standard Laplacian Pyramids model that incorporates a modified Leave One Out Cross Validation (LOOCV) procedure that avoids the large cost of standard LOOCV and offers the following advantages: (i) it automatically selects the optimal function resolution (stopping time) adapted to the data and its noise, (ii) it is easy to apply as it does not require parameter selection, (iii) it does not overfit the training set and (iv) it adds no extra cost compared to other classical interpolation methods. We introduce an improved version of this method where the stopping criterion and, hence, the final sigma value for the mixture of Gaussians, will be point-dependent. This property is useful when the data have regions with different behaviors, so the width of the Gaussian should be different in each region. Several experiments show the advantages and particular behavior of this new algorithm and its performance as an interpolation method for real datasets.

## GLASSES: Relieving The Myopia Of Bayesian Optimisation

• Thursday, June 29, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Eduardo César Garrido Merchán (UAM)
• In this talk I will present an approach to deal with myopia in Bayesian Optimization called GLASSES: Global optimisation with Look-Ahead through Stochastic Simulation and Expected-loss Search. GLASSES permits the consideration of dozens of evaluations into the future by approximating the ideal look-ahead loss function, which is expensive to evaluate, by a cheaper alternative in which the future steps of the algorithm are simulated beforehand.
• The presentation is based on the work by: González, Javier, Michael Osborne, and Neil Lawrence. "GLASSES: Relieving the myopia of Bayesian optimisation." Artificial Intelligence and Statistics. 2016.

## An Introduction to Adversarial Variational Bayes

• Thursday, June 15, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• In this talk I will present a paper recently accepted at ICML 2017. The paper describes a general framework for approximate inference called "Adversarial Variational Bayes". This framework allows for using arbitrarily complicated distributions in variational inference, a popular technique for approximate Bayesian inference. More precisely, one can consider distributions that lack a closed form expression for the probability density function. Adversarial Variational Bayes also scales to large datasets since it allows for stochastic optimization. It can also be applied to arbitrarily complicated models in which the required expectations are intractable. In summary, such a framework has important applications in probabilistic programming languages (in which one only has to describe a probabilistic model and provide the data) since it allows for approximate inference with very little human supervision.

## Alternating Reflections for proximal optimization

• Thursday, May 25, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Álvaro Barbero Jiménez (IIC)
• In the context of regularized learning it is quite common to find optimization problems that can be decomposed as the sum of a convex smooth term and a convex non-smooth term. Modern optimization strategies use proximal methods to solve such problems in a modular way, performing minimization by a combination of gradient steps for the smooth term and proximity steps for the non-smooth term. However, when more than one non-smooth term is present, which is common in image processing problems, such an approach might prove impractical. In this talk I will present how, for some particular problems of the form "smooth + non-smooth + non-smooth", an effective method based on alternating reflections can be derived, as opposed to more standard alternating projections methods. This method will make use of concepts from the fields of both proximal methods and submodular optimization, which I will introduce briefly. Then I will show practical applications in background-foreground splitting, image denoising and image deconvolution, where the proposed strategy is superior to established and state-of-the-art methods.
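
For the case with a single non-smooth term, the pattern of a gradient step followed by a proximity step can be sketched as follows. This is plain proximal gradient (ISTA) on a least-squares term plus an l1 penalty, not the alternating-reflections method of the talk; the problem instance and function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                         # gradient step on the smooth term
        x = soft_threshold(x - step * grad, step * lam)  # proximity step on the non-smooth term
    return x

# With A equal to the identity the minimizer is soft_threshold(b, lam).
A = np.eye(3)
b = np.array([3.0, -0.5, 1.5])
x_hat = proximal_gradient(A, b, lam=1.0)
```

The modularity mentioned in the abstract shows up here as the fact that swapping the penalty only requires swapping `soft_threshold` for another proximity operator.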

## Feature selection in FDA

• Thursday, April 20, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Ramos Carreño (UAM)
• In this talk I will consider the problem of feature selection with functional data. We will see how feature selection naturally arises in an example of a binary classification problem in Functional Data Analysis (FDA). I will introduce the Maxima Hunting (MH) and Recursive Maxima Hunting (RMH) algorithms for feature selection in FDA, and I will show some of the improvements over the original RMH algorithm that we have developed in the context of my Master's Thesis.

## Feature selection: A principled approach

• Thursday, April 6, 2017, 12:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Alberto Suarez Gonzalez (UAM)
• In this talk we will consider the problem of feature selection from a fundamental perspective. The goal is to provide a unifying framework for the plethora of methods available, and to understand their limitations. The presentation is based on material from G Brown, A Pocock, MJ Zhao, M Luján "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection" Journal of Machine Learning Research 13 (Jan), 27-66 (2012) http://www.jmlr.org/papers/v13/brown12a.html and Francisco Macedo, M. Rosário Oliveira, António Pacheco, Rui Valadas "A theoretical framework for evaluating forward feature selection methods based on mutual information" https://arxiv.org/abs/1701.07761

## An Introduction to Generative Adversarial Networks

• Thursday, March 23, 2017, 12:00 h. Edif.B, B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• In this talk I will introduce a framework for estimating generative models via an adversarial process in which two models are simultaneously trained. The first model, G, captures the data distribution. A second model, D, aims at estimating the probability that a sample came from the training data, rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or approximate inference networks during either training or generation of samples. Several experiments demonstrate the potential of such a framework.
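
For reference, the minimax two-player game described above is usually written as the following value function. The symbols $p_{\text{data}}$ (the data distribution) and $p_z$ (a noise prior whose samples are fed to G) follow the standard GAN formulation and do not appear in the abstract.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```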

## Gaussian Process Kernels for Pattern Discovery and Extrapolation and further work: Using the Spectral Mixture Kernel in Bayesian Optimization.

• Thursday, February 16, 2017, 12:00 h. Edif.B, B-351, EPS-UAM.
• Eduardo César Garrido Merchán (UAM)
• In this talk I will review the paper "Gaussian Process Kernels for Pattern Discovery and Extrapolation" and give a first insight into the application of this idea in Bayesian Optimization. Gaussian processes are rich distributions over functions, which provide a Bayesian non-parametric approach to smoothing and interpolation. This paper introduces simple closed form kernels that can be used within Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density, the Fourier transform of a kernel, with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. As the kernel is expressive, it has a high number of parameters. If we take a point estimate of the hyperparameters computed by Maximum Likelihood and we do not have much data, as in the initial iterations of Bayesian Optimization, we can overfit. We will discuss an approach that tries to avoid this issue.
• The presentation is based on the work by: Wilson, Andrew, and Ryan Adams. "Gaussian process kernels for pattern discovery and extrapolation." Proceedings of the 30th International Conference on Machine Learning (ICML-13). 2013.
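
A minimal sketch of the one-dimensional spectral mixture kernel from the cited paper is given below. Modelling the spectral density with a Gaussian mixture of weights `w_q`, means `mu_q` and variances `v_q` yields the closed form shown in the docstring. The function name and the parameter values in the example are illustrative assumptions.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """1-D spectral mixture kernel matrix between inputs x1 and x2.

    k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q),
    where (w_q, mu_q, v_q) parameterize the Gaussian mixture on the spectral density.
    """
    tau = np.asarray(x1, dtype=float)[:, None] - np.asarray(x2, dtype=float)[None, :]
    K = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        K += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return K

x = np.linspace(0.0, 5.0, 50)
K = spectral_mixture_kernel(x, x, weights=[1.0, 0.5], means=[0.2, 1.0], variances=[0.01, 0.05])
```

Each mixture component contributes three hyperparameters, which makes concrete why the parameter count grows quickly and why point estimates can overfit when data are scarce.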

## Deep Gaussian Processes for Regression using Approximate Expectation Propagation

• Thursday, February 9, 2017, 12:00 h. Edif.B, B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are nonparametric probabilistic models and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep models. This paper develops a new approximate Bayesian learning scheme that enables DGPs to be applied to a range of medium to large scale regression problems for the first time. The new method uses an approximate Expectation Propagation procedure and a novel and efficient extension of the probabilistic backpropagation algorithm for learning. We evaluate the new method for non-linear regression on eleven real-world datasets, showing that it always outperforms GP regression and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks. As a by-product, this work provides a comprehensive analysis of six approximate Bayesian methods for training neural networks.

## Application of machine learning methods on ore grade estimation in a copper deposit

• Wednesday, November 30, 2016, 16:00 h. Edif.B , B-351, EPS-UAM.
• Bahram Jafrasteh (Isfahan University of Technology)
• Ore grade estimation is one of the most important stages in evaluating the economic viability of an ore deposit. In the past decades geostatistical methods have been used to predict ore grade values of unknown samples from the measured data. These methods often rely on assumptions about the data distribution and the continuity of ore grade values which in most cases are not satisfied. Machine learning algorithms are applied to this problem as alternative approaches. After defining the problem, the predictive performance of some machine learning algorithms, including neural networks, random forests and Gaussian processes, in a copper deposit is compared to ordinary kriging as the traditional geostatistical method.

## Importance Weighted Autoencoders with Uncertain Neural Network Weights

• Wednesday, November 23, 2016, 16:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Daniel Hernández Lobato (UAM)
• In this talk I will describe the variational autoencoder as a powerful generative model for unsupervised machine learning. Variational autoencoders can be used to find a low-dimensional representation of the observed data, in the form of a small set of latent variables. Furthermore, they can also infer the underlying mechanism that generates, from these latent variables, new data instances similar to the observed ones. These models need to perform posterior inference during learning, a task that is carried out by training, in addition to the top-down generative network, a bottom-up recognition network. The training process maximizes a lower bound on the probability of the observed data. This lower bound can be made tighter during training. In this case, the resulting model is called the importance weighted autoencoder. Such a model has better generalization properties than the variational autoencoder. In this talk I will show that it is possible to further improve the performance of the importance weighted autoencoder by considering random network weights.
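
For reference, the tighter lower bound mentioned above is commonly written as follows, with $q(h \mid x)$ the recognition network, $p(x, h)$ the generative model and $k$ the number of importance samples (notation assumed here, as the abstract gives none). Setting $k = 1$ recovers the usual variational autoencoder bound, and the bound does not decrease as $k$ grows.

```latex
\mathcal{L}_k(x) =
  \mathbb{E}_{h_1, \dots, h_k \sim q(h \mid x)}
  \left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)} \right],
\qquad
\log p(x) \;\ge\; \mathcal{L}_{k+1}(x) \;\ge\; \mathcal{L}_k(x)
```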

## Diffusion Nets

• Wednesday, November 16, 2016, 16:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Adil Omari (Carlos III University)
• Non-linear manifold learning enables high-dimensional data analysis, but requires out-of-sample-extension methods to process new data points. In this paper, we propose a manifold learning algorithm based on deep learning to create an encoder, which maps a high-dimensional dataset to its low-dimensional embedding, and a decoder, which takes the embedded data back to the high-dimensional space. Stacking the encoder and decoder together constructs an autoencoder, which we term a diffusion net, that performs out-of-sample-extension as well as outlier detection. We introduce new neural net constraints for the encoder, which preserve the local geometry of the points, and we prove rates of convergence for the encoder. Also, our approach is efficient in both computational complexity and memory requirements, as opposed to previous methods that require storage of all training points in both the high-dimensional and the low-dimensional spaces to calculate the out-of-sample-extension and the pre-image.

## An urn model for majority voting in classification ensembles

• Wednesday, November 2, 2016, 16:00 h. Edif.B , B-351, EPS-UAM.
• Dr. Gonzalo Martínez Muñoz (UAM)
• In this work we analyze the class prediction of parallel randomized ensembles by majority voting as an urn model. For a given test instance, the ensemble can be viewed as an urn of marbles of different colors. A marble represents an individual classifier. Its color represents the class label prediction of the corresponding classifier. The sequential querying of classifiers in the ensemble can be seen as draws without replacement from the urn. An analysis of this classical urn model based on the hypergeometric distribution makes it possible to estimate the confidence on the outcome of majority voting when only a fraction of the individual predictions is known. These estimates can be used to speed up the prediction by the ensemble. Specifically, the aggregation of votes can be halted when the confidence in the final prediction is sufficiently high. If one assumes a uniform prior for the distribution of possible votes the analysis is shown to be equivalent to a previous one based on Dirichlet distributions. The advantage of the current approach is that prior knowledge on the possible vote outcomes can be readily incorporated in a Bayesian framework. We show how incorporating this type of problem-specific knowledge into the statistical analysis of majority voting leads to faster classification by the ensemble and allows us to estimate the expected average speed-up beforehand.
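
The urn analysis can be sketched for the binary case with a uniform prior as follows. The function name and the example numbers are illustrative assumptions, and the talk's treatment of multiple classes and informative priors is not reproduced here.

```python
from math import comb

def majority_confidence(votes_for, votes_seen, ensemble_size):
    """P(class 1 wins the full majority vote | partial votes), binary case.

    The urn holds `ensemble_size` marbles, K of them of colour "class 1".
    After drawing `votes_seen` marbles without replacement and seeing
    `votes_for` of colour "class 1", the posterior over K under a uniform
    prior is proportional to the hypergeometric likelihood.
    """
    T, t, k = ensemble_size, votes_seen, votes_for
    # Hypergeometric likelihood of the observed draw for each possible K
    # (the common factor 1 / comb(T, t) cancels in the ratio below).
    likelihood = [comb(K, k) * comb(T - K, t - k) for K in range(T + 1)]
    total = sum(likelihood)
    # Class 1 wins if it holds a strict majority of the T votes.
    return sum(l for K, l in enumerate(likelihood) if 2 * K > T) / total

# 9 of the first 10 queried classifiers (out of 101) voted for class 1.
conf = majority_confidence(votes_for=9, votes_seen=10, ensemble_size=101)
```

Querying can stop as soon as this confidence, or its complement, exceeds a chosen threshold, which is where the speed-up described in the abstract comes from.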

## Scalable Multi-Class Gaussian Process Classification via Expectation Propagation

• Wednesday, October 5, 2016, 16:00 h. Edif.B , B-351, EPS-UAM.
• Carlos Villacampa Calvo (UAM)
• Multi-class Gaussian process classifiers (MGPCs) are a Bayesian approach to non-parametric multi-class classification with the advantage of producing probabilistic outputs that measure uncertainty in the predictions. However, the training cost for this approach is O(n^3), where n is the number of instances. In this talk, we will first introduce Gaussian processes and their use in multi-class classification via the expectation propagation algorithm, followed by an approximation that allows us to use stochastic gradients and achieve a training cost that does not depend on n.

## Community Detection in Directed Networks

• Tuesday, July 26, 2016, 12:00 h. Sala de Grados A, EPS-UAM
• Dr. Carlos Alaiz (Katholieke Universiteit Leuven)
• Communities in directed networks have often been characterized as regions with a high density of links, or as sets of nodes with certain patterns of connection. Our approach for community detection combines the optimization of a quality function and a spectral clustering of a deformation of the combinatorial Laplacian, the so-called magnetic Laplacian. The eigenfunctions of the magnetic Laplacian, which we call magnetic eigenmaps, incorporate structural information. Hence, using the magnetic eigenmaps, dense communities including directed cycles can be revealed as well as role communities in networks with a running flow, usually discovered thanks to mixture models. Furthermore, in the spirit of the Markov stability method, an approach for studying communities at different energy levels in the network is put forward, based on a quantum mechanical system at finite temperature.

## Data Visualization of Directed Networks

• Wednesday, July 20, 2016, 15:00 h. Aula C-105, Edif. C, EPS-UAM
• Dr. Angela Fernandez Pascual (Katholieke Universiteit Leuven)
• Data visualization is a crucial field for revealing information in a clear and efficient way, and a helpful tool for analyzing data. In this presentation, we will talk about a new method for directed graph visualization, called Magnetic Eigenmaps, which is based on the analysis of the Magnetic Laplacian, a complex deformation of the well-known combinatorial Laplacian. The main advantage of this method is that it is able to highlight, in a flexible way, groups present in the network according to the density of links and the directionality patterns of the graph, which are revealed through the study of the phases of the first magnetic eigenfunctions.
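
One common way to construct a magnetic Laplacian is sketched below. The sign convention, the parameter name `g` and the toy graph are assumptions made for illustration and may differ from the definitions used in the talk.

```python
import numpy as np

def magnetic_laplacian(W, g=0.25):
    """Magnetic Laplacian of a directed graph with adjacency matrix W.

    W[u, v] > 0 encodes an edge u -> v. The symmetrized weights carry the
    link density, and the complex phases carry the edge direction.
    """
    W = np.asarray(W, dtype=float)
    W_sym = 0.5 * (W + W.T)                # undirected weights
    flow = W - W.T                         # antisymmetric: +w on u -> v, -w on v -> u
    phase = np.exp(2j * np.pi * g * flow)  # Hermitian matrix of unit-modulus phases
    D = np.diag(W_sym.sum(axis=1))
    return D - W_sym * phase               # elementwise product

# Directed 3-cycle 0 -> 1 -> 2 -> 0.
W = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
L = magnetic_laplacian(W, g=1/3)
eigvals, eigvecs = np.linalg.eigh(L)       # real eigenvalues, complex eigenvectors
phases = np.angle(eigvecs[:, 0])           # phases of the first magnetic eigenvector
```

The matrix is Hermitian and positive semidefinite, so its eigenvalues are real and non-negative, while the phases of the leading eigenvectors encode position along the flow. Setting `g = 0` recovers the combinatorial Laplacian of the symmetrized graph.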

## Information based approaches for Bayesian Optimization

• June 2016. 12:00 h. Edif.B , B-351, EPS-UAM.
• Eduardo César Garrido Merchán (UAM)
• This presentation shows the basics of Bayesian Optimization and presents PESMOC (Predictive Entropy Search for Multi-Objective Bayesian Optimization with Constraints), an information-based approach to tackle Bayesian Optimization problems with multiple objectives and constraints.

## Doctoral course: Bayesian Optimization

• 16-21 December 2015, 11:00-13:00 h, LAB 16, 3rd fl., Bdg. A, EPS-UAM
• Lecturer: Dr. José Miguel Hernández Lobato (Harvard University)

## Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks

• Monday, 21 December 2015, 11:00-13:00 h
• Dr. José Miguel Hernández Lobato (Harvard University)
• Large multilayer neural networks trained with backpropagation have recently achieved state-of-the-art results in a wide range of problems. However, using backprop for neural net learning still has some disadvantages, e.g., having to tune a large number of hyperparameters to the data, lack of calibrated probabilistic predictions, and a tendency to overfit the training data. In principle, the Bayesian approach to learning neural networks does not have these problems. However, existing Bayesian techniques lack scalability to large dataset and network sizes. In this work we present a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP). Similar to classical backpropagation, PBP works by computing a forward propagation of probabilities through the network and then doing a backward computation of gradients. A series of experiments on ten real-world datasets show that PBP is significantly faster than other techniques, while offering competitive predictive abilities. Our experiments also show that PBP provides accurate estimates of the posterior variance on the network weights.

## Geometric Intuition and Algorithms for Eν-SVMs

• Jorge López Lázaro (GAA, Machine Learning Group, EPS_UAM)
• Friday, November 20, 2015, 11:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• In this work we address the Eν-SVM model proposed by Pérez-Cruz et al. as an extension of the traditional ν-SVMs. Through an enhancement of the range of admissible values for the regularization parameter ν, Eν-SVMs have been shown to be able to produce a wider variety of decision functions, giving rise to a better adaptability to the data. However, while a clear and intuitive geometric interpretation can be given for the ν-SVM model as a Nearest Point Problem in Reduced Convex Hulls (RCH-NPP), no previous work has been done to develop such intuition for the Eν-SVMs. In this paper we show how Eν-SVMs can be reformulated as a geometrical problem that generalizes RCH-NPP, providing new insights into this model. Under this novel point of view, we propose the RAPMINOS algorithm, able to solve Eν-SVMs more efficiently than the current methods. Furthermore, we show how RAPMINOS is able to address the Eν-SVM model for any choice of regularization norm l_p, p >= 1 seamlessly, which further extends Eν-SVM flexibility.

## Solving Constrained Lasso and Elastic Net Using ν-SVMs

• Alberto Torres Barrán (GAA, Machine Learning Group, EPS_UAM)
• Friday, November 6, 2015, 11:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Many important linear sparse models have at their core the Lasso problem, for which the GLMNet algorithm is often considered the current state of the art. Recently M. Jaggi has observed that Constrained Lasso (CL) can be reduced to an SVM-like problem, for which the LIBSVM library provides very efficient algorithms. This suggests that it could also be used advantageously to solve CL. In this work we will refine Jaggi's arguments to reduce CL as well as constrained Elastic Net to a Nearest Point Problem, which in turn can be rewritten as an appropriate ν-SVM problem solvable by LIBSVM. We will also show experimentally that the well-known LIBSVM library results in a faster convergence than GLMNet for small problems and also, if properly adapted, for larger ones. Screening is another ingredient to speed up solving Lasso. Shrinking can be seen as the simpler alternative of SVM to screening and we will discuss how it also may in some cases reduce the cost of an SVM-based CL solution.

## Deep Learning Tools

• David Díaz (GAA, Machine Learning Group, EPS_UAM)
• Friday, October 30, 2015, 11:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Deep Learning is one of the current hot topics in Machine Learning, but developing your own tool from scratch is a daunting task. We will review the most popular frameworks so you can choose the one that best covers your needs while letting you take advantage of every FLOP modern scientific computing hardware can offer.

## A Fisher consistent multiclass loss function with variable margin on positive examples

• Irene Rodríguez (GAA, Machine Learning Group, EPS_UAM)
• Friday, October 23, 2015, 11:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• The concept of pointwise Fisher consistency (or classification calibration) states necessary and sufficient conditions to have Bayes consistency when a classifier minimizes a surrogate loss function instead of the 0-1 loss. We present a family of multiclass hinge loss functions defined by a continuous control parameter λ representing the margin of the positive points of a given class. The parameter λ allows shifting from classification uncalibrated to classification calibrated loss functions. Though previous results suggest that increasing the margin of positive points has positive effects on the classification model, other approaches have failed to give increasing weight to the positive examples without losing the classification calibration property. Our λ-based loss function can give unlimited weight to the positive examples without breaking the classification calibration property. Moreover, when embedding these loss functions into the Support Vector Machine's framework (λ-SVM), the parameter λ defines different regions for the Karush−Kuhn−Tucker conditions. A large margin on positive points also facilitates faster convergence of the Sequential Minimal Optimization algorithm, leading to lower training times than other classification calibrated methods. λ-SVM allows easy implementation, and its practical use in different datasets not only supports our theoretical analysis, but also provides good classification performance and fast training times.

## Homogeneity and independence tests based on RKHS embeddings II

• Alberto Suárez (GAA, Machine Learning Group, EPS_UAM)
• Friday, October 16, 2015, 11:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• In this seminar we analyze a class of statistical tests based on embeddings of probability distributions in Reproducing Kernel Hilbert Spaces. These types of tests will be related to another family of powerful homogeneity tests based on the concept of energy distance between distributions.

## Homogeneity and independence tests based on RKHS embeddings I

• Alberto Suárez (GAA, Machine Learning Group, EPS_UAM)
• Friday, October 9, 2015, 10:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• In this seminar we analyze a class of statistical tests based on embeddings of probability distributions in Reproducing Kernel Hilbert Spaces. These types of tests will be related to another family of powerful homogeneity tests based on the concept of energy distance between distributions.

## Shaping Social Activity by Incentivizing Users

• Manuel Gómez Rodríguez (Max Planck Institute for Software Systems)
• Thursday, November 20, 2014, 12:00 h, Sala de Grados, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Events in an online social network can be categorized roughly into endogenous events, where users just respond to the actions of their neighbors within the network, or exogenous events, where users take actions due to drives external to the network. How much external drive should be provided to each user, such that the network activity can be steered towards a target state? In this paper, we model social events using multivariate Hawkes processes, which can capture both endogenous and exogenous event intensities, and derive a time dependent linear relation between the intensity of exogenous events and the overall network activity. Exploiting this connection, we develop a convex optimization framework for determining the required level of external drive in order for the network to reach a desired activity level. We experimented with event data gathered from Twitter, and show that our method can steer the activity of the network more accurately than alternatives.
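
The split between exogenous and endogenous intensity can be illustrated with a univariate Hawkes process. The exponential kernel, the parameter names and the event times below are assumptions for illustration; the talk uses multivariate processes, where each user has an intensity of this kind coupled to the events of neighbouring users.

```python
import numpy as np

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity of a univariate Hawkes process with an exponential kernel.

    lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))
    mu is the exogenous (external) rate; the sum is the endogenous part
    triggered by past events.
    """
    past = np.asarray(event_times, dtype=float)
    past = past[past < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

events = [1.0, 1.5, 4.0]
lam = hawkes_intensity(2.0, events, mu=0.2, alpha=0.8, beta=1.0)
```

In this picture, providing external drive to a user corresponds to raising `mu`, and the resulting activity also includes the extra events that this drive triggers through the endogenous term.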

## Three reasons why control is hard: learning, planning and representing

• Bert Kappen (Radboud University Nijmegen)
• Wednesday, February 26, 2014, 11:00 h, Sala de Grados, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Intelligent systems, whether natural or artificial, must act in a world that is highly unpredictable. It is intuitively clear that an optimal approach to decision making or planning under such circumstances must take these uncertainties into account. However, the optimal control solution is intractable to compute in general and, in addition, is hard to represent, due to the non-trivial state dependence of the optimal control. This has prevented large scale application of stochastic optimal control theory so far. The path integral control theory describes a class of control problems whose solution can be computed as an inference computation in a graphical model and thus provides an integrated Bayesian viewpoint. In this talk, I will show how the theory naturally arises in the context of information constraints and the large deviation principle, using an argument originally due to Erwin Schrödinger. I will then present a new result that shows how a feed-back control law can be computed efficiently within the path integral framework for continuous stochastic control problems.

## Practical Implications of Classification Calibration

• Irene Rodríguez Luján (Biocircuits Institute. University of California San Diego)
• Monday, January 13, 2014, 12:00 h, Sala de Grados, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• The importance of classification calibration lies in its relationship with Bayes consistency. Most research in the literature has analyzed classification calibration under a mathematical framework that implicitly assumes that the decision functions can be separately defined for each point. This framework therefore generally overlooks the practical consequences of using a specific classification model based on these loss functions. The goal of this seminar is to give a practical point of view of the classification calibration concept by trying to address the following questions: Is it possible to have a continuous family of loss functions in such a way that we can easily control their classification calibration properties? Are the classification calibration requirements feasible when such a set of functions is used with a parametric classifier? In other words, does the classifier inherit the classification calibration properties of the loss function? We propose a continuous family of loss functions defined by a control parameter that allows us to shift the loss function from classification uncalibrated to calibrated. We characterize the decision functions that make a loss function classification calibrated to determine whether these decision functions are achievable in parametric classifiers. As an example, we embed this continuous family of loss functions in a multiclass Support Vector Machine (SVM) to analyze the SVM's solutions as a function of the control parameter, obtaining as a byproduct a new classification model when the control parameter tends to infinity. Our experiments on multiclass problems show similar classification accuracy for classification calibrated and uncalibrated loss functions, and they point to the classifier with the control parameter tending to infinity as a promising model in terms of classification accuracy and training time.

## Training nested functions using auxiliary coordinates

• Miguel Á. Carreira-Perpiñán (University of California, Merced)
• Wednesday, January 8, 2014, 12:00 h, Sala de Grados, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Many models in machine learning, computer vision or speech processing have the form of a sequence of nested, parameterized functions, such as a multilayer neural net, an object recognition pipeline, or a "wrapper" for feature selection. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable (so computing gradients with the chain rule does not apply), and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations. If time permits, I will illustrate how to use MAC to derive training algorithms for a range of problems, such as deep nets, best-subset feature selection, joint dictionary and classifier learning, supervised dimensionality reduction, and others. This is joint work with Weiran Wang.

## Revealing multi-scale data patterns with hierarchical diffusion maps modeling

• Tuesday, September 17, 2013, 16:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Neta Rabin (Afeka Tel-Aviv Academic College of Engineering)
• Analysis of high-dimensional data-driven processes is an important task, which appears in many domains and applications. A central challenge is how to represent and fuse the large amounts of data, which are gathered from different sensors and are characterized by patterns at several scales. A common way of representing such high-dimensional data is via the application of dimensionality reduction techniques, for example dynamic Principal Component Analysis. When the high-dimensional data includes non-linear structures, such methods fail to find a faithful low-dimensional representation. In this talk, we propose a framework that is based on manifold learning techniques in order to find a low-dimensional representation that uses the geometric structure of the gathered data at several scales. We present an additional method for extending the nonlinear model to new data points. The proposed methods are demonstrated on several different dynamical systems, such as a transaction-based system, and on seismic discrimination.

## An Introduction to Sum Product Networks

• Friday, April 5, 2013, 15:00-17:00 h, Sala de Grados A, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• José Miguel Hernández-Lobato (University of Cambridge)
• Sum product networks (SPNs) are a new family of deep probabilistic models in which exact inference is tractable. SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. An SPN is an arithmetic circuit which under some conditions (completeness and consistency) represents the partition function and all marginals of some graphical model. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also more general. Discriminative and generative learning of SPNs can be efficiently implemented using hard EM and hard gradient descent. These methods avoid the problem of gradient diffusion in deep architectures and allow us to effectively work with SPNs of more than 30 layers of hidden variables. Several experiments show that SPNs have state-of-the-art performance on different image completion and classification tasks, outperforming alternative deep and shallow methods.
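
A toy SPN over two binary variables shows why marginals come out of a single bottom-up pass. The structure, weights and leaf probabilities below are made up for illustration.

```python
def leaf(p_true, value):
    """Bernoulli leaf: value is 1, 0, or None (variable marginalized out)."""
    if value is None:
        return 1.0          # both indicators set to 1: p + (1 - p)
    return p_true if value == 1 else 1.0 - p_true

def spn(x1=None, x2=None):
    """Root sum node over two product nodes, each a product of two leaves."""
    prod_a = leaf(0.9, x1) * leaf(0.2, x2)   # component A
    prod_b = leaf(0.1, x1) * leaf(0.7, x2)   # component B
    return 0.6 * prod_a + 0.4 * prod_b       # weighted sum (weights sum to 1)

joint = spn(x1=1, x2=0)        # P(X1=1, X2=0)
marginal = spn(x1=1)           # P(X1=1), with X2 summed out in the same pass
partition = spn()              # normalization constant
```

Marginalizing a variable only changes the values fed to the leaves, so every query costs one evaluation of the network regardless of how many variables are summed out.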

## Learning by motor babbling: a stochastic optimal control approach

• Wednesday, April 3, 2013, 12:00 h, Sala de Grados A, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Bert Kappen (University of Nijmegen)
• In this talk, we address the problem of how one can control a plant without knowing a model of the plant. We show how for a certain class of control problems the solution is provided by a sampling procedure. The statistics obtained in this way are sufficient to compute the optimal control. We apply the method to a robotics task and a helicopter coordination task.

## Stochastic Variational Inference for Modeling Binary Matrices

• Tuesday, 2 April 2013, 12:00 h, Sala de Grados A, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• José Miguel Hernández-Lobato (University of Cambridge)
• Stochastic variational inference (SVI) is a recent method for performing approximate Bayesian inference in massive datasets using stochastic optimization techniques. In this talk I will give a brief introduction to the general SVI method and then show how it can be applied to probabilistic models that are used for describing binary matrices. In these models, the sampling strategy used by the SVI method turns out to have a significant impact on its convergence speed. Our results show that an SVI method with a non-uniform sampling distribution usually outperforms the standard SVI method and alternative batch methods.

## NetBox: A Probabilistic Method for Modeling and Analyzing Market Basket Data

• Wednesday, 9 January 2013, 12:00 h, B405, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• José Miguel Hernández-Lobato (University of Cambridge)
• We propose a technique for extracting meaningful patterns from market basket data (e.g. clicks or purchases). Association rule mining is a popular method for this task. However, when applied to benchmark datasets, this method usually a) generates a large number of rules which are difficult to interpret, b) requires a non-trivial selection of the support and confidence parameters and c) can be outperformed by other methods when used for making predictions. To address these difficulties we present NetBox, a probabilistic method for modeling implicit feedback data. Instead of rules, NetBox generates a network of items in which related items are connected to each other. These networks are visually attractive and easy to interpret. NetBox follows a Bayesian approach and does not require the user to specify any hyper-parameter value. Finally, several experiments show that NetBox generates very accurate predictions, obtaining results which are better than or competitive with those of alternative state-of-the-art methods at a much lower computational cost.

## Partial Implications and Their Application to Data Mining

• Monday, 10 December 2012, 16:00 h, Sala de Grados, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• José Luis Balcázar (Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya)

## Applications of Algorithms for Time Series Classification

• Ramón Huerta (University of California, San Diego)
• Many problems today involve multidimensional sequences with temporal structure. The task is to determine the cause, origin or source of the temporal fluctuations observed in multidimensional sensors. Typically, the time series are decomposed into a set of features that are then used as inputs to classical classification algorithms, such as artificial neural networks or support vector machines. Although these algorithms achieve more than acceptable performance in identifying and discriminating signals, the results can clearly be improved by taking into account that time series have temporal structure. What does temporal structure mean? That two measurements close in time are highly correlated. Curiously, this basic property of time series has been ignored in Artificial Intelligence and Statistics because of two dominant assumptions: the stationarity of the signal and, to a lesser extent, the statistical independence of the measurements. The aim of this seminar is to explain different methodologies currently used for the classification and discrimination of time series, and in particular to show how dynamical systems can be used together with support vector machines to solve time series classification problems.

## Feature Selection and Applications to Genomic Data Analysis

• Pierre Dupont (Université Catholique de Louvain)
• This doctoral course studies feature selection methods with a special focus on supervised classification of high-dimensional data. Feature selection consists in finding, among a set of input variables or covariates, the most relevant ones for a prediction task or, simply, for summarizing a dataset to its key features. Feature selection aims at controlling the curse of dimensionality, interpreting a predictive model and reducing its computational complexity. The course covers the methodology of feature selection and practical applications to the analysis of genomic data. Basic notions of probability and statistics are assumed. Familiarity with standard machine learning algorithms for supervised classification (SVMs, kNN, Decision Trees and Random Forests, ...) would be helpful. No prior knowledge of molecular biology is required.
• Monday June 11, 15:00 - 18:00
  • Introduction to feature selection
  • Motivating examples in text classification and genomic data analysis
  • Filter methods: correlation, t-test, information theoretic measures, multivariate filter
  • Performance metrics: accuracy, balanced classification rates, stability indices
• Tuesday June 12, 15:00 - 18:00
  • Wrapper methods
  • Embedded methods
    • feature selection embedded into linear models
    • ensemble methods
    • sparsity-inducing regularization
• Wednesday June 13, 15:00 - 17:00
  • Unsupervised feature selection
  • Semi-supervised feature selection
  • Partially supervised feature selection
  • Transfer learning for feature selection

## Stable L1-Norm Regularization for High-Dimensional Feature Selection

• Pierre Dupont (Université Catholique de Louvain)
• Feature selection aims at finding the most important variables or input features for a given prediction task. Such a selection improves the interpretability of classification or regression models, while reducing their computational cost when predicting from new observations. It also offers a way to control the risk of over-fitting in high-dimensional settings, where the number of original features is typically orders of magnitude larger than the number of observations. In this talk, we first review the popular LASSO approach relying on an L1-norm regularization. Such a regularization performs an automatic feature selection while estimating (generalized) linear models by driving most model coefficients towards zero. While beneficial for reducing the dimensionality of the prediction problem, we stress that models estimated with a LASSO penalty are also known to be highly unstable. In other words, small data perturbations may imply drastic changes in the subset of automatically selected features. Next we discuss common alternatives to the original LASSO. Elastic Net tends to offer more stable but also less sparse models, while Group LASSO requires a priori defined groups of features to constrain the learning procedure. Ensemble methods relying on a bootstrap mechanism also offer more stable models but sometimes with a reduced interpretability. We further present Trace LASSO, which has been recently introduced. Its regularization can be viewed as a specific combination of L1 and L2 norms depending on the observed correlations between features. The key benefit is that Trace LASSO mixes both norms in an adaptive way, depending on the design matrix, rather than considering an additional meta-parameter as in Elastic Net. The core of the talk presents the SPO method, a novel learning algorithm to stabilize LASSO-type models. This algorithm is a proximal optimization method which iteratively seeks a solution in a neighborhood rescaled according to the variances of the predictor variables. Classification experiments conducted on several microarray datasets show the benefits of the SPO method, both in terms of stability and predictive performance, as compared to the original LASSO, Elastic Net and Trace LASSO.
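
The way an L1 penalty drives coefficients exactly to zero can be seen in a small sketch of the plain LASSO for least-squares regression, solved by cyclic coordinate descent with soft-thresholding. This is not the SPO method from the talk. The data, the penalty value and the function names are assumptions chosen so that the result is easy to check by hand.

```python
def soft_threshold(value, lam):
    """Proximal operator of lam * |w|: shrink toward zero, clip small values to zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=50):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w


# Toy design with orthogonal columns; y = 2 * feature0 + 0.1 * feature1.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]

w_unpenalized = lasso_coordinate_descent(X, y, lam=0.0)  # close to [2.0, 0.1, 0.0]
w_lasso = lasso_coordinate_descent(X, y, lam=0.5)        # weak feature is dropped
```

With `lam=0.5` the weak second feature is removed entirely and the strong first coefficient is shrunk by the penalty, which is the automatic selection effect described in the abstract.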

## An introduction to Deep Learning

• 18 May 2012, 16:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Álvaro Barbero (Instituto de Ingeniería del Conocimiento)
• In recent years a growing interest has arisen within the machine learning community regarding the topic of deep architectures. In its most basic form, a deep model is a neural network with several layers of hidden units. While such models have been known since the 80's, practical experience says that the use of deep architectures does not provide significant benefits over networks with a single hidden layer. However, recent advances in the field have shown that deep networks can indeed outperform shallow networks if proper learning algorithms are used. In this talk I will introduce the challenging problem of learning in deep networks, which cannot be effectively solved through classic neural network training methods such as backpropagation. After identifying the causes of this hardness, I will present two popular methods to conduct the learning of deep structures, Restricted Boltzmann Machines and Sparse Autoencoders, together with some optimization techniques suited to this kind of network. The applicability of these techniques to more complex tasks such as semi-supervised learning, denoising and missing value imputation will also be discussed.

## Hierarchical Linear Support Vector Machine

• 4 May 2012, 16:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• The increasing size and dimensionality of real-world datasets make it necessary to design algorithms that are efficient not only in the training process but also in the prediction phase. In applications such as credit card fraud detection, the classifier needs to predict an event in 10 milliseconds at most. In these environments the constraints on prediction speed heavily outweigh the training costs. We propose a new classification method, called a Hierarchical Linear Support Vector Machine (H-LSVM), based on the construction of an oblique decision tree in which the node split is obtained as a Linear Support Vector Machine. Although other methods have been proposed to break the data space down into subregions to speed up Support Vector Machines, the H-LSVM algorithm represents a very simple model that is efficient in training and, above all, in prediction for large-scale datasets. Only a few hyperplanes need to be evaluated in the prediction step, no kernel computation is required and the tree structure makes parallelization possible. In experiments with medium and large datasets, the H-LSVM reduces the prediction cost considerably while achieving classification results closer to those of the non-linear SVM than to those of the linear SVM.
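
The prediction step described above can be sketched as a walk down a tree whose internal nodes each hold one hyperplane. The sketch below is only an illustration of that idea: the `Node` class, the `predict` function and the hand-set hyperplanes are assumptions, and no SVM training is performed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Internal nodes hold a hyperplane (w, b); leaves hold only a label."""
    w: Optional[Sequence[float]] = None
    b: float = 0.0
    left: Optional["Node"] = None    # followed when w.x + b < 0
    right: Optional["Node"] = None   # followed when w.x + b >= 0
    label: Optional[int] = None


def predict(node, x):
    """Return (label, number of hyperplanes evaluated) for one sample."""
    evaluated = 0
    while node.label is None:
        score = sum(wi * xi for wi, xi in zip(node.w, x)) + node.b
        evaluated += 1
        node = node.right if score >= 0 else node.left
    return node.label, evaluated


def leaf(label):
    return Node(label=label)


# Hand-set (not trained) hyperplanes that separate the XOR pattern,
# which a single linear classifier cannot do.
tree = Node(
    w=(1.0, 0.0), b=-0.5,
    left=Node(w=(0.0, 1.0), b=-0.5, left=leaf(-1), right=leaf(+1)),
    right=Node(w=(0.0, 1.0), b=-0.5, left=leaf(+1), right=leaf(-1)),
)

results = {x: predict(tree, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Each prediction costs one dot product per level of the tree, here two, and no kernel evaluations.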

## Revealing structures in the output space with Output Kernel Learning

• 20 April 2012, 16:00 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• Francesco Dinuzzo (Max Planck Institute for Intelligent Systems)
• Machine learning problems with multiple and structured outputs can be solved effectively only if the relationships between the output components are properly modeled. The framework of kernel methods allows one to embed prior knowledge about the relationships between the different output components by designing suitable kernel functions on both the input and the output set. However, in many cases the available prior knowledge is not sufficient to design a good kernel in advance. In this talk, we discuss the possibility of simultaneously learning a vector-valued function and a kernel for the output space by solving suitable non-convex optimization problems.

## Optimization for large scale regularization problems

• Francesco Dinuzzo (Max Planck Institute for Intelligent Systems)
• Regularization techniques find their application in a variety of domains, ranging from machine learning and signal processing to inverse problems and dynamic system identification. A great deal of effort has been devoted in recent years to making them practical tools for solving large-scale problems. In this lecture, we will discuss two methodologies for tackling large-scale regularization problems. The first is a class of operator splitting techniques based on the theory of monotone operators (which we will briefly review). The second is based on coordinate descent optimization which, despite its simplicity, turns out to be quite effective in a variety of circumstances. We will discuss the derivation, convergence properties, and some practical implementation details.

## Analysis and convergence of SMO-like decomposition and geometrical algorithms for support vector machines

• Jorge López Lázaro (Instituto de Ingeniería del Conocimiento)
• Support Vector Machines (SVMs) constitute one of the most successful paradigms in Machine Learning nowadays. Their success stems from the fact that they are relatively simple models, with excellent generalization properties for classification and regression, that arise from solving convex optimization problems. In addition, the models are interpretable in terms of the so-called Support Vectors, which are the points that influence the final models. These models have the form of a hyperplane, so SVMs are a variety of linear models. Even though linear models are in principle of limited use, the power behind SVMs comes from the use of kernel functions, which effectively build non-linear models by arriving at a linear model in a projected feature space. The success of SVMs is also indicated by the formulation over the years of numerous variations building on them. Among these, ν-SVMs and Least Squares SVMs (LS-SVMs) have been especially relevant. A parallel line of research investigates the geometrical formulation of SVMs as Nearest Point Problems. Although the optimization problems giving rise to SVMs have a simple structure, solving them efficiently is not trivial. The main problem comes from the size of the kernel matrix, which is the square of the number of training patterns. This precludes the use of standard optimization routines, and requires the conception of ad-hoc methods. Perhaps the simplest method of all is Sequential Minimal Optimization (SMO). Despite its simplicity, some variations of the original algorithm, jointly termed "SMO-like" methods, can be considered the state of the art in SVM training. In this thesis, after theoretically motivating SVMs and the SMO algorithm, we formulate a general problem that encompasses all the specific formulations enumerated above. The SMO algorithm can be adapted to this general problem after minor changes, and the adapted version includes as particular cases the SMO variants for the different formulations. Moreover, we give a new and simple proof of the convergence of this general SMO version to the optimal solution.

## Mining the patterns and profiles of human mobility

• Dr. Fosca Giannotti, Director of Research at the Information Science and Technology Institute of the National Research Council, ISTI-CNR, Pisa, Italy.
• The wireless networks that surround us, as a by-product of their normal operations, allow for sensing and collecting massive repositories of spatio-temporal data, such as the call detail records from mobile phones and the GPS tracks from car navigation devices, which represent society-wide proxies of human mobile activities. These big mobility data provide an unprecedentedly powerful social microscope, which helps us understand human mobility and discover the hidden patterns and profiles that characterize the trajectories we follow during our daily activity. We illustrate the basic methods of mobility data mining, designed to extract from the big mobility data the patterns of collective movement behavior, i.e., discover the subgroups of travelers characterized by a common purpose, and the profiles of individual movement activity, i.e., characterize the routine mobility of each traveler. We also present how mobility data mining can be combined with complex network analysis to address fascinating new questions, such as how to discover the geographical borders that emerge from the network of flows between any two zones in a territory, and how to measure to what extent the mobility patterns shape and impact the social networks we inhabit.

## The variational garrote

• 12 January 2012, 11 h, B-351, Escuela Politécnica Superior, Universidad Autónoma de Madrid
• H.J. Kappen. Radboud University, Nijmegen, the Netherlands.
• In this talk, I present a new model and solution method for sparse regression. The model introduces binary selector variables $s_i$ for the features $i$ in a way that is similar to the original garrote model. The posterior probability for $s_i$ is computed in the variational approximation. I refer to this method as the Variational Garrote (VG). The VG is compared numerically with the Lasso method and with ridge regression. Numerical results on synthetic data show that the VG yields more accurate predictions and more accurately reconstructs the true model than the other methods. The naive implementation of the VG requires the inversion of a modified covariance matrix, which scales cubically in the number of features. We indicate how, for sparse problems, the solution can be computed in time linear in the number of features.

## Causal discovery: beyond faithfulness

• Dr. Joris Mooij. Machine Learning Group, Intelligent Systems. Institute for Computing and Information Sciences (iCIS). Radboud University. Nijmegen
• Discovering causal relationships from purely observational data was considered to be impossible for a long time. Nevertheless, during the last decades, it was shown that under relatively weak assumptions, causal relationships can be deduced from conditional (in)dependences between observed variables. Even though these methods are fairly general, they have important limitations. Indeed, in many cases they do not give a unique causal model that explains the observed data, but instead they give a whole set of possible causal models (even asymptotically, as the number of samples tends to infinity). This limitation is most obvious in the bivariate case: even excluding the possibility of hidden common causes, using only conditional independences between observed variables, one cannot distinguish between X causing Y and Y causing X. Another more practical limitation is that reliably testing for conditional independence requires many samples. In recent years, several new methods for causal discovery from observational data have been developed. In this talk I will present some of these novel methods, explain the underlying assumptions they make, discuss how these are different from the usual assumptions (in particular, the faithfulness assumption), and show the potential of these novel methods.

• Monday, 19th September 2011, 11:30, Salón de Grados, EPS-UAM.
• Classification algorithms built with boosting-type cooperative techniques have shown excellent performance and a remarkable, though not absolute, resistance to the harmful tendency to overfit. Although this resistance has been attributed to the form of the cost functions used in the original versions of these algorithms, a more plausible explanation is that their progressive construction and the weak character of the learners that make up the designs are the real causes of that advantage. Overfitting can nevertheless appear in problems with a significant number of samples that are impossible to classify correctly. To reduce that tendency, a large number of modifications have been proposed to moderate the influence of such samples on the progressive emphasis used to build successive learners. Another approach is possible, however: since the impossible samples receive excessive attention, a fusion other than the linear one could limit the negative effects of that excess. This talk will examine the possibility of incorporating a gate into the fusion of the learners of a Real AdaBoost ensemble, discussing the conditions under which doing so is appropriate and evaluating the results of several options in a series of traditional applications. After the conclusions of the work are presented, possible lines of improvement and extension will be reviewed.
• Slides
• Paper

## Thesis Defense: Efficient Optimization Methods for Regularized Learning: Support Vector Machines and Total-Variation Regularization.

• Friday, 08th July 2011, 12:00, Salón de Grados, EPS-UAM.
• In the context of machine learning methods, regularization has become an established practice to control overfitting in the modeling process and to induce structure into the resultant models. At the same time, the flexibility of the regularization framework has provided a common point of view embracing classical and established learning models, as well as recent proposals in the topic. This richness comes from its appealing simplicity, which casts the learning process into a composite optimization problem formed by a loss function and a regularizer; different models are obtained through the selection of appropriate loss and regularizer functions. This elegant modularity, however, comes at a cost, as an adequate optimization algorithm must be applied or devised in order to solve the resultant problem. While general-purpose solvers are directly applicable out of the box in some settings, they usually produce poor results in terms of efficiency and scalability. Further, in more complex models featuring non-smooth or even non-convex loss or regularizer functions, such approaches easily become inapplicable. Consequently, the design of adequate optimization methods becomes a key task for the success of a regularized learning process. In this thesis two particular cases of regularization are studied in depth. On the one hand, the well-established and successful Support Vector Machine model is presented in its different forms. A careful look at the current algorithmic solutions to this problem shows that correcting hidden deficiencies and making better use of the gathered information can lead to significant improvements in running times, surpassing state-of-the-art methods. On the other hand, a class of sparsity-inducing regularizers known as Total Variation is studied, with wide application in the fields of signal and image processing. While a variety of approaches have been applied to solve this class of problems, it is shown here that by taking advantage of their strong structural properties and adapting suitable optimization algorithms, relevant improvements in efficiency and scalability can be obtained as well. Software implementing the developed methods is also made available as part of this thesis.

## Adaptive Learning in a World of Projections.

• Monday, 27th June 2011, 12:00-13:30, Salón de Grados, EPS-UAM.
• Sergios Theodoridis (Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens, Greece).
• The task of parameter/function estimation has been at the center of scientific attention for a long time and it comes under different names such as filtering, prediction, beamforming, classification, regression. Conventionally, the task has been treated as the optimization of an appropriately adopted loss function. However, in most cases, the choice of the loss function is mainly dictated by its mathematical tractability and not by physical reasoning related to the specific problem at hand. The task is further complicated when a priori information, in the form of constraints, becomes available. The presence of constraints in estimation tasks has recently been gaining in importance, due to the revival of interest in robust learning schemes. In this talk, the estimation task is treated in the context of set theoretic estimation arguments. Instead of a single optimal point, we are searching for a set of solutions that are in agreement with the available information, which is provided to us in the form of a set of training points and a set of constraints. The goal of this talk is to present a general tool for parameter/function estimation, both for classification and for regression tasks, in a time-adaptive setting in (infinite dimensional) Reproducing Kernel Hilbert spaces (RKHS). The general framework is that of convex set theory via the powerful and elegant tool of projections.
• Slides

## Neural Networks with Gate-Generated Functional Weights (GG-FWNN).

• Friday, 17th June 2011, 12:00, Salón de Grados, EPS-UAM.
• Aníbal R. Figueiras Vidal (CU, DTSC, Universidad Carlos III de Madrid).
• Mixtures of Experts (MoE) are a family of learning-machine ensembles of great conceptual interest, but they are laborious to train and offer limited performance on classification problems. A simple rearrangement of the analytical expressions of MoE and a suitable choice of gate architecture (of the Radial Basis Function Network type) lead to schemes of the single-layer perceptron type with functional weights, which can be trained directly with maximum-margin algorithms and whose performance is demonstrably competitive. We call the corresponding designs GG-FWNN ("Gate Generated Functional Weights Neural Networks"). In this talk the basic GG-FWNN model is derived, and the performance of several simple versions is examined in comparison with SVMs and Real AdaBoost ensembles. The talk will conclude by presenting the R&D lines arising from this work that are currently being explored.
• Slides

## Pre-Thesis Seminar: Efficient Optimization Methods for Regularized Learning: Support Vector Machines and Total-Variation Regularization.

• Friday, 29th April 2011, 12:00, Salón de Grados, EPS-UAM.
• In the context of machine learning methods, regularization has become an established practice to control overfitting in the modeling process and to induce structure into the resultant models. At the same time, the flexibility of the regularization framework has provided a common point of view embracing classical and established learning models, as well as recent proposals in the topic. This richness comes from its appealing simplicity, which casts the learning process into a composite optimization problem formed by a loss function and a regularizer; different models are obtained through the selection of appropriate loss and regularizer functions. This elegant modularity, however, comes at a cost, as an adequate optimization algorithm must be applied or devised in order to solve the resultant problem. While general-purpose solvers are directly applicable out of the box in some settings, they usually produce poor results in terms of efficiency and scalability. Further, in more complex models featuring non-smooth or even non-convex loss or regularizer functions, such approaches easily become inapplicable. Consequently, the design of adequate optimization methods becomes a key task for the success of a regularized learning process. In this thesis two particular cases of regularization are studied in depth. On the one hand, the well-established and successful Support Vector Machine model is presented in its different forms. A careful look at the current algorithmic solutions to this problem shows that correcting hidden deficiencies and making better use of the gathered information can lead to significant improvements in running times, surpassing state-of-the-art methods. On the other hand, a class of sparsity-inducing regularizers known as Total Variation is studied, with wide application in the fields of signal and image processing. While a variety of approaches have been applied to solve this class of problems, it is shown here that by taking advantage of their strong structural properties and adapting suitable optimization algorithms, relevant improvements in efficiency and scalability can be obtained as well. Software implementing the developed methods is also made available as part of this thesis.

## Least Squares-Support Vector Machines in Supervised and Unsupervised Learning.

• Tuesday, 12th April 2011, 11:30-13:30, Sala de Grados, EPS-UAM.
• Johan Suykens (K.U. Leuven Belgium, ESAT-SCD).
• Methods of support vector machines and kernel-based learning have been successful in a wide range of applications, especially for problems with high-dimensional inputs. Different methodologies have emerged with the use of optimization-based settings, estimation in reproducing kernel Hilbert spaces and probabilistic approaches. In this tutorial we first explain some basics of support vector machines. Next we discuss how a wide range of problems in supervised and unsupervised learning can be understood in terms of least-squares support vector machines. The emphasis in this case is on models that possess primal and (Lagrange) dual model representations, using feature maps and positive definite kernels, respectively. In the last part we focus on new directions in unsupervised learning, in particular kernel spectral clustering. We will discuss new approaches for making out-of-sample extensions and new model selection procedures, with the possibility to incorporate prior knowledge and handle large data sets.
• Slides
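
A rough sketch of the dual representation mentioned above: in the standard LS-SVM classifier, training reduces to solving one linear system in the bias `b` and the dual variables `alpha`. The RBF kernel, the value of `gamma`, the toy data and the small Gaussian-elimination helper are assumptions made for this example and are not taken from the tutorial.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lssvm_fit(xs, ys, gamma=10.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(xs)
    A = [[0.0] + [float(y) for y in ys]]
    for i in range(n):
        row = [float(ys[i])]
        for j in range(n):
            omega = ys[i] * ys[j] * rbf(xs[i], xs[j])
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    sol = solve(A, [0.0] + [1.0] * n)
    return sol[0], sol[1:]  # bias b, dual variables alpha


def lssvm_predict(x, xs, ys, b, alpha):
    score = sum(a * y * rbf(x, xi) for a, y, xi in zip(alpha, ys, xs)) + b
    return 1 if score >= 0 else -1


xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
b, alpha = lssvm_fit(xs, ys)
predictions = [lssvm_predict(x, xs, ys, b, alpha) for x in xs]
```

Unlike the standard SVM, no quadratic program is needed here. The price is that, in general, every training point receives a non-zero `alpha`, so the solution is not sparse.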

## Introductory Lecture on Large-Scale Optimization for Machine Learning.

• Monday, 04th April 2011, 11:30-13:30, Aula 5, EPS-UAM.
• Suvrit Sra (Max Planck Institute for Intelligent Systems, Tübingen, Germany).
• Modern fields such as bioinformatics, computational statistics, and machine learning owe a large part of their success to the mature discipline of optimization. Although optimization techniques proposed twenty or more years ago continue to be widely used and refined, the sheer complexity, size, and variety of models encountered in modern problems is forcing us to question, if not abandon, existing assumptions and techniques. A key realization that has emerged is that sophisticated algorithms must be replaced by simpler ones, even at the cost of weaker theoretical guarantees. This viewpoint suggests that it is valuable for computer science students to have at least an introductory acquaintance with the techniques of large-scale optimization. In my lecture, I will introduce the basic techniques, highlighting their strengths and mentioning their limitations. The topics will range from gradient methods, their complexity analysis, and accelerated (optimal) gradient methods to online, incremental and stochastic methods and regularized nonsmooth optimization. Throughout, I will mention examples from ML to make the presentation concrete. Time permitting, I will briefly discuss higher-order (Newton-type) methods, which can sometimes also prove to be highly effective without giving up scalability.

## Diffusion Maps.

• Monday, 28th March 2011, 16:00, B-351, EPS-UAM.
• Dimensionality reduction and clustering are important issues in machine learning. Classical methods do not always provide good results because they do not take into account the explicit form of the differential manifold structure on which our data probably lie. The objective of this seminar is to present new methods in these disciplines, specifically Diffusion Maps algorithms. These techniques allow us to create an embedding based on the local geometric information of the original data, as the well-known Spectral Clustering algorithms do. Moreover, Diffusion Maps offer a solid theoretical justification and versatile results.
• Slides
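
The embedding idea described in the abstract can be sketched as follows. This is a minimal version that assumes a Gaussian kernel with a hand-picked bandwidth `epsilon` and plain row normalization; the function name, parameters, and toy data are hypothetical and not taken from the seminar.

```python
import numpy as np

def diffusion_map(X, epsilon, n_components=2, t=1):
    """Return the leading nontrivial eigenvalues and diffusion coordinates."""
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-d2 / epsilon)                 # Gaussian affinities
    deg = K.sum(axis=1)
    # S = D^{-1/2} K D^{-1/2} is symmetric and has the same eigenvalues
    # as the Markov matrix P = D^{-1} K.
    S = K / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(deg)[:, None]        # right eigenvectors of P
    # Drop the trivial pair (eigenvalue 1, constant eigenvector).
    lam = vals[1:n_components + 1]
    return lam, psi[:, 1:n_components + 1] * lam ** t

# Toy data (an assumption): two blobs whose centres are 4 units apart.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.3, size=(30, 2))
blob_b = rng.normal(0.0, 0.3, size=(30, 2)) + np.array([4.0, 0.0])
lam, coords = diffusion_map(np.vstack([blob_a, blob_b]), epsilon=1.0)
```

In this toy setting the first diffusion coordinate takes opposite signs on the two blobs, which is the link to spectral clustering mentioned above.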

## Applying Robust Optimization to Binary Classification.

• Thursday, 10th March 2011, 16:00, Sala de Grados, EPS-UAM.
• Akiko Takeda (Keio University, Japan).
• Robust optimization is one of the approaches for handling optimization problems defined by uncertain inputs. Some robust optimization approaches have been applied to binary classification, where the training samples are regarded as uncertain inputs. In this talk I will introduce a robust optimization methodology and refer to the work of Xu et al. (2009), which shows that robust optimization plays the same role as regularization. Finally, I will show that machine learning algorithms can be unified into a general framework with the use of robust optimization techniques.

## On the Equivalence of Kernel Fisher Discriminant Analysis and Kernel Quadratic Programming Feature Selection.

• Friday, 25th February 2011, 16:00, B-351, EPS-UAM.
• We reformulate the Quadratic Programming Feature Selection (QPFS) method in a kernel space to obtain a vector which maximizes the quadratic objective function of QPFS. We demonstrate that the vector obtained by Kernel Quadratic Programming Feature Selection is equivalent to the Kernel Fisher vector and, therefore, a new interpretation of the Kernel Fisher Discriminant Analysis is given which provides some computational advantages for highly unbalanced datasets.
• Slides

## New Hybrid Monte Carlo Methods for Efficient Sampling: from Physics to Biology and Statistics.

• Monday, 17th January 2011, 16:00, B-351, EPS-UAM.
• Elena Akhmatskaya.
• A class of novel hybrid methods for detailed simulations of large complex systems in physics, biology, materials science and statistics is introduced. These generalized shadow Hybrid Monte Carlo (GSHMC) methods combine the advantages of stochastic and deterministic simulation techniques. They utilize a partial momentum update to retain some of the dynamical information, employ modified Hamiltonians to overcome exponential performance degradation with the system's size and make use of the multi-scale nature of complex systems. Variants of GSHMC were developed for atomistic simulation, particle simulation and statistics: GSHMC (thermodynamically consistent implementation of constant-temperature molecular dynamics), MTS-GSHMC (multiple-time-stepping GSHMC), meso-GSHMC (Metropolis corrected dissipative particle dynamics (DPD) method), and a generalized shadow Hamiltonian Monte Carlo, GSHmMC (a GSHMC for statistical simulations). All of these are compatible with other enhanced sampling techniques and suitable for massively parallel computing, allowing for a range of multi-level parallel strategies. A brief description of the GSHMC approach, examples of its application on high performance computers and a comparison with other existing techniques are given. Our approach is shown to resolve such problems as resonance instabilities of the MTS methods and non-preservation of thermodynamic equilibrium properties in DPD, and to outperform known methods in sampling efficiency by an order of magnitude.

## Biomarker Selection from Microarray Data: a Transfer Learning Approach.

CANCELLED

• Friday, 17th December 2010, 16:30, B-351, EPS-UAM.
• Pierre Dupont.
• Classification of microarray data is a challenging problem as it typically relies on a few tens of samples but several thousand dimensions (genes). Feature selection techniques are commonly used in this context, both to increase the interpretability of the predictive model and possibly to reduce its cost. Feature selection aims at finding a small subset of the original covariates that best predicts the outcome. In the case of clinical studies, the selected genes are considered to be biomarkers forming a signature of a patient's status or expected response to a treatment. A good signature is also ideally stable with respect to sampling variation, under the assumption that the biological process modeled is (mostly) common across patients. We focus here on embedded methods for which a multivariate feature selection is performed jointly with the classifier estimation. We study in particular regularized (or penalized) linear models, such as extensions to linear support vector machines (SVM) or variants of the LASSO, since they offer state-of-the-art predictive performance for high-dimensional and sparse data. In this context, we describe two original contributions. Firstly, some prior knowledge may be available to bias the selection towards some genes a priori assumed to be more relevant. We present a novel optimization algorithm to make use of such a partial supervision as a soft constraint. A practical approximation of this technique reduces to standard SVM learning with iterative rescaling of the inputs. The scaling factors depend on the prior knowledge but the final selection may depart from it if necessary to optimize the classification objective. Secondly, we show how to adapt the above algorithm in a transfer learning setting: a preliminary selection is performed on one or several source dataset(s) and is subsequently used to bias the selection on a target dataset.
This is particularly relevant for microarray data, for which each individual dataset is typically very small but a rapidly growing collection of related datasets is produced and made publicly available. Experimental results illustrate that both approaches improve the stability and classification performance of the resulting models. We conclude this talk by sketching some open issues, both from a theoretical and a practical viewpoint.

## Time's Arrow in Time Series.

• Friday, 26th November 2010, 16:00, B-351, EPS-UAM.
• Alberto Suárez.
• We conjecture that the distribution of the time-reversed residuals of a causal linear process is closer to a Gaussian than the distribution of the noise used to generate the process in the forward direction. This property is demonstrated for causal AR(1) processes assuming that all the cumulants of the distribution of the noise are defined. The analysis is readily extended to models that can be represented as causal vector AR processes. Based on this observation, it is possible to design a decision rule for detecting the direction of time series that can be modeled as linear processes: The true direction of the time series is identified as the one in which the residuals of a linear fit are less Gaussian.
• Slides
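
The decision rule described in the abstract can be tried on simulated data. The sketch below makes several assumptions that are not stated in the talk: the noise is uniform, the AR coefficient is 0.5, and non-Gaussianity is measured by the absolute excess kurtosis of the residuals. All function names are hypothetical.

```python
import numpy as np

def excess_kurtosis(r):
    """Sample excess kurtosis (zero for a Gaussian)."""
    r = r - r.mean()
    return np.mean(r ** 4) / np.mean(r ** 2) ** 2 - 3.0

def ar1_residuals(x):
    """Residuals of a least-squares AR(1) fit x[t] = phi * x[t-1] + e[t]."""
    x = x - x.mean()
    past, present = x[:-1], x[1:]
    phi = (past @ present) / (past @ past)
    return present - phi * past

def guess_direction(x):
    """Pick the direction whose residuals look less Gaussian."""
    fwd = abs(excess_kurtosis(ar1_residuals(x)))
    bwd = abs(excess_kurtosis(ar1_residuals(x[::-1])))
    return "forward" if fwd > bwd else "backward"

def simulate_ar1(phi, noise):
    """Generate a causal AR(1) series driven by the given noise."""
    x = np.zeros_like(noise)
    for t in range(1, len(noise)):
        x[t] = phi * x[t - 1] + noise[t]
    return x

rng = np.random.default_rng(0)
series = simulate_ar1(0.5, rng.uniform(-1.0, 1.0, size=20000))
print(guess_direction(series), guess_direction(series[::-1]))
```

For this setup the residuals of the forward fit recover the uniform noise, while the residuals of the time-reversed fit mix many noise terms and are therefore closer to Gaussian.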

## Sparsifying LS-SVM Models via L0-Norm Minimization.

• Friday 12th November 2010, 16:00, B-351, EPS-UAM.
• Jorge López Lázaro.
• Least-Squares Support Vector Machines (LS-SVMs) have been successfully applied in many classification and regression problems as an alternative to the classical Support Vector Machine (SVM) formulations. The main drawback of LS-SVMs with respect to these is the lack of sparseness of the final models. Thus, a procedure to sparsify LS-SVM models is a frequent desideratum. In this talk, I will explain how we can adapt to the LS-SVM case a very recent work by Huang et al. for sparsifying classical SVM classifiers, which is based on an iterative approximation to the L0-norm of the vector of Lagrangian coefficients. Experiments on classification and regression on real datasets show that this adaptation achieves very sparse models, without any significant loss of accuracy compared to the LS-SVM models.
• Slides
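
The lack of sparseness mentioned in the abstract can be seen in a small example. The sketch below trains a standard LS-SVM classifier by solving its linear system and counts the nonzero Lagrange coefficients. It does not implement the L0-norm sparsification discussed in the talk; the RBF kernel, the hyperparameters `C` and `gamma_k`, and the toy data are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma_k):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma_k * np.maximum(d2, 0.0))

def lssvm_train(X, y, C=10.0, gamma_k=0.5):
    """Solve the LS-SVM system [[0, y^T], [y, Omega + I/C]] [b; alpha] = [0; 1]."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, gamma_k)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = y
    M[1:, 0] = y
    M[1:, 1:] = omega + np.eye(n) / C
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]                      # bias b, coefficients alpha

def lssvm_predict(X_train, y, alpha, b, X_new, gamma_k=0.5):
    """Sign of the kernel expansion sum_i alpha_i y_i K(x, x_i) + b."""
    return np.sign(rbf_kernel(X_new, X_train, gamma_k) @ (alpha * y) + b)

# Toy data (an assumption): two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)), rng.normal(2.0, 0.5, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

b, alpha = lssvm_train(X, y)
accuracy = np.mean(lssvm_predict(X, y, alpha, b, X) == y)
print(f"training accuracy: {accuracy:.2f}, nonzero alphas: {np.count_nonzero(alpha)} of {len(alpha)}")
```

Because each coefficient is proportional to the corresponding training error, essentially every training pattern ends up with a nonzero coefficient, which is what motivates a sparsification procedure.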

## Seminario Previo de Tesis: Balancing Flexibility and Robustness in Machine Learning: Semi-parametric Methods and Sparse Linear Models.

• Friday, 5th November 2010, 11:00, Sala de Grados, EPS-UAM.
• Jose Miguel Hernández Lobato.
• Machine learning problems can be addressed by a variety of methods that span a wide range of degrees of flexibility and robustness. In the process of building a model for data, flexibility and robustness are desirable but often conflicting goals. On one side of the spectrum, parametric methods are very robust, in the sense that they are resilient to noise and are not generally misled by spurious regularities, which may be present in the data only by accident. However, their expressive capacity is limited. On the other side, non-parametric methods are very flexible and can in principle learn arbitrarily complex patterns when sufficient amounts of data are available for induction. However, as a result of this high flexibility, they are also more prone to overfitting. In practice, selecting the optimal method to address a specific learning task involves attaining the appropriate balance between flexibility and robustness. There are some learning problems for which this balance cannot be attained using standard parametric or purely non-parametric approaches in isolation. Semi-parametric methods include both parametric and non-parametric components in the models assumed. The parametric part provides a robust description of some of the patterns in the data. The non-parametric component endows the model with the flexibility necessary to capture additional complex patterns. In this thesis, we analyze several problems in which semi-parametric methods provide accurate models for the data. The first one is the modeling of financial time series. The trends in these series are described by parametric models. The density of the innovations is directly learned from the data in a non-parametric manner. To improve the quality of the approximation, the estimation of the density of the innovations is performed in a transformed space, where the density of the transformed data is close to a Gaussian.
A second problem involves developing semi-parametric models to describe arbitrary non-linear dependencies between two random variables. Bivariate Archimedean copulas are re-parameterized in terms of a unidimensional latent function that can be readily approximated using a basis of natural cubic splines. These splines are especially well suited to model the asymptotic tail dependence of the data. In some learning problems even simple parametric methods are not sufficiently robust to provide accurate descriptions for the data. This investigation also addresses the specific question of how to improve the robustness of linear models by assuming sparsity in the model coefficients. In a Bayesian approach, sparsity can be favored by using specific priors, such as the spike and slab distribution. The advantage of the spike and slab prior is its superior selective shrinkage capacity: Some coefficients (those whose posterior has a large contribution from the spike) are forced to be small, while others (those in which the slab is the predominant contribution to the posterior) are not regularized. In this thesis, linear models with spike and slab priors are used to address problems with a high-dimensional feature space and small number of available training instances. Approximate inference is implemented using Expectation propagation (EP). For the sparse linear regression model, EP is a computationally efficient alternative to MCMC methods, which are asymptotically exact, but often require lengthy computations to converge. Another contribution is the design of a sparse Bayesian classifier for classification problems in which prior information about feature dependencies is available. Finally, a sparse linear model that makes use of a hierarchical spike and slab prior is applied to the problem of identifying regulatory genes from gene expression time series. 
The semi-parametric methods and the sparse linear models analyzed in this thesis represent configurations of flexibility and robustness that cannot be attained by either standard parametric methods or by fully non-parametric approaches alone. Therefore, the proposed methods fill in some of the gaps left by these standard learning paradigms in the flexibility-robustness spectrum.

## Accelerating SVM Training: beyond SMO.

• Friday, 29th October 2010, 16:00, B-351, EPS-UAM.
• Álvaro Barbero.
• In this talk I will present part of our ongoing work on improving the speed of Support Vector Machine training algorithms. I will analyze the structure of the optimization problem posed by SVMs, showing that advanced, well-understood optimization methods fail to solve the problem efficiently for medium to large sized datasets. In contrast, the current state-of-the-art algorithm (SMO) follows an approach focused on iterating low-cost steps, in which an update as simple as possible is performed. While this approach guarantees good running times, its simplicity results in a large number of iterations. We will see two proposed algorithms based on SMO - Cycle-Breaking and Momentum SMO - that try to reduce this number of iterations by introducing more complex updates, while at the same time keeping the computational cost at bay. We will conclude that, although these algorithms manage to obtain some improvements in training times, more work is needed to get significant results.
• Slides

## Path Integral Reinforcement Learning.

• Friday, 22nd October 2010, 15:00, Sala de Grados, EPS-UAM.
• Bert Kappen.
• Stochastic optimal control theory provides a principled answer to the problem of computing an optimal sequence of actions to reach a goal in the presence of uncertainty. The solution is based on dynamic programming and known as the Bellman equation. However, the actual computation is typically very costly and scales exponentially in the dimension of the problem. Recently, it was shown that a quite large class of non-linear control problems could be solved using an alternative approach based on a diffusion process. The optimal control can be represented as a path integral, an expectation over future trajectories. This solution can be computed much more efficiently, using MCMC. In this talk, I show how the path integral control formalism can be used to include learning, i.e., when the dynamics of the plant and the cost are not known. This situation occurs often in robotics applications. I will demonstrate path integral reinforcement learning on the well-known mountain car problem.

## Short Course on Control Theory and Dynamic Programming.

• Tuesday 19th October, 16:00-18:00 [2 hrs. lecture].
Thursday 21st October, 17:00-20:00 [3 hrs. lecture].
Monday 25th October, 18:00-20:00 [1 hr lecture + 1 hr. lab].
Tuesday 26th October, 16:00-18:00 [2 hrs. lab].
• Bert Kappen.
• The course provides an introduction to stochastic optimal control theory. The course is in part based on a tutorial given by me and Marc Toussaint at ICML 2008 and on some selected material from the book Dynamic programming and optimal control by Dimitri Bertsekas.

## A First Approach to Artificial Cognitive Control System Implementation Based on the Shared Circuits Model of Sociocognitive Capacities.

• Friday 22nd October 2010, 16:00, B-351, EPS-UAM.
• A. Sánchez Boza, R. Haber Guerra.
• A first approach for designing and implementing an artificial cognitive control system based on the shared circuits model is presented in this work. The shared circuits model approach of sociocognitive capacities recently proposed by Hurley [1] is enriched and improved in this work. A five-layer computational architecture for designing artificial cognitive control systems is proposed on the basis of a modified shared circuits model for emulating sociocognitive experiences such as imitation, deliberation, and mindreading. An artificial cognitive control system is applied to controlling force in a manufacturing process, which demonstrates the suitability of the suggested approach.

Keywords: artificial cognitive control; embodied cognition; imitation; internal model control; mirroring; shared circuits model

References: [1] S. Hurley, The Shared circuits model (SCM): How control, mirroring, and simulation can enable imitation, deliberation, and mindreading, Behavioural and Brain Science.

## Hub Gene Selection Methods for the Reconstruction of Transcription Networks.

• Wednesday 15th September 2009, 10:00, B-351, EPS-UAM.
• José Miguel Hernández Lobato.
• Transcription control networks have a scale-free topological structure: While most genes are involved in a reduced number of links, a few hubs or key regulators are connected to a significantly large number of nodes. Several methods have been developed for the reconstruction of these networks from gene expression data, e.g. ARACNE. However, few of them take into account the scale-free structure of transcription networks. In this paper, we focus on the hubs that commonly appear in scale-free networks. First, three feature selection methods are proposed for the identification of those genes that are likely to be hubs and second, we introduce an improvement in ARACNE so that this technique can take into account the list of hub genes generated by the feature selection methods. Experiments with synthetic gene expression data validate the accuracy of the feature selection methods in the task of identifying hub genes. When ARACNE is combined with the output of these methods, we achieve up to a 62% improvement in performance over the original reconstruction algorithm. Finally, the best method for identifying hub genes is validated on a set of expression profiles from yeast.

## Machine Learning Challenges in Ecological Science and Ecosystem Management.

• Thursday 17th June 2010, 11:00, Sala de Grados, EPS-UAM.
• Thomas G. Dietterich.
• Just as machine learning has played a huge role in genomics, there are many problems in ecological science and ecosystem management that could be transformed by machine learning. This talk will give an overview of several research projects at Oregon State University in this area and discuss the novel machine learning problems that arise. These include (a) automated data cleaning and anomaly detection in sensor data streams, (b) automated interpretation of images and video for field studies (including automated recognition of insects, automated discovery of new insect species, and automated modeling of insect behavior), (c) species distribution modeling including modeling of bird migration, and (d) design of optimal policies for managing wildfires in forest ecosystems. The machine learning challenges include flexible anomaly detection for multiple data streams, trainable high-precision object recognition systems, video activity recognition, inverse reinforcement learning, inverse stochastic game learning, and optimization of complex spatio-temporal Markov processes.

## Empirical Growth-Optimal Portfolio Selection.

• Monday 14th June 2010, 11:00, Sala de Grados, EPS-UAM.
• Dr. László Györfi.
• This talk is on sequential investment strategies for financial markets. Investment strategies are allowed to use information collected from the past of the market and determine, at the beginning of a trading period, a portfolio, that is, a way to distribute their current capital among the available assets. The goal of the investor is to maximize his wealth in the long run without knowing the underlying distribution generating the stock prices. Since accurate statistical modelling of stock market behavior is known to be a notoriously difficult problem, we take an extreme point of view and work with minimal assumptions on the distribution of the time series, i.e., we only assume that the daily price relatives form a stationary process. Under this assumption the asymptotic rate of growth (averaged yield) has a well-defined maximum, called log-optimum, which can be achieved in full knowledge of the distribution of the entire process.
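
To make the log-optimum concrete, the sketch below computes it for a toy market in which the distribution is fully known. The market is an assumption chosen for illustration (independent periods, one cash asset and one stock that doubles or halves with equal probability); the talk itself concerns the harder case where the distribution is unknown.

```python
import numpy as np

# Price relatives per scenario: columns are (cash, stock).
outcomes = np.array([[1.0, 2.0],
                     [1.0, 0.5]])
probs = np.array([0.5, 0.5])

def expected_log_growth(b_stock):
    """Expected log return of the portfolio (1 - b_stock, b_stock)."""
    b = np.array([1.0 - b_stock, b_stock])
    return probs @ np.log(outcomes @ b)

# Grid search over the fraction of capital held in the stock.
grid = np.linspace(0.0, 1.0, 1001)
growth = np.array([expected_log_growth(b) for b in grid])
b_star = grid[np.argmax(growth)]
print(f"log-optimal stock fraction: {b_star:.3f}, growth rate: {growth.max():.4f}")
```

Holding only cash or only the stock gives zero expected log growth in this toy market, while rebalancing to an even split every period gives a strictly positive rate.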

## Iterative Gaussianization Framework.

• Tuesday, 16th March 2010, 12:00, B-351, EPS-UAM.
• Gustavo Camps-Valls.
• We generalize a class of projection pursuit methods to transform arbitrary multidimensional data into multivariate Gaussian data, thus attaining statistical independence of its components. The factorization of the original probability density function (PDF) is very useful to tackle density estimation and unsupervised learning problems. The proposed analysis enables a number of novel ways to solve practical problems in high-dimensional scenarios, such as those encountered in image processing, speech recognition, array processing, or bioinformatics. When data come from a linear transformation of independent non-Gaussian sources, independent component analysis (ICA) methods can efficiently solve the factorization problem. However, when the transformation is non-linear, ICA methods are no longer useful. The general framework consists of the sequential application of a two-step processing unit: univariate marginal Gaussianization transforms followed by an orthogonal transform. This iterative scheme generalizes previous ICA and PCA based projection pursuit methods to include even random rotations. The relation to other methods, such as deep neural networks, is pointed out. The considered class of methods is shown to be invertible and differentiable for any rotation, while its convergence properties do depend on the selected rotation. The performance is successfully illustrated in a number of multidimensional data processing problems such as image synthesis, classification, denoising, and multi-information estimation.
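
The two-step processing unit described in the abstract can be sketched as follows. This version assumes a rank-based (empirical CDF) marginal Gaussianization and random orthogonal matrices obtained from a QR decomposition; both choices, and all function names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def marginal_gaussianize(X):
    """Map each column to standard normal quantiles via its empirical ranks."""
    n = X.shape[0]
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    u = (ranks + 0.5) / n                       # values strictly inside (0, 1)
    return np.vectorize(_STD_NORMAL.inv_cdf)(u)

def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def iterative_gaussianization(X, n_iter=5, seed=0):
    """Alternate marginal Gaussianization and a random orthogonal transform."""
    rng = np.random.default_rng(seed)
    Z = np.asarray(X, dtype=float)
    for _ in range(n_iter):
        Z = marginal_gaussianize(Z)
        Z = Z @ random_orthogonal(Z.shape[1], rng)
    return Z

# Toy data (an assumption): a skewed, non-Gaussian two-dimensional sample.
rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 2)) ** 3
Z = iterative_gaussianization(X)
print(Z.mean(axis=0), np.cov(Z.T))
```

Each pass makes every marginal Gaussian and then mixes the coordinates, so that dependence left between components is exposed to the next marginal step.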

## Computational Intelligence Techniques in Wind Forecast for Wind Farms and Related Problems.

• Thursday, 11th March 2010, 12:30, Sala de Grados, EPS-UAM.
• Sancho Salcedo Sanz.
• One of the European objectives is to reach a 20% share of renewable energy in the total supply by 2020, so wind energy development will continue in the coming years. In particular, it is expected that 16% of the electricity generated in Europe could come from wind. This estimate places wind energy as one of the principal actors in the energy mix of the countries that are betting on its development. The properties of wind entail important problems for its integration into the grid, which must be minimized to reach the proposed aims. In this scenario of high wind energy integration, wind speed plays an important role for producers and in grid management. This seminar focuses on describing a series of systems based on computational intelligence techniques for short-term wind speed forecasting in wind farms. In addition, unresolved problems will be discussed, such as emerging algorithms for optimizing wind farm location, mesoscale models for optimization techniques, and long-term wind power models in wind farms using unsupervised classification methods.

## Multi-Label Learning: Algorithms and Applications.

• Thursday, 21st January 2010, 16:00, Sala de Grados, EPS-UAM.
• Dr. Grigorios Tsoumakas.
• Supervised learning has traditionally focused on the analysis of single-label data, where training examples are associated with a single label from a set of disjoint labels. However, training examples in several application domains, such as text and web mining, semantic annotation of images and video, music categorization into genres and emotions, functional genomics and direct marketing, are often associated with a set of labels. Such data are characterized as multi-label. This talk will introduce the topic of learning from multi-label data, report motivating applications, present the main existing learning methods, discuss interesting current research challenges and present an open-source software library for using and further developing multi-label learning algorithms.

## KL Control Theory and Decision Making under Uncertainty.

• Wednesday, 20th January 2010, 16:00, Sala de Grados, EPS-UAM.
• Bert Kappen.
• KL control theory consists of a class of control problems for which the control computation can be solved as a graphical model inference problem. In this talk, we show how to apply this theory in the context of a delayed choice task and for collaborating agents. We first introduce the KL control framework. Then we show that in a delayed reward task when the future is uncertain it is optimal to delay the timing of your decision. We show preliminary results on human subjects that confirm this prediction. Subsequently, we discuss two-player games, such as the stag-hunt game, where collaboration can improve or worsen as a result of recursive reasoning about the opponent's actions. The Nash equilibria appear as local minima of the optimal cost-to-go, but may disappear when monetary gain decreases. This behaviour is in agreement with experimental findings in humans. We subsequently extend the setting to delayed rewards and show how cooperation develops as a result of recursive reasoning. Suboptimal cooperation arises as local minima of the objective function.

## First and Second Order SMO Algorithms for LS-SVM Training.

• Friday, 4th December 2009, 11:10, B-351, EPS-UAM.
• Jorge López.
• LS-SVM training for large datasets in classification and regression has traditionally been addressed with conjugate gradient algorithms. In this work, completing the study by Keerthi et al., we explore the applicability of the SMO algorithm for solving the LS-SVM problem, by comparing first order and second order working set selections. It turns out that, depending on the value of the hyperparameters for the problem at hand, one of the working set selections is nearly always more convenient than the other. In any case, whichever selection scheme is used, the number of kernel operations performed by SMO is shown to scale quadratically with the number of patterns.

## Pruning Boosting Ensembles.

• Friday, 30th October 2009, 11:10, B-351, EPS-UAM.
• Sergio García-Moratilla.
• A classifier is a system that discriminates among the instances in a dataset on the basis of their attributes and assigns them a class label. An ensemble is a set of classifiers whose individual decisions are combined to produce a final classification. Ensemble learning is an important area of research because ensembles are often more accurate than the individual classifiers that compose them. Specifically, if the classifiers of the ensemble are complementary, that is, if their errors are uncorrelated, then the generalization error of the ensemble is lower than the error of the base classifiers. Well-known ensemble methods are bagging and boosting. Bagging builds classifiers from bootstrap samples of the training dataset. Bagging is a parallel ensemble method. Given a training set, the classifiers built by bagging are generated independently. Boosting constructs classifiers by modifying the weights of the training instances, so that the new classifiers progressively focus on instances that are difficult to classify by previous members of the ensemble...

## Prediction Based on Averages over Automatically Induced Learners: Ensemble Methods and Bayesian Techniques.

• Friday, 23rd October 2009, 11:30, Sala de Grados, EPS-UAM.
• Daniel Hernández Lobato.
• Ensemble methods and Bayesian techniques are two supervised machine learning paradigms that can be useful to alleviate the problems derived from learning in the presence of limited data and noisy instances. However, the practical use of these two paradigms presents some complications. Ensemble methods often have large memory requirements and large prediction costs. Furthermore, in general it is difficult to estimate an adequate value for the ensemble size. On the other hand, a Bayesian approach requires the evaluation of very difficult integrals or summations that are often intractable. In practice, these computations are approximated by methods that can be computationally expensive. This thesis concentrates on improving the task of learning from labeled data when using ensemble methods or Bayesian techniques. For this purpose, we address the problems described before. In the first part of this thesis we analyze different ensemble pruning methods that reduce the memory requirements and prediction time of ensembles. We show that optimal subensembles defined in terms of the minimum training error can only be found in regression bagging ensembles of intermediate size. In ensembles of large size two approximate methods are proposed: ordered aggregation (OA) and SDP-pruning. Both OA and SDP-pruning select subensembles with better prediction accuracy than the complete ensemble. In classification ensembles we show that it is possible to make inference about the final ensemble prediction after querying only a few of the total classifiers of the ensemble. This is the basis of a novel ensemble pruning method called instance-based (IB) pruning. IB pruning reduces the ensemble prediction time up to eight times without significantly deteriorating ensemble performance. In this part of the thesis we also study a statistical procedure for setting an adequate size for the ensemble.
We show that the probabilistic framework of IB pruning can be used to infer the size of a classification ensemble so that, with a specified confidence level, the resulting ensemble predicts the same class label as an ensemble of infinite size. The second part of this thesis focuses on improving the computational cost of Bayesian techniques. In particular, we introduce novel applications of the expectation propagation (EP) algorithm. These applications reduce the use of more computationally expensive methods like Markov chain Monte Carlo or type-II maximum likelihood estimation. We show that EP can be used in a Bayesian model called the Bayes machine (BM) to approximate the posterior distribution of a parameter that quantifies the level of noise in the class labels. When EP is used to compute the approximation, the BM does not require any re-training to estimate this parameter. Additionally, the training cost of the BM can be reduced by using a sparse representation for the model. This representation is found by a greedy algorithm whose performance is improved by considering extra refining iterations. Finally, we show that the EP algorithm can be used to approximate the posterior distribution of a Bayesian model for microarray data classification. EP significantly reduces the training cost of this model and is also useful to identify relevant genes for subsequent analysis.

## Detection of Unusual Objects and Temporal Patterns in EEG Video Recordings.

• Friday, 9th October 2009, 11:00, B-428, EPS-UAM.
• In this paper we show that by using a modification of our previously developed probabilistic method for finding the most unusual part of a 3D digital image, we can detect the temporal intervals and areas of interest in the signals/video and mark the corresponding objects that behave in an unusual way. Due to the different dynamics along the temporal and the spatial axes, namely the prevalence of the cylinder-like objects in the video and the pseudo-periodic slowly changing spectral characteristics of the bio-electrical signals, an additional step is needed to treat the temporal axis. One of the possible practical applications of the method can be in hospital Intensive Care Units (ICU), where EEG video recording is a standard practice to ensure that a potentially life-threatening event can be detected even if its indications are present only in a fraction of the observed signals.

## Implicit Wiener Series Analysis of Epileptic Seizure Recordings.

• Friday, 2nd October 2009, 11:00, B-351, EPS-UAM.
• Álvaro Barbero
• Implicit Wiener series are a powerful tool to build Volterra representations of time series with any degree of non-linearity. A natural question is then whether higher order representations yield more useful models. In this work we shall study this question for ECoG data channel relationships in epileptic seizure recordings, considering whether quadratic representations yield more accurate classifiers than linear ones. To do so we first show how to derive statistical information on the Volterra coefficient distribution and how to construct seizure classification patterns over that information. As our results illustrate, a quadratic model seems to provide no advantages over a linear one. Nevertheless, we shall also show that the interpretability of the implicit Wiener series provides insights into the inter-channel relationships of the recordings.

## Short Term Wind Power Forecasting and Applications.

• Friday, 25th September 2009, 12:00h, Sala de Grados, EPS-UAM.
• Julio Usaola García.
• This seminar was divided into two parts. The first gave an introduction to short-term wind power forecasting, built around the concrete characteristics of a forecasting application in actual use. The second described in detail two short-term wind forecasting applications: one focused on participation in wind power markets, and the other on preventing congestion in the electrical grid.

## Expectation Propagation for Microarray Data Classification.

• Tuesday, 18th August 2009, 17:00, B-351, EPS-UAM.
• Daniel Hernández Lobato.
• Microarray experiments are a very promising tool for disease treatment and early diagnosis. However, the datasets obtained in these experiments typically have a rather small number of instances and a large number of covariates, most of which are irrelevant for discrimination. These features usually lead to instabilities in microarray classification algorithms. Bayesian methods can be useful to overcome this problem because they compute probability distributions for the model coefficients rather than point estimates. However, exact Bayesian inference is often infeasible and hence, some form of approximation has to be made. In this talk we propose a Bayesian model for microarray data classification based on a prior distribution that enforces sparsity in the model coefficients. Expectation Propagation (EP) is then used to perform approximate inference as an alternative to more computationally demanding methods like Markov Chain Monte Carlo (MCMC) sampling. The proposed model is evaluated on fifteen microarray datasets and compared with other popular classification algorithms. These experiments show that the model trained with EP performs well on the datasets investigated and is also useful for identifying relevant genes for subsequent analysis.

## Modeling Dependence in Financial Data with Semiparametric Archimedean Copulas.

• Wednesday, 15th July 2009, 15:00, B-351, EPS-UAM.
• José Miguel Hernández Lobato.
• Copulas are useful tools for the construction of multivariate models because they allow univariate marginals to be linked into a joint model with an arbitrary dependence structure. While non-parametric copula models can have poor generalization performance, standard parametric copulas often lack the expressive capacity to capture the dependencies present in financial data. In this work, we propose a novel semiparametric bivariate Archimedean copula model that is expressed in terms of a latent function. This latent function is approximated using a basis of natural splines and the model parameters are selected by maximum penalized likelihood. Experiments on financial data are used to evaluate the accuracy of the proposed estimator with respect to other benchmark methods. The proposed semiparametric copula model has excellent in-sample and out-of-sample performance, which makes it a useful tool for modeling multivariate financial data.
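
As background, the standard definition of a bivariate Archimedean copula is sketched below. The generator symbol $\varphi$ and the parameter $\theta$ are notation assumed here; the abstract does not say how its latent function relates to the generator, so this shows only the general family and not the proposed model. The Clayton copula is included as one familiar parametric member.

```latex
% Bivariate Archimedean copula with generator \varphi, where
% \varphi : [0,1] \to [0,\infty] is continuous, strictly decreasing,
% convex, and satisfies \varphi(1) = 0.
C(u, v) = \varphi^{-1}\bigl(\varphi(u) + \varphi(v)\bigr), \qquad u, v \in [0, 1]

% Example: Clayton generator with \theta > 0.
\varphi_\theta(t) = \frac{t^{-\theta} - 1}{\theta}
\quad\Longrightarrow\quad
C_\theta(u, v) = \bigl(u^{-\theta} + v^{-\theta} - 1\bigr)^{-1/\theta}
```

A one-parameter family such as Clayton fixes the shape of the generator in advance, which is the limited expressive capacity that the abstract attributes to standard parametric copulas.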

## Short Term Forecasting Models for Spanish Electric Market Prices.

• Tuesday, 26th May 2009, 12:00, Sala de Grados, EPS-UAM.
• Antonio Muñoz San Roque.
• This seminar presented a comparative study of models for short-term forecasting of hourly prices in the Spanish electricity market. First, it reviewed the properties of the price series and the main features that are involved in the training process and that could be considered exogenous variables. Then the models used in the study were presented: univariate ones, such as ARIMA or Holt-Winters models, and multivariate ones, such as transfer function, VAR or periodic models. Finally, the experimental results were analyzed and future developments were outlined.

## Detecting the Most Unusual Part of Two and Three-dimensional Digital Images.

• Friday, 9th January 2009, 11:00, B-351, EPS-UAM.
• The purpose of this paper is to introduce an algorithm that can detect the most unusual part of a digital image in a probabilistic setting. The most unusual part of a given shape is defined as the part of the image that has the maximal distance to all non-intersecting parts of the same shape. The method is tested on two and three-dimensional images and has shown very good results without any predefined model. A version of the method that is independent of the contrast of the image is also considered and is found to be useful for finding the most unusual part (and the most similar part) of the image conditioned on a given image. The results can be used to scan large image databases, for example medical databases.
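
The definition above can be illustrated with a brute-force sketch. This is not the probabilistic algorithm of the paper: it assumes square patches, plain Euclidean distance on pixel values, and it reads "maximal distance to all non-intersecting shapes" as "largest distance to the nearest non-overlapping patch". The function name `most_unusual_patch` and its parameters are hypothetical.

```python
import numpy as np

def most_unusual_patch(image, size):
    """Brute-force illustration (assumed interpretation, not the paper's method).

    Returns ((row, col), score), where (row, col) is the top-left corner of the
    size x size patch whose nearest non-overlapping patch is farthest away.
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    corners = [(r, c)
               for r in range(rows - size + 1)
               for c in range(cols - size + 1)]
    # One flattened row of pixel values per candidate patch.
    patches = np.array([image[r:r + size, c:c + size].ravel()
                        for r, c in corners])

    best_corner, best_score = None, -np.inf
    for i, (r, c) in enumerate(corners):
        # Two patches do not intersect when they are at least `size`
        # apart along the rows or along the columns.
        disjoint = np.array([abs(r - r2) >= size or abs(c - c2) >= size
                             for r2, c2 in corners])
        if not disjoint.any():
            continue
        dists = np.linalg.norm(patches[disjoint] - patches[i], axis=1)
        score = dists.min()  # distance to the most similar disjoint patch
        if score > best_score:
            best_corner, best_score = (r, c), score
    return best_corner, best_score

# Toy example: a flat image with a single bright 2x2 blob.
img = np.zeros((8, 8))
img[3:5, 3:5] = 1.0
corner, score = most_unusual_patch(img, 2)
print(corner, score)
```

The search compares every pair of patches, so its cost grows quadratically with the number of patches; it is meant only to make the definition concrete on small images.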

## Optimal control as a graphical model inference problem.

• Monday, 22nd December 2008, 11:00, B-351, EPS-UAM.
• H. J. Kappen.
• Stochastic optimal control theory deals with the problem of computing an optimal set of actions to attain some future goal. Examples are found in many contexts such as motor control tasks for robotics, planning and scheduling tasks or managing a financial portfolio. The computation of the optimal control is typically very difficult due to the size of the state space and the stochastic nature of the problem. We introduce a class of stochastic optimal control problems that can be mapped onto a probabilistic inference problem. This duality between control and inference is well-known. The novel aspect of the present formulation is that the optimal solution is given by the minimum of a free energy. We can thus apply principled approximations such as the mean field method or belief propagation to obtain efficient approximations. We will illustrate the method on the task of stacking blocks.

## Method for the Design and Coordination of Agents in Multi-objective Process Control Applications.

• Friday, 14th November 2008, 12:00, B-428, EPS-UAM.
• Diego Martín Andrés.
• No abstract available.

## Sparse Bayes Machines for Binary Classification.

• Friday, 31st October 2008, 11:00, B-351, EPS-UAM.
• Daniel Hernández Lobato.
• In this talk we propose a sparse representation for the Bayes Machine based on the approach followed by the Informative Vector Machine (IVM). However, some extra modifications are included to guarantee a better approximation to the posterior distribution. That is, we introduce additional refining stages over the set of active patterns included in the model. These refining stages can be thought of as a back-fitting algorithm that tries to fix some of the mistakes that result from the greedy approach followed by the IVM. Experimental comparison of the proposed method with a full Bayes Machine and a Support Vector Machine seems to confirm that the method is competitive with these two techniques. Statistical tests are also carried out to support these results.

## Shape-based Cluster Validation.

• Friday, 17th October 2008, 11:00, B-351, EPS-UAM.
• Luis Fernando Lago.