NUEVAS APROXIMACIONES PARA EL APRENDIZAJE MULTIOBJETIVO DE MODELOS DE CLASIFICACION SUPERVISADA Y PARA LA SINTESIS DE CONOCIMIENTO EN MODELOS DE ANALISIS DE DECISIONES

TIN2007-62626

Nombre agencia financiadora Ministerio de Educación y Ciencia
Acrónimo agencia financiadora MEC
Programa Otro
Subprograma Otro
Convocatoria Proyectos de Investigación
Año convocatoria 2007
Unidad de gestión Subdirección General de Proyectos de Investigación
Centro beneficiario UNIVERSIDAD POLITECNICA DE MADRID
Centro realización FACULTAD DE INFORMATICA
Identificador persistente No disponible

NUEVAS APROXIMACIONES PARA EL APRENDIZAJE MULTIOBJETIVO DE MODELOS DE CLASIFICACION SUPERVISADA Y PARA LA SINTESIS DE CONOCIMIENTO EN MODELOS DE ANALISIS DE DECISIONES

TIN2007-62626

Nombre agencia financiadora Ministerio de Educación y Ciencia
Acrónimo agencia financiadora MEC
Programa Otro
Subprograma Otro
Convocatoria Proyectos de Investigación
Año convocatoria 2007
Unidad de gestión Subdirección General de Proyectos de Investigación
Centro beneficiario UNIVERSIDAD PÚBLICA DE NAVARRA (UPNA)
Centro realización UNIVERSIDAD PUBLICA DE NAVARRA
Identificador persistente No disponible

Publicaciones

Resultados totales (Incluyendo duplicados): 17
Encontrada(s) 1 página(s)

Mateda-2.0: estimation of distribution algorithms in MATLAB

Academica-e. Repositorio Institucional de la Universidad Pública de Navarra
  • Larrañaga, Pedro
  • Santana, Roberto
  • Bielza, Concha
  • Lozano, José Antonio
  • Echegoyen Arruti, Carlos
  • Mendiburu, Alexander
  • Armañanzas, Rubén
  • Shakya, Siddartha
This paper describes Mateda-2.0, a MATLAB package for estimation of distribution algorithms (EDAs). This package can be used to solve single and multi-objective discrete and continuous optimization problems using EDAs based on undirected and directed probabilistic graphical models. The implementation contains several methods commonly employed by EDAs. It is also conceived as an open package to allow users to incorporate different combinations of selection, learning, sampling, and local search procedures. Additionally, it includes methods to extract, process and visualize the structures learned by the probabilistic models. This way, it can unveil previously unknown information about the optimization problem domain. Mateda-2.0 also incorporates a module for creating and validating function models based on the probabilistic models learned by EDAs., This work has been partially supported by the Saiotek and Research Groups 2007-2012 (IT-242-07) programs (Basque Government), TIN2008-06815-C02-01, TIN2008-06815-C02-02, TIN2007-62626 and Consolider Ingenio 2010 - CSD2007-00018 projects (Spanish Ministry of Science and Innovation), the CajalBlueBrain project, and the COMBIOMED network in computational biomedicine (Carlos III Health Institute).




Optimizing brain networks topologies using multi-objective evolutionary computation

Archivo Digital UPM
  • Santana, Roberto
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
The analysis of brain network topological features has served to better understand these networks and reveal particular characteristics of their functional behavior. The distribution of brain network motifs is particularly useful for detecting and describing differences between brain networks and random and computationally optimized artificial networks. In this paper we use a multi-objective evolutionary optimization approach to generate optimized artificial networks that have a number of topological features resembling brain networks. The Pareto set approximation of the optimized networks is used to extract network descriptors that are compared to brain and random network descriptors. To analyze the networks, the clustering coefficient, the average path length, the modularity and the betweenness centrality are computed. We argue that the topological complexity of a brain network can be estimated using the number of evaluations needed by an optimization algorithm to output artificial networks of similar complexity. For the analyzed network examples, our results indicate that while original brain networks have a reduced structural motif number and a high functional motif number, they are not optimal with respect to these two topological features. We also investigate the correlation between the structural and functional motif numbers, the average path length and the clustering coefficient in random, optimized and brain networks.




Classifying evolving data streams with partially labeled data

Archivo Digital UPM
  • Borchani, Hanen
  • Larrañaga Múgica, Pedro María
  • Bielza Lozoya, María Concepción
Recently, several approaches have been proposed to deal with the increasingly challenging task of mining concept drifting data streams. However, most are based on supervised classification algorithms assuming that true labels are immediately and entirely available in the data streams. Unfortunately, such an assumption is often violated in real-world applications given that it is expensive or because it takes a long time to obtain all true labels. To deal with this problem, we propose in this paper a new semi-supervised approach for handling concept-drifting data streams containing both labeled and unlabeled instances. First, contrary to existing approaches, we monitor three possible kinds of drift: feature, conditional or dual drift. Drift detection is based on a hypothesis test comparing Kullback-Leibler divergence between old and recent data, whose distribution under the null hypothesis of coming from the same distribution is approximated via a bootstrap method. Then, if any drift occurs, anew classifier is learned from the recent data using the EM algorithm; otherwise, the current classifier is left unchanged. Our approach is so general that it can be applied to different classification models. Experimental studies, using the naive Bayes classifier and logistic regression, on both synthetic and real-world data sets demonstrate that our approach performs well.




Regularized logistic regression without a penalty term: an application to cancer classification with microarray data

Archivo Digital UPM
  • Bielza Lozoya, María Concepción
  • Robles Forcada, Víctor
  • Larrañaga Múgica, Pedro María
Regularized logistic regression is a useful classification method for problems with few samples and a huge number of variables. This regression needs to determine the regularization term, which amounts to searching for the optimal penalty parameter and the norm of the regression coefficient vector. This paper presents a new regularized logistic regression method based on the evolution of the regression coefficients using estimation of distribution algorithms. The main novelty is that it avoids the determination of the regularization term. The chosen simulation method of new coefficients at each step of the evolutionary process guarantees their shrinkage as an intrinsic regularization. Experimental results comparing the behavior of the proposed method with Lasso and ridge logistic regression in three cancer classification problems with microarray data are shown.




Optimal row and column ordering to improve table interpretation using estimation of distribution algorithms

Archivo Digital UPM
  • Bengoetxea, Endika
  • Larrañaga Múgica, Pedro María
  • Bielza Lozoya, María Concepción
  • Fernández del Pozo de Salamanca, Juan Antonio
A common information representation task in research as well as educational and statistical practice is to comprehensively and intuitively express data in two-dimensional tables. Examples include tables in scientific papers, as well as reports and the popular press.
Data is often simple enough for users to reorder. In many other cases though, there are complex data patterns that make finding the best re-arrangement of rows and columns for optimum readability a tough problem.
We propose that row and column ordering should be regarded as a combinatorial optimization problem and solved using evolutionary computation techniques. The use of genetic algorithms has already been proposed in the literature. This paper proposes for the first time the use of estimation of distribution algorithms for table ordering. We also propose alternative ways of representing the problem in order to reduce its dimensionality. By learning a selective naive Bayes classifier, we can find out how to jointly combine the parameters of these algorithms to get good table orderings. Experimental examples in this paper are on 2D tables.




Learning factorizations in estimation of distribution algorithms using affinity propagation

Archivo Digital UPM
  • Santana, Roberto
  • Larrañaga Múgica, Pedro María
  • Lozano Alonso, José Antonio
Estimation of distribution algorithms (EDAs) that use marginal product model factorization shave been widely applied to a broad range of mainly binary optimization problems. In this paper, we introduce the affinity propagation EDA (AffEDA) which learns a marginal product model by clustering a matrix of mutual information learned from the data using a very efficient message-passing algorithm known as affinity propagation. The introduced algorithm is tested on a set of binary and nonbinary decomposable functions and using a hard combinatorial class of problem known as theHP protein model. The results show that the algorithm is a very efficient alternative to other EDAs that use marginal product model factorizations such as the extended compact genetic algorithm (ECGA) and improves the quality of the results achieved by ECGA when the cardinality of the variables is increased.




Mateda-2.0:a MATLAB package for the implementation and analysis of estimation of distribution algorithms

Archivo Digital UPM
  • Santana, Roberto
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
  • Lozano Alonso, José Antonio
  • Echegoyen Urruti, Carlos
  • Mendiburu Alberro, Alexander
  • Armañanzas Arnedillo, Ruben
  • Shakya, Siddartha
This paper describes Mateda-2.0, a MATLAB package for estimation of distribution algorithms (EDAs). This package can be used to solve single and multi-objective discrete and continuous optimization problems using EDAs based on undirected and directed probabilistic graphical models. The implementation contains several methods commonly employed by EDAs. It is also conceived as an open package to allow users to incorpórate different combinations of selection, learning, sampling, and local search procedures. Additionally, it includes methods to extract, process and visualize the structures learned by the probabilistic models. This way, it can unveil previously unknown information about the optimization problem domain. Mateda-2.0 also incorporates a module for creating and validating function models based on the probabilistic models learned by EDAs.




Multidimensional statistical analysis of the parameterization of a genetic algorithm for the optimal ordering of tables

Archivo Digital UPM
  • Bielza Lozoya, María Concepción
  • Fernández del Pozo de Salamanca, Juan Antonio
  • Larrañaga Múgica, Pedro María
  • Bengoetxea, Endika
The optimal table row and column ordering can reveal useful patterns to improve reading and interpretation. Recently, genetic algorithms using standard crossover and mutation operators have been proposed to tackle this problem. In this paper, we carry out an experimental study that adds to this genetic algorithm crossover and mutation operators specially designed to deal with permutations and includes other parameters (initialization, replacement policy, mutation and crossover rates and stopping criteria) not examined in previous works. A proper analysis of the results must take into account all the parameters simultaneously, since the wrong conclusions can be drawn by studying each separately from the others. This is why we propose a framework for a multidimensional analysis of the results. This includes multiple hypothesis testing and a regression tree that builds a parsimonious and predictive model of the suitable configurations of the parameters.




Predicting citation count of Bioinformatics papers within four years of publication

Archivo Digital UPM
  • Ibáñez Martín, Alfonso
  • Larrañaga Múgica, Pedro María
  • Bielza Lozoya, María Concepción
Motivation: Nowadays, publishers of scientific journals face the tough task of selecting high-quality articles that will attract as many readers as possible from a pool of articles. This is due to the growth of scientific output and literature. The possibility of a journal having a tool capable of predicting the citation count of an article within the first few years after publication would pave the way for new assessment systems.

Results: This article presents a new approach based on buildin several prediction models for the Bioinformatics journal. These models predict the citation count of an article within 4 years after publication (global models). To build these models, tokens found in the abstracts of Bioinformatics papers have been used as predictive features, along with other features like the journal sections and 2-week post-publication periods. To improve the accuracy of the global models, specific models have been built for each Bioinformatics journal section (Data and Text Mining, Databases and Ontologies, Gene Expression, Genetics and Population Analysis, Genome Analysis, Phylogenetics, Sequence Analysis, Structural Bioinformatics and Systems Biology). In these new models, the average success rate for predictions using the naive Bayes and logistic regression supervised classification methods was 89.4% and 91.5%, respectively, within the nine sections and for 4-year time horizon.

Availability: Supplementary material on this experimental survey is available at http://www.dia.fi.upm.es/∼concha/bioinformatics.html




Estimation of distribution algorithms as logistic regression regularizers of microarray classifiers

Archivo Digital UPM
  • Bielza Lozoya, María Concepción
  • Robles Forcada, Víctor
  • Larrañaga Múgica, Pedro María
Objectives: The “large k (genes), small N (samples)” phenomenon complicates the problem of microarray classification with logistic regression. The indeterminacy of the maximum likelihood solutions, multicollinearity of predictor variables and data over-fitting cause unstable parameter estimates. Moreover, computational problems arise due to the large number of predictor (genes) variables. Regularized logistic regression excels as a solution. However, the difficulties found here involve an objective function hard to be optimized from a mathematical viewpoint and a careful required tuning of the regularization parameters.
Methods: Those difficulties are tackled by introducing a new way of regularizing the logistic regression. Estimation of distribution algorithms (EDAs), a kind of evolutionary algorithms, emerge as natural regularizers. Obtaining the regularized estimates of the logistic classifier amounts to maximizing the likelihood function via our EDA, without having to be penalized. Likelihood penalties add a number of difficulties to the resulting optimization problems, which vanish in our case. Simulation of new estimates during the evolutionary process of EDAs is performed in such a way that guarantees their shrinkage while maintaining their probabilistic dependence relationships learnt. The EDA process is embedded in an adapted recursive feature elimination procedure, thereby providing the genes that are best markers for the classification.
Results: The consistency with the literature and excellent classification performance achieved with our algorithm are illustrated on four microarray data sets: Breast, Colon, Leukemia and Prostate. Details on the last two data sets are available as supplementary material.
Conclusions: We have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during EDA evolution process while optimizing the usual likelihood function. The approach is combined with a gene subset selection procedure and automatically tunes the required parameters. Empirical results on microarray data sets provide sparse models with confirmed genes and performing better in classification than other competing regularized methods.




Mining concept-drifting data streams containing labeled and unlabeled instances

Archivo Digital UPM
  • Borchani, Hanen
  • Larrañaga Múgica, Pedro María
  • Bielza Lozoya, María Concepción
Recently, mining data streams has attracted significant attention and has been considered as a challenging task in supervised classification. Most of the existing methods dealing with this problem assume the availability of entirely labeled data streams. Unfortunately, such assumption is often violated in real-world applications given that obtaining labels is a time-consuming and expensive task, while a large amount of unlabeled instances are readily available. In this paper, we propose a new approach for handling concept-drifting data streams containing labelled and unlabeled instances. First, we use KL divergence and bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any occurs, a new classifier is learned using the EM algorithm; otherwise, the current classifier is kept unchanged. Our approach is general so that it can be applied with different classification models. Experiments performed with naive Bayes and logistic regression, on two benchmark datasets, show the good performance of our approach using only limited amounts of labeled instances.




Using probabilistic dependencies improves the search of conductance-based compartmental neuron models

Archivo Digital UPM
  • Santana Hermida, Roberto
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
Conductance-based compartmental neuron models are traditionally used to investigate the electrophysiological properties of neurons. These models require a number of parameters to be adjusted to biological experimental data and this question can be posed as an optimization problem. In this paper we investigate the behavior of different estimation of distribution algorithms (EDAs) for this problem. We focus on studying the influence that the interactions between the neuron model conductances have in the complexity of the optimization problem. We support evidence that the use of these interactions during the optimization process can improve the EDA behavior.




Bivariate empirical and n-variate Archimedean copulas in estimation of distribution algorithms

Bivariate empirical and n-variate Archimedean copulas in estimation of distribution algorithms-->
Archivo Digital UPM
  • Cuesta-Infante, Alfredo
  • Santana Hermida, Roberto
  • Hidalgo Peréz, José Ignacio
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
This paper investigates the use of empirical and Archimedean copulas as probabilistic models of continuous estimation of distribution algorithms (EDAs). A method for learning and sampling empirical bivariate copulas to be used in the context of n-dimensional EDAs is first introduced. Then, by using Archimedean copulas instead of empirical makes possible to construct n-dimensional copulas with the same purpose. Both copula-based EDAs are compared to other known continuous EDAs on a set of 24 functions and different number of variables. Experimental results show that the proposed copula-based EDAs achieve a better behaviour than previous approaches in a 20% of the benchmark functions.




Synergies between network-based representation and probabilistic graphical models for classification, inference and optimization problems in neuroscience

Archivo Digital UPM
  • Santana Hermida, Roberto
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
Neural systems network-based representations are useful tools to analyze numerous phenomena in neuroscience. Probabilistic graphical models (PGMs) give a concise and still rich representation of complex systems from different domains, including neural systems. In this paper we analyze the characteristics of a bidirectional relationship between networks-based representations and PGMs. We show the way in which this relationship can be exploited introducing a number of methods for the solution of classification, inference and optimization problems. To illustrate the applicability of the introduced methods, a number of problems from the field of neuroscience, in which ongoing research is conducted, are used.




Learning a L1-regularized Gaussian Bayesian network in the space of equivalence classes

Archivo Digital UPM
  • Vidaurre Henche, Diego
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
Learning the structure of a graphical model from data is a common task in a wide range of practical applications. In this paper we focus on Gaussian Bayesian networks (GBN), that is, on continuous data and directed graphs. We propose to work in an equivalence class search space that, combined with regularization techniques to guide the search of the structure, allows to learn a sparse network close to the one that generated the data.




Learning CB-decomposable multi-dimensional Bayesian network classifiers

Archivo Digital UPM
  • Borchani, Hanen
  • Bielza Lozoya, María Concepción
  • Larrañaga Múgica, Pedro María
Multi-dimensional Bayesian network classifiers (MBCs) have been recently introduced to deal with multi-dimensional classification problems where instances are assigned to multiple classes. MBCs have a restricted topology partitioning the set of class and feature variables into three different subgraphs: class subgraph, feature subgraph and bridge subgraph. In this paper, we propose a novel learning algorithm for class-bridge (CB) decomposable MBCs into maximal connected components. Basically, based on a wrapper greedy forward selection approach, the algorithm firstly learns the bridge and feature subgraphs. Then, while the number of components is greater than one and there is an accuracy improvement, it iteratively and sequentially merges together the components, and updates the bridge and feature subgraphs. By learning CB-decomposable MBCs, the computations of MPE are alleviated comparing to general MBCs. Experimental comparison with state-of-the-art algorithms are carried out using synthetic and real-world data sets. The obtained results show the merits of our proposed algorithm.




Mining probabilistic models learned by EDAs in the optimization of multi-objective problems

Archivo Digital UPM
  • Santana Hermida, Roberto
  • Bielza Lozoya, María Concepción
  • Lozano Ruiz, José Antonio
  • Larrañaga Múgica, Pedro María
One of the uses of the probabilistic models learned by estimation of distribution algorithms is to reveal previous unknown information about the problem structure. In this paper we investigate the mapping between the problem structure and the dependencies captured in the probabilistic models learned by EDAs for a set of multi-objective satisfiability problems. We present and discuss the application of different data mining and visualization techniques for processing and visualizing relevant information from the structure of the learned probabilistic models. We show that also in the case of multi-objective optimization problems, some features of the original problem structure can be translated to the probabilistic models and unveiled by using algorithms that mine the model structures.