Publications

Total results (including duplicates): 21
Found 1 page(s)

Art-GenEvalGPT

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • D'Haro Enríquez, Luis Fernando
  • Gil Martín, Manuel
  • Luna Jiménez, Cristina
  • Esteban Romero, Sergio
  • Estecha Garitagoitia, Marcos
  • Bellver Soler, Jaime
  • Fernández Martínez, Fernando
Description of the project

ASTOUND is an EIC-funded project (No. 101071191) under the HORIZON-EIC-2021-PATHFINDERCHALLENGES-01 call.

The aim of the project is to develop an artificial conscious AI based on the Attention Schema Theory (AST) proposed by Michael Graziano. This theory proposes that consciousness arises from the brain's ability to create and maintain a simplified model of its own processing, particularly focusing attention on certain aspects of its internal and external environment.

The project entails creating an AI system capable of exhibiting consciousness-like behaviours by implementing principles from the AST. This involves constructing a model that simulates attentional processes, allowing the AI to prioritise and focus on relevant information while disregarding irrelevant stimuli.

The ASTOUND project will provide an Integrative Approach for Awareness Engineering to establish consciousness in machines, targeting the following goals:

  • Develop an AI architecture for Artificial Consciousness based on the Attention Schema Theory (AST) through an internal model of the state of attention.
  • Implement the proposed architecture in a contextually aware virtual agent and demonstrate improved performance thanks to the Attention Schema, for instance by providing coherent discussion, self-regulation, short- and long-term memory, and personalisation capabilities.
  • Define novel ways to measure the presence and level of consciousness in both humans and machines.

Description of the dataset

The dataset includes synthetic dialogues in the art domain that can be used for training a chatbot to discuss artworks within a museum setting. Leveraging Large Language Models (LLMs), particularly ChatGPT, the dataset comprises over 13,000 dialogues generated using prompt-engineering techniques. The dialogues cover a wide range of user and chatbot behaviours, including expert guidance, tutoring, and handling toxic user interactions.

The ArtEmis dataset serves as a basis, containing emotion attributions and explanations for artworks sourced from the WikiArt website. From this dataset, 800 artworks were selected based on consensus among human annotators regarding elicited emotions, ensuring balanced representation across different emotions. However, an imbalance in the distribution of art styles was noted due to the emphasis on emotional balance.

Each dialogue is uniquely identified by a "DIALOGUE_ID", which encodes information about the artwork discussed, emotions, chatbot behaviour, and more. The dataset is structured into multiple files for efficient navigation and analysis, including metadata, prompts, dialogues, and metrics.

An objective evaluation of the generated dialogues was conducted, focusing on profile discrimination, anthropic behaviour detection, and toxicity evaluation. Various syntactic and semantic metrics are employed to assess dialogue quality, along with sentiment and subjectivity analysis. Tools such as the MS Azure Content Moderator API, the Detoxify library, and LlamaGuard aid in toxicity evaluation.

The conclusions highlight the need for further work to handle biases, enhance toxicity detection, and incorporate multimodal information and contextual awareness. Future efforts will focus on expanding the dataset with additional tasks and improving chatbot capabilities for diverse scenarios.
Project: EC/HE/101071191
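As an illustration of the toxicity screening mentioned in the dataset description, a minimal sketch using the open-source Detoxify library is shown below; the file name and column layout are hypothetical placeholders, since the exact file structure of the dataset is not detailed here.

```python
# Minimal sketch: scoring dialogue turns for toxicity with Detoxify.
# The CSV path and column names are hypothetical placeholders.
import pandas as pd
from detoxify import Detoxify

dialogues = pd.read_csv("art_genevalgpt_dialogues.csv")   # hypothetical file
model = Detoxify("original")   # pre-trained multi-label toxicity classifier

# Detoxify returns per-label probabilities (toxicity, insult, threat, ...)
scores = model.predict(dialogues["chatbot_turn"].tolist())
dialogues["toxicity"] = scores["toxicity"]

# Flag turns above an arbitrary screening threshold
flagged = dialogues[dialogues["toxicity"] > 0.5]
print(f"{len(flagged)} potentially toxic turns out of {len(dialogues)}")
```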




Interpreting Sign Language Recognition Using Transformers and MediaPipe Landmarks

Archivo Digital UPM
  • Luna Jiménez, Cristina
  • Gil Martín, Manuel
  • Kleinlein, Ricardo
  • San Segundo Hernández, Rubén
  • Fernández Martínez, Fernando
Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model. We propose to embed a learnable array of parameters into the model that performs an element-wise multiplication of the inputs. This learned array highlights the most informative input features that contributed to solving the recognition task, resulting in a human-interpretable vector that lets us interpret the model's predictions. We evaluate our approach on the public datasets WLASL100 (SLR) and IPNHand (gesture recognition). We believe that the insights gained in this way could be exploited for the development of more efficient SLR pipelines.
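A rough sketch of the interpretability mechanism described above, a learnable array that element-wise multiplies the landmark inputs before the Transformer, is given below in PyTorch; the feature dimension, encoder depth, and pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GatedLandmarkTransformer(nn.Module):
    """Transformer classifier with a learnable element-wise input gate.

    The gate vector (one weight per landmark coordinate) multiplies every
    input frame; after training, its magnitudes indicate which MediaPipe
    landmark features the model relied on most. Sizes are illustrative.
    """
    def __init__(self, n_features=150, d_model=128, n_classes=100):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_features))   # learnable importance array
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):                # x: (batch, frames, n_features)
        x = x * self.gate                # element-wise multiplication of the inputs
        h = self.encoder(self.proj(x))   # contextualise frames
        return self.classifier(h.mean(dim=1))   # pool over time and classify

model = GatedLandmarkTransformer()
logits = model(torch.randn(2, 64, 150))        # dummy batch of landmark sequences
importance = model.gate.detach().abs()         # human-interpretable importance vector
```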




Sign Language Dataset for Automatic Motion Generation

Archivo Digital UPM
  • Villa Monedero, María
  • Gil Martín, Manuel
  • Sáez Trigueros, Daniel
  • Pomirski, Andrzej
  • San Segundo Hernández, Rubén
Several sign language datasets are available in the literature. Most of them are designed for sign language recognition and translation. This paper presents a new sign language dataset for automatic motion generation. This dataset includes phonemes for each sign (specified in HamNoSys, a transcription system developed at the University of Hamburg, Hamburg, Germany) and the corresponding motion information. The motion information includes sign videos and the sequence of extracted landmarks associated with relevant points of the skeleton (including face, arms, hands, and fingers). The dataset includes signs from three different subjects in three different positions, performing 754 signs including the entire alphabet, numbers from 0 to 100, numbers for hour specification, months, and weekdays, and the most frequent signs used in Spanish Sign Language (LSE). In total, there are 6786 videos and their corresponding phonemes (HamNoSys annotations). From each video, a sequence of landmarks was extracted using MediaPipe. The dataset allows training an automatic system for motion generation from sign language phonemes. This paper also presents preliminary results in motion generation from sign phonemes obtaining a Dynamic Time Warping distance per frame of 0.37.
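The landmark-extraction step mentioned above can be sketched with MediaPipe's Holistic solution roughly as follows; the video path and the choice of which landmark groups to keep are assumptions for illustration only.

```python
import cv2
import mediapipe as mp

# Extract pose, hand and face landmarks from a sign video (path is a placeholder).
holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
cap = cv2.VideoCapture("sign_example.mp4")

sequence = []   # one list of (x, y, z) tuples per frame
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    frame_landmarks = []
    for group in (results.pose_landmarks, results.left_hand_landmarks,
                  results.right_hand_landmarks, results.face_landmarks):
        if group is not None:
            frame_landmarks.extend((lm.x, lm.y, lm.z) for lm in group.landmark)
    sequence.append(frame_landmarks)

cap.release()
holistic.close()
print(f"Extracted landmarks for {len(sequence)} frames")
```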




Sign Language Motion Generation from Sign Characteristics

Archivo Digital UPM
  • Gil Martín, Manuel
  • Villa Monedero, María
  • Pomirski, Andrzej
  • Sáez Trigueros, Daniel
  • San Segundo Hernández, Rubén
This paper proposes, analyzes, and evaluates a deep learning architecture based on transformers for generating sign language motion from sign phonemes (represented using HamNoSys: a notation system developed at the University of Hamburg). The sign phonemes provide information about sign characteristics like hand configuration, localization, or movements. The use of sign phonemes is crucial for generating sign motion with a high level of detail (including finger extensions and flexions). The transformer-based approach also includes a stop detection module for predicting the end of the generation process. Both aspects, motion generation and stop detection, are evaluated in detail. For motion generation, the dynamic time warping distance is used to compute the similarity between two landmark sequences (ground truth and generated). The stop detection module is evaluated considering detection accuracy and ROC (receiver operating characteristic) curves. The paper proposes and evaluates several strategies to obtain the system configuration with the best performance. These strategies include different padding strategies, interpolation approaches, and data augmentation techniques. The best configuration of a fully automatic system obtains an average DTW distance per frame of 0.1057 and an area under the ROC curve (AUC) higher than 0.94.
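The evaluation metric named above, a Dynamic Time Warping distance normalised per frame between ground-truth and generated landmark sequences, can be computed with a small self-contained routine such as the one below (illustrative; Euclidean frame-to-frame costs are assumed).

```python
import numpy as np

def dtw_distance_per_frame(ref, gen):
    """DTW distance between two landmark sequences, normalised per frame.

    ref, gen: arrays of shape (n_frames, n_coords). The frame-to-frame cost
    is assumed to be the Euclidean distance between landmark vectors.
    """
    n, m = len(ref), len(gen)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - gen[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m] / n   # normalise by the reference length

ref = np.random.rand(120, 99)   # dummy ground-truth sequence
gen = np.random.rand(110, 99)   # dummy generated sequence
print(f"DTW distance per frame: {dtw_distance_per_frame(ref, gen):.4f}")
```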




Improving Hand Pose Recognition using Localization and Zoom Normalizations over MediaPipe Landmarks

Archivo Digital UPM
  • Remiro Pérez, Miguel Ángel
  • Gil Martín, Manuel
  • San Segundo Hernández, Rubén
Hand Pose Recognition presents significant challenges that need to be addressed, such as varying lighting conditions or complex backgrounds, which can hinder accurate and robust hand pose estimation. This can be mitigated by employing MediaPipe to facilitate the efficient extraction of representative landmarks from static images, combined with the use of Convolutional Neural Networks. Extracting these landmarks from the hands mitigates the impact of lighting variability or the presence of complex backgrounds. However, the variability of the location and size of the hands is still not addressed by this process. Therefore, the use of processing modules to normalize these points regarding the location of the wrist and the zoom of the hands can significantly mitigate the effects of these variabilities. In all the experiments performed in this work, based on American Sign Language alphabet datasets of 870, 27,000, and 87,000 images, the application of the proposed normalizations resulted in significant improvements in model performance in a resource-limited scenario. In particular, under conditions of high variability, applying both normalizations resulted in a performance increase of 45.08 percentage points, raising the accuracy from 43.94 ± 0.64% to 89.02 ± 0.40%.
Project: EC/HE/101071191
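A minimal sketch of the two normalizations discussed above, translating the landmarks relative to the wrist and rescaling them to a common hand size, is shown below; it assumes the 21-point MediaPipe hand layout in which landmark 0 is the wrist, and the exact scale definition is an illustrative choice.

```python
import numpy as np

def normalize_hand_landmarks(landmarks):
    """Localization and zoom normalization of MediaPipe hand landmarks.

    landmarks: array of shape (21, 3), landmark 0 being the wrist.
    Returns landmarks centred on the wrist and scaled to unit extent,
    removing variability due to the hand's location and size in the image.
    """
    wrist = landmarks[0]
    centred = landmarks - wrist                        # localization normalization
    scale = np.max(np.linalg.norm(centred, axis=1))    # hand "zoom": max distance to wrist
    return centred / scale if scale > 0 else centred   # zoom normalization

hand = np.random.rand(21, 3)   # dummy hand pose
print(normalize_hand_landmarks(hand)[:3])
```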




Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots

Archivo Digital UPM
  • Rodríguez Cantelar, Mario
  • Estecha Garitagoitia, Marcos Santiago
  • D'Haro Enríquez, Luis Fernando
  • Matía Espada, Fernando
  • Córdoba Herralde, Ricardo de
Current State-of-the-Art (SotA) chatbots are able to produce high-quality sentences, handling different conversation topics and longer interaction times. Unfortunately, the generated responses depend greatly on the data on which they have been trained, the specific dialogue history and current turn used for guiding the response, the internal decoding mechanisms, and ranking strategies, among others. Therefore, it may happen that for semantically similar questions asked by users, the chatbot provides a different answer, which can be considered a form of hallucination or a source of confusion in long-term interactions. In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detecting inconsistent answers using k-means and the Silhouette coefficient. To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach enables the detection of up to 18 different topics and 102 subtopics. For the purpose of detecting inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. Our experimental results demonstrate a weighted F-1 value of 0.34 for topic detection and a weighted F-1 value of 0.78 for subtopic detection in DailyDialog, and 81% and 62% accuracy for topic and subtopic classification in SGC5, respectively. Finally, to predict the number of different responses, we obtained a mean squared error (MSE) of 3.4 when testing smaller generative models and 4.9 in recent large language models.
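The inconsistency-detection idea above, clustering the answers a chatbot gives to paraphrases of the same question and using the Silhouette coefficient to estimate how many distinct answers exist, could be sketched as follows; the embedding model is an arbitrary choice for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Answers produced by a chatbot for paraphrases of the same user question.
answers = [
    "The Louvre is in Paris.",
    "You can find the Louvre in Paris, France.",
    "The Louvre museum is located in London.",   # inconsistent answer
    "It is in Paris.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
embeddings = embedder.encode(answers)

# Choose the number of semantically distinct answers via the Silhouette coefficient.
best_k, best_score = 1, -1.0
for k in range(2, len(answers)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"Estimated number of distinct responses: {best_k}")   # > 1 signals an inconsistency
```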




Video Memorability Prediction From Jointly-learnt Semantic and Visual Features

Archivo Digital UPM
  • Martín Fernández, Iván
  • Kleinlein, Ricardo
  • Luna Jiménez, Cristina
  • Gil Martín, Manuel
  • Fernández Martínez, Fernando
The memorability of a video is defined as an intrinsic property of its visual features that dictates the fraction of people who recall having watched it on a second viewing within a memory game. Still, unravelling which features are key to predicting memorability remains an open question. This challenge is addressed here by fine-tuning text and image encoders using a cross-modal strategy known as Contrastive Language-Image Pre-training (CLIP). The resulting video-level data representations learned include semantics and topic-descriptive information as observed from both modalities, hence enhancing the predictive power of our algorithms. Our proposal achieves in the text domain a significantly greater Spearman Rank Correlation Coefficient (SRCC) than a default pre-trained text encoder (0.575 ± 0.007 and 0.538 ± 0.007, respectively) over the Memento10K dataset. A similar trend, although less pronounced, can be noticed in the visual domain. We believe these findings signal the potential benefits that cross-modal predictive systems can extract from being fine-tuned to the specific issue of media memorability.
Project: EC/HE/101071191
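As a rough illustration of the pipeline described above, the sketch below extracts CLIP text embeddings for video captions, fits a simple regressor on memorability scores, and reports the Spearman Rank Correlation Coefficient; the dummy data and the Ridge regressor are placeholders and do not reproduce the paper's fine-tuning setup.

```python
import numpy as np
import torch
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from transformers import CLIPModel, CLIPProcessor

# Dummy captions and memorability scores standing in for Memento10K-style data.
captions = ["a dog runs on the beach", "a person opens a door", "fireworks at night"] * 10
scores = np.random.rand(len(captions))

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=captions, return_tensors="pt", padding=True)
    text_features = model.get_text_features(**inputs).numpy()

# Fit a simple regressor (standing in for the fine-tuned heads) and evaluate with SRCC.
split = len(captions) // 2
reg = Ridge().fit(text_features[:split], scores[:split])
preds = reg.predict(text_features[split:])
srcc, _ = spearmanr(preds, scores[split:])
print(f"SRCC on the held-out half: {srcc:.3f}")
```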




A Comprehensive Analysis of Parkinson’s Disease Detection Through Inertial Signal Processing

Archivo Digital UPM
  • Gil Martín, Manuel
  • Esteban Romero, Sergio
  • Fernández Martínez, Fernando
  • San Segundo Hernández, Rubén
When developing deep learning systems for Parkinson's Disease (PD) detection using inertial sensors, a comprehensive analysis of some key factors, including data distribution, signal processing domain, number of sensors, and analysis window size, is imperative to refine tremor detection methodologies. Leveraging the PD-BioStampRC21 dataset with accelerometer recordings, our state-of-the-art deep learning architecture extracts a PD biomarker. Applying Fast Fourier Transform (FFT) magnitude coefficients as a preprocessing step improves PD detection in Leave-One-Subject-Out Cross-Validation (LOSO CV), achieving 66.90% accuracy with a single sensor and 6.4-second windows, compared to 60.33% using raw samples. Integrating information from all five sensors boosts performance to 75.10%. Window size analysis shows that 3.2-second windows of FFT coefficients from all sensors outperform shorter or longer windows, with a window-level accuracy of 80.49% and a user-level accuracy of 93.55% in a LOSO scenario.
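The preprocessing described above, slicing each accelerometer recording into fixed-length windows and feeding the magnitude of their FFT coefficients to the network, can be sketched as follows; the 31.25 Hz sampling rate is an assumption consistent with the 0 to 15.625 Hz range mentioned for this dataset in a later entry.

```python
import numpy as np

FS = 31.25               # sampling rate in Hz (assumed)
WINDOW = int(FS * 6.4)   # 6.4-second windows -> 200 samples

def fft_magnitude_windows(signal, window=WINDOW, hop=WINDOW):
    """Slice a 3-axis accelerometer signal into windows of FFT magnitudes.

    signal: array of shape (n_samples, 3). Returns an array of shape
    (n_windows, window // 2 + 1, 3): the one-sided magnitude spectrum per
    axis, which is the representation fed to the deep learning model.
    """
    windows = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window]
        windows.append(np.abs(np.fft.rfft(chunk, axis=0)))
    return np.stack(windows)

acc = np.random.randn(31_250, 3)      # dummy 1000-second recording
features = fft_magnitude_windows(acc)
print(features.shape)                 # (156, 101, 3)
```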




Parkinson’s Disease Detection Through Inertial Signals and Posture Insights

Archivo Digital UPM
  • Gil Martín, Manuel
  • Esteban Romero, Sergio
  • Fernández Martínez, Fernando
  • San Segundo Hernández, Rubén
In the development of deep learning systems aimed at detecting Parkinson's Disease (PD) using inertial sensors, some aspects could be essential to refine tremor detection methodologies in realistic scenarios. This work analyses the effect of the subjects' posture during tremor recordings and the amount of data required for proper PD detection in a Leave-One-Subject-Out Cross-Validation (LOSO CV) scenario. We propose a deep learning architecture that learns a PD biomarker from accelerometer signals to classify subjects as healthy or PD patients. This study uses the PD-BioStampRC21 dataset, containing accelerometer recordings from healthy and PD participants equipped with five inertial sensors. A performance improvement was obtained when using sitting windows compared to lying windows in the Fast Fourier Transform (FFT) input signal domain. Moreover, using 5 minutes per subject could be sufficient to properly evaluate the PD status of a patient without losing performance, reaching a window-level accuracy of 77.71 ± 1.07% and a user-level accuracy of 87.10 ± 11.80%. Furthermore, knowledge transfer could be performed when training the system with sitting instances and testing with lying examples, indicating that the sitting activity contains valuable information that allows effective generalization to lying instances.
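The Leave-One-Subject-Out Cross-Validation protocol used in these experiments can be expressed with scikit-learn's LeaveOneGroupOut, as in the hedged sketch below; the features and the classifier are placeholders for the deep learning architecture described in the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Dummy window-level features, labels (0 = healthy, 1 = PD) and subject IDs.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 303))            # e.g. flattened FFT windows
y = rng.integers(0, 2, size=300)
subjects = rng.integers(0, 10, size=300)   # 10 subjects

logo = LeaveOneGroupOut()
accuracies = []
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))   # window-level accuracy

print(f"Mean LOSO window-level accuracy: {np.mean(accuracies):.3f}")
```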




Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4

Archivo Digital UPM
  • Rodríguez Cantelar, Mario
  • Zhang, Chen
  • Tang, Chengguang
  • Shi, Ke
  • Ghazarian, Sarik
  • Sedoc, João
  • D'Haro Enríquez, Luis Fernando
  • Rudnicky, Alexander
The advent and fast development of neural networks have revolutionized the research on dialogue systems and subsequently have triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems as an open challenge has been the center of attention of many researchers. Despite the consistent efforts to improve automatic metrics' correlations with human evaluation, there have been very few attempts to assess their robustness over multiple domains and dimensions. Also, their focus is mainly on the English language. All of these challenges prompt the development of automatic evaluation metrics that are reliable in various domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.




Multimodal Audio-Language Model for Speech Emotion Recognition

Archivo Digital UPM
  • Bellver Soler, Jaime
  • Martín Fernández, Iván
  • Bravo Pacheco, Jose Manuel
  • Esteban Romero, Sergio
  • Fernández Martínez, Fernando
  • D'Haro Enríquez, Luis Fernando
In this paper, we present an approach to speech emotion recognition (SER) leveraging advances in machine learning and audio processing, particularly through the integration of large language models (LLMs) with audio capabilities. Our proposed architecture combines an audio encoder, specifically the Whisper-large-v3 model [1], with LLMs Phi 1.5 [2] and Gemma 2b [3] to create a robust and effective system for categorizing emotions in speech. We compare the performance of our models against existing approaches, achieving outstanding results in speech emotion recognition. Our findings demonstrate the effectiveness of audio-language models (ALMs), with the Whisper-large-v3 and Gemma 2b combination outperforming other alternatives.
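A simplified view of the architecture outlined above, an audio encoder whose hidden states are projected into the language model's embedding space and prepended to the text tokens, might look like the following; the hidden sizes and the single linear bridge are assumptions, not the exact published design.

```python
import torch
import torch.nn as nn

class AudioToLLMBridge(nn.Module):
    """Project audio-encoder states into an LLM's embedding space.

    Dimensions are illustrative: 1280 matches Whisper-large-style encoders,
    2048 matches small decoder-only LLMs such as Phi 1.5 or Gemma 2b.
    """
    def __init__(self, audio_dim=1280, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_states, text_embeddings):
        # audio_states: (batch, audio_frames, audio_dim)
        # text_embeddings: (batch, text_tokens, llm_dim), e.g. the task prompt
        audio_tokens = self.proj(audio_states)              # soft "audio tokens"
        return torch.cat([audio_tokens, text_embeddings], dim=1)

bridge = AudioToLLMBridge()
fused = bridge(torch.randn(2, 50, 1280), torch.randn(2, 16, 2048))
print(fused.shape)   # (2, 66, 2048): the sequence the LLM would attend over
```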




THAURUS: An Innovative Multimodal Chatbot Based on the Next Generation of Conversational AI

Archivo Digital UPM
  • Estecha Garitagoitia, Marcos Santiago
  • Rodríguez Cantelar, Mario
  • Garrachón Ruiz, Alfredo
  • Fernández García, Claudia Garoé
  • Esteban Romero, Sergio
  • Conforto López, Cristina
  • Saiz Fernández, Alberto
  • Fernández Salvador, Luis Fernando
  • D'Haro Enríquez, Luis Fernando
The next generation of conversational AI has brought incredible capabilities such as high contextuality, naturalness, multimodality, and extended knowledge, but also important challenges such as high user expectations, high latencies, and large computational requirements, as well as more subtle problems such as mismatches in existing databases for fine-tuning purposes, difficulties for pre-trained LLMs in handling dialogue interactions, and the integration of multimodal capabilities.

This paper describes the architecture, methodology, and results of our THAURUS chatbot developed for the Alexa Prize Socialbot Grand Challenge (SGC5). Our proposal relies on several innovative ideas to take advantage of existing LLMs to create engaging user experiences that are capable of handling real users in a scalable way and without compromising the competition rules. Different SotA dialogue generators were fine-tuned and incorporated to provide variability and handle the wide range of conversation topics; we also developed mechanisms to control the quality of the responses (e.g., detecting and handling toxic interactions, keeping topic coherence, and increasing engagement by providing up-to-date information in a conversational style).

In addition, our system extends the capabilities of the Cobot architecture by incorporating modules to automatically generate images, provide voice-cloning capabilities with fictional characters, serve contextual sounds for entities detected in the dialogue, improve capitalization and punctuation, and provide natural expressions of interest.

Finally, we also included a trained generative selector and a reference-free model for automatic evaluation of turns that could reduce latencies and complement the ranker’s capabilities to select the best generative answer.




Evaluating emotional and subjective responses in synthetic art-related dialogues: A multi-stage framework with large language models

Archivo Digital UPM
  • Luna Jiménez, Cristina
  • Gil Martín, Manuel
  • D'Haro Enríquez, Luis Fernando
  • Fernández Martínez, Fernando
  • San Segundo Hernández, Rubén
The appearance of Large Language Models (LLMs) has represented a qualitative step forward in the performance of conversational agents, and even in the generation of creative texts. However, previous applications of these models to dialogue generation neglected the impact of 'hallucinations' in the context of generating synthetic dialogues, thus omitting this central aspect in their evaluations. For this reason, we propose GenEvalGPT, an open-source and flexible framework: a comprehensive multi-stage evaluation strategy utilizing diverse metrics. The objective is two-fold: first, to assess the extent to which synthetic dialogues between a chatbot and a human align with the specified commands, determining the successful creation of these dialogues based on the provided specifications; and second, to evaluate various aspects of emotional and subjective responses. Assuming that the dialogues to be evaluated were synthetically produced from specific profiles, the first evaluation stage utilizes LLMs to reconstruct the original templates employed in dialogue creation. The success of this reconstruction is then assessed in a second stage using lexical and semantic objective metrics. On the other hand, crafting a chatbot's behaviors demands careful consideration to encompass the diverse range of interactions it is meant to engage in. Synthetic dialogues play a pivotal role in this context, as they can be deliberately synthesized to emulate various behaviors. This is precisely the objective of the third stage: evaluating whether the generated dialogues adhere to the required aspects concerning emotional and subjective responses. To validate the capabilities of the proposed framework, we applied it to recognize whether the chatbot exhibited one of two distinct behaviors in the synthetically generated dialogues: being emotional and providing subjective responses, or remaining neutral. This evaluation encompasses traditional metrics and automatic metrics generated by the LLM. In our use case of art-related dialogues, our findings reveal that the capacity to recover templates or profiles is more effective for information or profile items that are objective and factual, in contrast to those related to mental states or subjective facts. For the emotional and subjective behavior assessment, rule-based metrics achieved 79% accuracy in detecting emotions or subjectivity (anthropic behavior), and the LLM automatic metrics achieved 82%. The combination of these metrics and stages could help to decide which of the generated dialogues should be maintained depending on the applied policy, which could vary from preserving 57% to 93% of the initial dialogues.
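The second-stage comparison between the original template items and the values reconstructed by the LLM relies on lexical and semantic metrics; a hedged sketch of two such metrics (token-overlap F1 and embedding cosine similarity, with an illustrative embedding model) is shown below.

```python
from sentence_transformers import SentenceTransformer, util

def lexical_f1(reference, candidate):
    """Token-overlap F1 between a template item and its reconstruction."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Illustrative example: one profile item and the value recovered by the LLM.
reference = "The chatbot behaves as an expert museum guide"
candidate = "An expert guide in a museum"

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative choice
semantic = util.cos_sim(embedder.encode(reference), embedder.encode(candidate)).item()

print(f"lexical F1: {lexical_f1(reference, candidate):.2f}, semantic: {semantic:.2f}")
```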




Ethical Foresight of Artificial Consciousness: The Case for a “Conscious Copilot”

Archivo Digital UPM
  • Salgado Criado, Jesús
  • Elamrani, Aida
  • Fernández Aller, María Celia
This article serves a twofold purpose. First, it introduces a potential technological application inspired by ongoing research in artificial consciousness and guided by ethical values. Second, it conducts a preliminary ethical evaluation of this proposed technological advancement, aligning with the concept of ethical foresight.

The envisioned product, termed "Conscious Copilot" (CoCo) for the purposes of this article, aims to augment human awareness of both the physical and digital environments surrounding the user. Inspired by Graziano's attention schema theory, which links social competences like empathy to models of others' awareness, CoCo acts as an impartial observer, leveraging its unique vantage point to provide valuable insights. These insights include broader situational awareness, identification of potential biases, detection of fake content, and recognition of hidden agendas and both explicit and tacit intentions. Ultimately, CoCo strives to enhance the user's agency by facilitating a more conscious understanding of their situation.

The development process of this ethical foresight exercise commences with a declaration outlining the core values that would guide CoCo's initial conception, such as trustworthiness, good governance, frugality, and an ethics of care. This approach emphasizes the integration of ethical aspirations, rather than solely ethical limitations (guardrails), into the research and development of technological artifacts. Following a detailed definition of CoCo's functionalities, interactions, and two illustrative use cases, a dedicated section will address potential ethical controversies, which represents a more conventional approach to ethics foresight.

This paper aspires to serve as an exemplar of an ethical foresight cycle. It goes beyond mere critical analysis of a technology's potential drawbacks and actively contributes to the design of a morally desirable technological artifact.
Project: EC/HE/101071191




A dataset of synthetic art dialogues with ChatGPT

Archivo Digital UPM
  • Gil Martín, Manuel
  • Luna Jiménez, Cristina
  • Esteban Romero, Sergio
  • Estecha Garitagoitia, Marcos Santiago
  • Fernández Martínez, Fernando
  • D'Haro Enríquez, Luis Fernando
This paper introduces Art_GenEvalGPT, a novel dataset of synthetic dialogues centered on art and generated with ChatGPT. Unlike existing datasets focused on conventional art-related tasks, Art_GenEvalGPT delves into nuanced conversations about art, encompassing a wide variety of artworks, artists, and genres, and incorporating emotional interventions, integrating speakers' subjective opinions and different roles for the conversational agents (e.g., teacher-student, expert guide, anthropic behavior, or handling toxic users). The generation and evaluation stages of the GenEvalGPT platform are used to create the dataset, which includes 13,870 synthetic dialogues, covering 799 distinct artworks, 378 different artists, and 26 art styles. Automatic and manual assessments prove the high quality of the generated synthetic dialogues. For profile recovery, promising lexical and semantic metrics are obtained for objective and factual attributes. For subjective attributes, the evaluation for detecting emotions or subjectivity in the interventions achieves 92% accuracy using LLM self-assessment metrics.




Dual Leap Motion Controller 2: A Robust Dataset for Multi-view Hand Pose Recognition

Archivo Digital UPM
  • Gil Martín, Manuel
  • Marini, Marco
  • San Segundo Hernández, Rubén
  • Cinque, Luigi
This paper presents Multi-view Leap2 Hand Pose Dataset (ML2HP Dataset), a new dataset for hand pose recognition, captured using a multi-view recording setup with two Leap Motion Controller 2 devices. This dataset encompasses a diverse range of hand poses, recorded from different angles to ensure comprehensive coverage. The dataset includes real images with the associated precise and automatic hand properties, such as landmark coordinates, velocities, orientations, and finger widths. This dataset has been meticulously designed and curated to maintain a balance in terms of subjects, hand poses, and the usage of right or left hand, ensuring fairness and parity. The content includes 714,000 instances from 21 subjects of 17 different hand poses (including real images and 247 associated hand properties). The multi-view setup is necessary to mitigate hand occlusion phenomena, ensuring continuous tracking and pose estimation required in real human-computer interaction applications. This dataset contributes to advancing the field of multimodal hand pose recognition by providing a valuable resource for developing advanced artificial intelligence human computer interfaces.




Larger encoders, smaller regressors: exploring label dimensionality reduction and multimodal large language models as feature extractors for predicting social perception

Archivo Digital UPM
  • Martín Fernández, Iván
  • Esteban Romero, Sergio
  • Bellver Soler, Jaime
  • Fernández Martínez, Fernando
  • Gil Martín, Manuel
Designing reliable automatic models for social perception can contribute to a better understanding of human behavior, enabling more trustworthy experiences in the online multimedia communication environment. However, predicting social attributes from video data remains challenging due to the complex interplay of visual, auditory, and linguistic cues. In this paper, we address this challenge by investigating the effectiveness of Multimodal Large Language Models (MM-LLMs) for feature extraction in the MuSe-Perception challenge. Firstly, our analysis of the novel LMU-ELP dataset has revealed high correlations between certain perceptual dimensions, motivating the use of a single regression model for all 16 social attributes to be predicted for a set of speakers appearing in recorded video clips. We demonstrate that dimensionality reduction through Principal Component Analysis (PCA) can be applied to the label space without a relevant performance loss. Secondly, by employing frozen MM-LLMs as feature extractors, we explore their ability to capture perception-related information. We extract sequence embeddings from the Qwen-VL and Qwen-Audio models and train a MultiLayer Perceptron over the attention-pooled vectors for each one of the encoders, obtaining a mean Pearson correlation of 0.22 using the average predictions for both models. Our best result of 0.31 is achieved by training the same architecture over the baseline vit-ver and w2v-msp features, which motivates further exploration of how to effectively leverage advanced MM-LLMs as feature extractors. Lastly, a post hoc analysis of our results highlights the limitations of Pearson correlation for evaluating regression performance in this context. In particular, a similar Pearson coefficient can be obtained with two very different prediction sets displaying different levels of variability. We take this result as a call to action in exploring alternative metrics to assess regression performance for this task.
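The label-space reduction described above (fitting PCA on the 16 perceptual attributes, regressing in the reduced space, and projecting the predictions back before scoring with Pearson correlation) could be sketched as follows; the random data and the number of retained components are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 512)), rng.normal(size=(50, 512))   # pooled embeddings
Y_train, Y_test = rng.normal(size=(200, 16)), rng.normal(size=(50, 16))     # 16 attributes

# Reduce the correlated 16-dimensional label space (component count is illustrative).
pca = PCA(n_components=4).fit(Y_train)
Z_train = pca.transform(Y_train)

# A single MLP predicts the reduced labels; predictions are mapped back to 16 attributes.
mlp = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
mlp.fit(X_train, Z_train)
Y_pred = pca.inverse_transform(mlp.predict(X_test))

# Mean Pearson correlation across attributes, the challenge's evaluation metric.
mean_r = np.mean([pearsonr(Y_test[:, i], Y_pred[:, i])[0] for i in range(16)])
print(f"Mean Pearson correlation: {mean_r:.3f}")
```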




Multilingual Speech Emotion Recognition combining Audio and Text Information using Small Language Models

Archivo Digital UPM
  • Bellver Soler, Jaime
  • Rodríguez Cantelar, Mario
  • Estecha Garitagoitia, Marcos Santiago
  • Córdoba Herralde, Ricardo de
  • D'Haro Enríquez, Luis Fernando
In this work, we present a multimodal Small Language Model (SLM) architecture designed for multilingual Speech Emotion Recognition (SER). Our approach integrates a transformer-based audio encoder with an SLM, using a linear projection layer that bridges audio inputs with textual comprehension. This integration enables the SLM to effectively process and understand spoken language, enhancing its capability to recognize emotional nuances. We experiment with various state-of-the-art (SoTA) SLMs and evaluate them across five different datasets representing a variety of European languages: German, Portuguese, Italian, Spanish, and English. By leveraging both audio signals and their corresponding transcriptions, our model achieves comparable performance in SER tasks for each language with respect to SoTA models. Our results demonstrate the robustness of our architecture.




Frequency analysis and transfer learning across different body sensor locations in Parkinson's disease detection using inertial signals

Archivo Digital UPM
  • Rey Díaz, Alejandro
  • Martín Fernández, Iván
  • San Segundo Hernández, Rubén
  • Gil Martín, Manuel
A detailed analysis of the inertial signals input is required when using deep learning models for Parkinson’s Disease detection. This work explores the possibility of reducing the input size of the models by studying the most appropriate frequency range and determines if it is feasible to evaluate subjects with different sensor locations than those used during training. For experimentation, 3.2 s windows are used to classify signals between Parkinson’s patients and control subjects, applying Fast Fourier Transform to the inertial signals and following a Leave-One-Subject-Out Cross-Validation methodology for the PD-BioStampRC21 dataset. It has been observed that the frequency range of 0 to 5 Hz offers a classification accuracy rate of 75.75 ± 0.62% using the five available sensors for training and evaluation, which is close to the model’s performance over the entire frequency range, from 0 to 15.625 Hz, which is 77.46 ± 0.60%. Regarding the transfer learning between sensors located in different body parts, it was observed that training and evaluating the model using data from the right forearm resulted in an accuracy of 65.17 ± 0.69%. When the model was trained with data from the opposite forearm, the accuracy was similar, at 63.57 ± 0.69%. Likewise, comparable results were found when using data from the other forearm and when training and evaluating with opposite thighs, with accuracy reductions not exceeding 3%.
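The frequency-range reduction discussed above amounts to keeping only the FFT bins up to 5 Hz; assuming 3.2-second windows and the 0 to 15.625 Hz one-sided spectrum stated in the abstract (i.e. a 31.25 Hz sampling rate), the bin selection works out as follows.

```python
import numpy as np

FS = 31.25               # sampling rate implied by the 0-15.625 Hz spectrum
WINDOW = int(3.2 * FS)   # 100 samples per 3.2-second window

window = np.random.randn(WINDOW, 3)               # dummy 3-axis accelerometer window
spectrum = np.abs(np.fft.rfft(window, axis=0))    # one-sided magnitude spectrum, 51 bins
freqs = np.fft.rfftfreq(WINDOW, d=1 / FS)         # bin frequencies: 0 ... 15.625 Hz

low_band = spectrum[freqs <= 5.0]                 # keep only the 0-5 Hz band
print(spectrum.shape, low_band.shape)             # (51, 3) (17, 3): roughly a third of the input
```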




Transformer-based prediction of hospital readmissions for diabetes patients

Archivo Digital UPM
  • García Mosquera, Jorge
  • Villa Monedero, María
  • Gil Martín, Manuel
  • San Segundo Hernández, Rubén
Artificial intelligence is having a strong impact on healthcare services, improving their quality and efficiency. This paper proposes and evaluates a prediction system of hospital readmissions for diabetes patients. This system is based on a Transformer, a state-of-the-art deep learning architecture integrating different types of information and features in the same model. This architecture integrates several attention heads to model the contribution of each feature to the global prediction. The main target of this work is to provide a decision support tool to help manage hospital resources effectively. This system was developed and evaluated using the United States Health Facts Database, which includes information and features from 101,766 diabetes patients between 1999 and 2008. The experiments were conducted using a patient-wise cross-validation strategy, ensuring that the patients used to develop the system were not used in the final test. These experiments demonstrated the Transformer's strong ability to combine different features, providing slightly better results than those previously reported on this dataset. These experiments allow us to report the prediction accuracy for different numbers of classes. Finally, this paper provides a detailed analysis of the relevance of each feature when predicting hospital readmissions.
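The patient-wise cross-validation strategy mentioned above, in which all encounters of a given patient fall on the same side of the train/test split, can be reproduced with scikit-learn's GroupKFold; the features and classifier below are placeholders for the Transformer described in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))                 # encoded encounter features (placeholder)
y = rng.integers(0, 2, size=1000)               # readmitted vs. not (illustrative labels)
patient_ids = rng.integers(0, 200, size=1000)   # several encounters per patient

# GroupKFold guarantees that no patient appears in both training and test folds.
cv = GroupKFold(n_splits=5)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, groups=patient_ids)
print(f"Patient-wise CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```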




Overview of the ninth dialog system technology challenge: DSTC9

Archivo Digital UPM
  • Gunasekara, Chulaka
  • Kim, Seokhwan
  • D'Haro Enríquez, Luis Fernando
  • Rastogi, Abhinav
  • Chen, Yun-Nung
  • Eric, Mihail
  • Hedayatnia, Behnam
  • Gopalakrishnan, Karthik
  • Liu, Yang
  • Huang, Chao-Wei
  • Hakkani Tur, Dilek
  • Li, Jinchao
  • Zhu, Qi
  • Luo, Lingxiao
  • Liden, Lars
  • Huang, Kaili
  • Shayandeh, Shahin
  • Liang, Runze
  • Peng, Baolin
  • Zhang, Zheng
  • Shukla, Swadheen
  • Huang, Minlie
  • Gao, Jianfeng
  • Mehri, Shikib
  • Feng, Yulan
  • Gordon, Carla
  • Alavi, Seyed Hossein
  • Traum, David
  • Eskenazi, Maxine
  • Beirami, Ahmad
  • Cho, Eunjoon
  • Crook, Paul A.
  • De, Ankita
  • Geramifard, Alborz
  • Satwik, Kottur
  • Moon, Seungwhan
  • Poddar, Shivani
  • Subba, Rajen
This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems, namely, 1. Task-oriented dialog Modeling with Unstructured Knowledge Access, 2. Multi-domain task-oriented dialog, 3. Interactive evaluation of dialog and 4. Situated interactive multimodal dialog. This paper describes the task definition, provided datasets, baselines, and evaluation setup for each track. We also summarize the results of the submitted systems to highlight the general trends of the state-of-the-art technologies for the tasks.