Publications

Total results (including duplicates): 12
1 page(s) found

Art-GenEvalGPT

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • D'Haro Enríquez, Luis Fernando
  • Gil Martín, Manuel
  • Luna Jiménez, Cristina
  • Esteban Romero, Sergio
  • Estecha Garitagoitia, Marcos
  • Bellver Soler, Jaime
  • Fernández Martínez, Fernando
Description of the project

ASTOUND is an EIC-funded project (No. 101071191) under the HORIZON-EIC-2021-PATHFINDERCHALLENGES-01 call.

The aim of the project is to develop an artificial conscious AI based on the Attention Schema Theory (AST) proposed by Michael Graziano. This theory proposes that consciousness arises from the brain's ability to create and maintain a simplified model of its own processing, particularly focusing attention on certain aspects of its internal and external environment.

The project entails creating an AI system capable of exhibiting consciousness-like behaviours by implementing principles from the AST. This involves constructing a model that simulates attentional processes, allowing the AI to prioritise and focus on relevant information while disregarding irrelevant stimuli.

The ASTOUND project will provide an Integrative Approach for Awareness Engineering to establish consciousness in machines, targeting the following goals:

Develop an AI architecture for Artificial Consciousness based on the Attention Schema Theory (AST) through an internal model of the state of the attention.

Implement the proposed architecture into a contextually aware virtual agent and demonstrate improved performance thanks to the Attention Schema, for instance by providing coherent discussion, self-regulation, short- and long-term memory, and personalisation capabilities.

Define novel ways to measure the presence and level of consciousness in both humans and machines.

Description of the dataset

The dataset includes synthetic dialogues in the art domain that can be used for training a chatbot to discuss artworks within a museum setting. Leveraging Large Language Models (LLMs), particularly ChatGPT, the dataset comprises over 13,000 dialogues generated using prompt-engineering techniques. The dialogues cover a wide range of user and chatbot behaviours, including expert guidance, tutoring, and handling toxic user interactions.

The ArtEmis dataset serves as a basis, containing emotion attributions and explanations for artworks sourced from the WikiArt website. From this dataset, 800 artworks were selected based on consensus among human annotators regarding elicited emotions, ensuring balanced representation across different emotions. However, an imbalance in the distribution of art styles was noted due to the emphasis on emotional balance.

Each dialogue is uniquely identified using a "DIALOGUE_ID", encoding information about the artwork discussed, emotions, chatbot behaviour, and more. The dataset is structured into multiple files for efficient navigation and analysis, including metadata, prompts, dialogues, and metrics.

Objective evaluation of the generated dialogues was conducted, focusing on profile discrimination, anthropic behaviour detection, and toxicity evaluation. Various syntactic and semantic-based metrics are employed to assess dialogue quality, along with sentiment and subjectivity analysis. Tools like the MS Azure Content Moderator API, the Detoxify library, and LlamaGuard aid in toxicity evaluation.
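As an illustration of the toxicity screening step mentioned above, a minimal sketch using the open-source Detoxify library could look like the following; the example turn and the 0.5 threshold are illustrative assumptions and do not reproduce the dataset's actual evaluation pipeline.

```python
# pip install detoxify
from detoxify import Detoxify

# Hypothetical chatbot turn to screen (not taken from the Art-GenEvalGPT dataset).
candidate_turn = "This painting is ugly and so are you."

# Score the turn with the pre-trained 'original' Detoxify model.
scores = Detoxify("original").predict(candidate_turn)

# 'scores' maps labels such as 'toxicity', 'insult' or 'threat' to probabilities;
# turns above a chosen threshold can be flagged for review.
if scores["toxicity"] > 0.5:
    print("Potentially toxic turn:", scores)
```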
The dataset's conclusion highlights the need for further work to handle biases, enhance toxicity detection, and incorporate multimodal information and contextual awareness. Future efforts will focus on expanding the dataset with additional tasks and improving chatbot capabilities for diverse scenarios.
Project: EC/HE/101071191




Interpreting Sign Language Recognition Using Transformers and MediaPipe Landmarks

Archivo Digital UPM
  • Luna Jiménez, Cristina
  • Gil Martín, Manuel
  • Kleinlein, Ricardo
  • San Segundo Hernández, Rubén
  • Fernández Martínez, Fernando
Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model. We propose to embed a learnable array of parameters into the model that performs an element-wise multiplication of the inputs. This learned array highlights the most informative input features that contributed to solving the recognition task, resulting in a human-interpretable vector that lets us interpret the model's predictions. We evaluate our approach on the public datasets WLASL100 (SLR) and IPNHand (gesture recognition). We believe that the insights gained in this way could be exploited for the development of more efficient SLR pipelines.
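A minimal sketch of the learnable element-wise gate described above, assuming PyTorch and an illustrative input size of 63 landmark coordinates per frame; the paper's exact layer placement and training setup are not reproduced here.

```python
import torch
import torch.nn as nn

class FeatureImportanceGate(nn.Module):
    """Learnable per-feature weights multiplied element-wise with the input."""

    def __init__(self, num_features: int):
        super().__init__()
        # One learnable weight per landmark coordinate, initialised to 1.
        self.weights = nn.Parameter(torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_features) sequence of MediaPipe landmarks.
        return x * self.weights

# Example: 21 hand landmarks x 3 coordinates = 63 features per frame (illustrative).
gate = FeatureImportanceGate(num_features=63)
frames = torch.randn(8, 50, 63)           # dummy batch of landmark sequences
gated = gate(frames)                       # gated features then feed a Transformer
importance = gate.weights.detach().abs()   # human-interpretable importance vector
```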




Sign Language Dataset for Automatic Motion Generation

Archivo Digital UPM
  • Villa Monedero, María
  • Gil Martín, Manuel
  • Sáez Trigueros, Daniel
  • Pomirski, Andrzej
  • San Segundo Hernández, Rubén
Several sign language datasets are available in the literature. Most of them are designed for sign language recognition and translation. This paper presents a new sign language dataset for automatic motion generation. This dataset includes phonemes for each sign (specified in HamNoSys, a transcription system developed at the University of Hamburg, Hamburg, Germany) and the corresponding motion information. The motion information includes sign videos and the sequence of extracted landmarks associated with relevant points of the skeleton (including face, arms, hands, and fingers). The dataset includes signs from three different subjects in three different positions, performing 754 signs, including the entire alphabet, numbers from 0 to 100, numbers for hour specification, months and weekdays, and the most frequent signs used in Spanish Sign Language (LSE). In total, there are 6786 videos and their corresponding phonemes (HamNoSys annotations). From each video, a sequence of landmarks was extracted using MediaPipe. The dataset allows training an automatic system for motion generation from sign language phonemes. This paper also presents preliminary results in motion generation from sign phonemes, obtaining a Dynamic Time Warping distance per frame of 0.37.
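The per-video landmark extraction could be sketched roughly as follows with MediaPipe Holistic and OpenCV; the function name and settings are illustrative assumptions rather than the dataset's exact extraction pipeline.

```python
# pip install mediapipe opencv-python
import cv2
import mediapipe as mp

def extract_landmark_sequence(video_path: str):
    """Return per-frame pose, face and hand landmarks for a sign video."""
    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    capture = cv2.VideoCapture(video_path)
    sequence = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frame_landmarks = []
        for group in (results.pose_landmarks, results.face_landmarks,
                      results.left_hand_landmarks, results.right_hand_landmarks):
            if group is not None:
                frame_landmarks.extend((lm.x, lm.y, lm.z) for lm in group.landmark)
        sequence.append(frame_landmarks)
    capture.release()
    holistic.close()
    return sequence
```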




Sign Language Motion Generation from Sign Characteristics

Archivo Digital UPM
  • Gil Martín, Manuel
  • Villa Monedero, María
  • Pomirski, Andrzej
  • Sáez Trigueros, Daniel
  • San Segundo Hernández, Rubén
This paper proposes, analyzes, and evaluates a deep learning architecture based on transformers for generating sign language motion from sign phonemes (represented using HamNoSys: a notation system developed at the University of Hamburg). The sign phonemes provide information about sign characteristics like hand configuration, localization, or movements. The use of sign phonemes is crucial for generating sign motion with a high level of detail (including finger extensions and flexions). The transformer-based approach also includes a stop detection module for predicting the end of the generation process. Both aspects, motion generation and stop detection, are evaluated in detail. For motion generation, the dynamic time warping distance is used to compute the similarity between two landmark sequences (ground truth and generated). The stop detection module is evaluated considering detection accuracy and ROC (receiver operating characteristic) curves. The paper proposes and evaluates several strategies to obtain the system configuration with the best performance. These strategies include different padding strategies, interpolation approaches, and data augmentation techniques. The best configuration of a fully automatic system obtains an average DTW distance per frame of 0.1057 and an area under the ROC curve (AUC) higher than 0.94.
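A short sketch of how a per-frame DTW distance between a ground-truth and a generated landmark sequence can be computed; the Euclidean frame distance and the normalisation by the reference length are assumptions made for illustration.

```python
import numpy as np

def dtw_distance_per_frame(reference: np.ndarray, generated: np.ndarray) -> float:
    """Dynamic Time Warping distance between two landmark sequences,
    normalised by the number of reference frames."""
    n, m = len(reference), len(generated)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(reference[i - 1] - generated[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / n

# Dummy sequences of 63-dimensional landmark frames (shapes are illustrative).
ref = np.random.rand(40, 63)
gen = np.random.rand(45, 63)
print(dtw_distance_per_frame(ref, gen))
```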




Improving Hand Pose Recognition using Localization and Zoom Normalizations over MediaPipe Landmarks

Archivo Digital UPM
  • Remiro Pérez, Miguel Ángel
  • Gil Martín, Manuel
  • San Segundo Hernández, Rubén
Hand Pose Recognition presents significant challenges that need to be addressed, such as varying lighting conditions or complex backgrounds, which can hinder accurate and robust hand pose estimation. This can be mitigated by employing MediaPipe to facilitate the efficient extraction of representative landmarks from static images, combined with the use of Convolutional Neural Networks. Extracting these landmarks from the hands mitigates the impact of lighting variability or the presence of complex backgrounds. However, the variability of the location and size of the hands is still not addressed by this process. Therefore, the use of processing modules to normalize these points regarding the location of the wrist and the zoom of the hands can significantly mitigate the effects of these variabilities. In all the experiments performed in this work, based on American Sign Language alphabet datasets of 870, 27,000, and 87,000 images, the application of the proposed normalizations has resulted in significant improvements in model performance in a resource-limited scenario. In particular, under conditions of high variability, applying both normalizations resulted in a performance increment of 45.08 %, increasing the accuracy from 43.94 ± 0.64 % to 89.02 ± 0.40 %.
Project: EC/HE/101071191
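The two normalizations can be sketched as follows, assuming MediaPipe's 21-point hand layout in which landmark 0 is the wrist; the exact scaling used in the paper is an assumption here.

```python
import numpy as np

def normalize_hand_landmarks(landmarks: np.ndarray) -> np.ndarray:
    """Localization and zoom normalization of a (21, 3) array of hand landmarks."""
    # Localization: translate so the wrist (MediaPipe landmark 0) is the origin.
    centered = landmarks - landmarks[0]
    # Zoom: rescale so the hand has unit size regardless of its distance to the camera.
    scale = np.max(np.linalg.norm(centered, axis=1))
    return centered / scale if scale > 0 else centered

hand = np.random.rand(21, 3)  # dummy landmarks in image coordinates
print(normalize_hand_landmarks(hand))
```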




Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots

Archivo Digital UPM
  • Rodríguez Cantelar, Mario
  • Estecha Garitagoitia, Marcos Santiago
  • D'Haro Enríquez, Luis Fernando
  • Matía Espada, Fernando
  • Córdoba Herralde, Ricardo de
Current State-of-the-Art (SotA) chatbots are able to produce high-quality sentences, handling different conversation topics and longer interaction times. Unfortunately, the generated responses depend greatly on the data on which they have been trained, the specific dialogue history and current turn used for guiding the response, the internal decoding mechanisms, and ranking strategies, among others. Therefore, it may happen that for semantically similar questions asked by users, the chatbot may provide a different answer, which can be considered a form of hallucination or a source of confusion in long-term interactions. In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detecting inconsistent answers using k-means and the Silhouette coefficient. To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach enables the detection of up to 18 different topics and 102 subtopics. For the purpose of detecting inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. Our experimental results demonstrate a weighted F-1 value of 0.34 for topic detection, a weighted F-1 value of 0.78 for subtopic detection in DailyDialog, and 81% and 62% accuracy for topic and subtopic classification in SGC5, respectively. Finally, to predict the number of different responses, we obtained a mean squared error (MSE) of 3.4 when testing smaller generative models and 4.9 in recent large language models.
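The inconsistency-detection idea of clustering responses and scoring candidate cluster counts with the Silhouette coefficient could be sketched with scikit-learn as follows; the embedding dimensionality and the range of k are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_num_distinct_answers(embeddings: np.ndarray, max_k: int = 6) -> int:
    """Return the cluster count with the best Silhouette score (>= 2 clusters)."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_k, len(embeddings) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Dummy sentence embeddings of responses to paraphrases of the same question.
responses = np.random.rand(12, 384)
print(estimate_num_distinct_answers(responses))
```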




Video Memorability Prediction From Jointly-learnt Semantic and Visual Features

Archivo Digital UPM
  • Martín Fernández, Iván
  • Kleinlein, Ricardo
  • Luna Jiménez, Cristina
  • Gil Martín, Manuel
  • Fernández Martínez, Fernando
The memorability of a video is defined as an intrinsic property of its visual features that dictates the fraction of people who recall having watched it on a second viewing within a memory game. Still, unravelling what are the key features to predict memorability remains an obscure matter. This challenge is addressed here by fine-tuning text and image encoders using a cross-modal strategy known as Contrastive Language-Image Pre-training (CLIP). The resulting video-level data representations learned include semantics and topic-descriptive information as observed from both modalities, hence enhancing the predictive power of our algorithms. Our proposal achieves in the text domain a significantly greater Spearman Rank Correlation Coefficient (SRCC) than a default pre-trained text encoder (0.575 ± 0.007 and 0.538 ± 0.007, respectively) over the Memento10K dataset. A similar trend, although less pronounced, can be noticed in the visual domain. We believe these findings signal the potential benefits that cross-modal predictive systems can extract from being fine-tuned to the specific issue of media memorability.
Project: EC/HE/101071191
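For reference, the SRCC reported above is the Spearman Rank Correlation Coefficient between predicted and ground-truth memorability scores, which can be computed with SciPy; the scores below are made-up examples.

```python
from scipy.stats import spearmanr

# Hypothetical memorability scores for a handful of validation videos.
ground_truth = [0.81, 0.92, 0.64, 0.77, 0.88]
predictions = [0.79, 0.95, 0.61, 0.80, 0.84]

# SRCC compares the rankings induced by the two score lists.
srcc, p_value = spearmanr(ground_truth, predictions)
print(f"SRCC = {srcc:.3f} (p = {p_value:.3f})")
```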




A Comprehensive Analysis of Parkinson’s Disease Detection Through Inertial Signal Processing

Archivo Digital UPM
  • Gil Martín, Manuel
  • Esteban Romero, Sergio
  • Fernández Martínez, Fernando
  • San Segundo Hernández, Rubén
When developing deep learning systems for Parkinson's Disease (PD) detection using inertial sensors, a comprehensive analysis of some key factors, including data distribution, signal processing domain, number of sensors, and analysis window size, is imperative to refine tremor detection methodologies. Leveraging the PD-BioStampRC21 dataset with accelerometer recordings, our state-of-the-art deep learning architecture extracts a PD biomarker. Applying Fast Fourier Transform (FFT) magnitude coefficients as a preprocessing step improves PD detection in Leave-One-Subject-Out Cross-Validation (LOSO CV), achieving 66.90% accuracy with a single sensor and 6.4-second windows, compared to 60.33% using raw samples. Integrating information from all five sensors boosts performance to 75.10%. Window size analysis shows that 3.2-second windows of FFT coefficients from all sensors outperform shorter or longer windows, with a window-level accuracy of 80.49% and a user-level accuracy of 93.55% in a LOSO scenario.
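A rough sketch of the windowing plus FFT-magnitude preprocessing described above; the sampling rate and the 6.4-second window length used here are illustrative assumptions rather than the exact dataset parameters.

```python
import numpy as np

def fft_magnitude_windows(signal: np.ndarray, fs: float = 30.0,
                          window_seconds: float = 6.4) -> np.ndarray:
    """Split a (samples, axes) accelerometer signal into fixed-length windows and
    keep the FFT magnitude coefficients of each window."""
    window_size = int(window_seconds * fs)
    n_windows = len(signal) // window_size
    windows = signal[: n_windows * window_size].reshape(n_windows, window_size, -1)
    # Real FFT along the time axis; the magnitude spectra become the model input.
    return np.abs(np.fft.rfft(windows, axis=1))

accel = np.random.randn(10_000, 3)  # dummy tri-axial accelerometer recording
features = fft_magnitude_windows(accel)
print(features.shape)  # (n_windows, n_frequency_bins, 3)
```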




Parkinson’s Disease Detection Through Inertial Signals and Posture Insights

Archivo Digital UPM
  • Gil Martín, Manuel
  • Esteban Romero, Sergio
  • Fernández Martínez, Fernando
  • San Segundo Hernández, Rubén
In the development of deep learning systems aimed at detecting Parkinson's Disease (PD) using inertial sensors, some aspects could be essential to refine tremor detection methodologies in realistic scenarios. This work analyses the effect of the subjects' posture during tremor recordings and the required amount of data to assess a proper PD detection in a Leave-One-Subject-Out Cross-Validation (LOSO CV) scenario. We propose a deep learning architecture that learns a PD biomarker from accelerometer signals to classify subjects between healthy and PD patients. This study uses the PD-BioStampRC21 dataset, containing accelerometer recordings from healthy and PD participants equipped with five inertial sensors. An improvement in performance was obtained when using sitting windows compared to lying windows for the Fast Fourier Transform (FFT) input signal domain. Moreover, using 5 minutes per subject could be sufficient to properly evaluate the PD status of a patient without losing performance, reaching a window-level accuracy of 77.71 ± 1.07 % and a user-level accuracy of 87.10 ± 11.80 %. Furthermore, a knowledge transfer could be performed when training the system with sitting instances and testing with lying examples, indicating that the sitting activity contains valuable information that allows an effective generalization to lying instances.
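The Leave-One-Subject-Out protocol used in both Parkinson's studies can be sketched with scikit-learn's LeaveOneGroupOut, treating subject identifiers as groups; the data shapes below are dummy values.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Dummy feature windows, labels (0 = healthy, 1 = PD) and subject identifiers.
X = np.random.rand(100, 97 * 3)
y = np.random.randint(0, 2, size=100)
subjects = np.repeat(np.arange(10), 10)

# Leave-One-Subject-Out: each subject is held out exactly once for testing.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    held_out = subjects[test_idx][0]
    # Train on X[train_idx], y[train_idx]; evaluate on the held-out subject's windows.
    print(f"held-out subject: {held_out}, train windows: {len(train_idx)}")
```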




Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4

Archivo Digital UPM
  • Rodríguez Cantelar, Mario
  • Zhang, Chen
  • Tang, Chengguang
  • Shi, Ke
  • Ghazarian, Sarik
  • Sedoc, João
  • D'Haro Enríquez, Luis Fernando
  • Rudnicky, Alexander
The advent and fast development of neural networks have revolutionized the research on dialogue systems and subsequently have triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems as an open challenge has been the center of attention for many researchers. Despite the consistent efforts to improve automatic metrics' correlations with human evaluation, there have been very few attempts to assess their robustness over multiple domains and dimensions. Also, their focus is mainly on the English language. All of these challenges prompt the development of automatic evaluation metrics that are reliable in various domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.




Multimodal Audio-Language Model for Speech Emotion Recognition

Archivo Digital UPM
  • Bellver Soler, Jaime
  • Martín Fernández, Iván
  • Bravo Pacheco, Jose Manuel
  • Esteban Romero, Sergio
  • Fernández Martínez, Fernando
  • D'Haro Enríquez, Luis Fernando
In this paper, we present an approach to speech emotion recognition (SER) leveraging advances in machine learning and audio processing, particularly through the integration of large language models (LLMs) with audio capabilities. Our proposed architecture combines an audio encoder, specifically the Whisper-large-v3 model [1], with the LLMs Phi 1.5 [2] and Gemma 2b [3] to create a robust and effective system for categorizing emotions in speech. We compare the performance of our models against existing approaches, achieving outstanding results in speech emotion recognition. Our findings demonstrate the effectiveness of audio-language models (ALMs), with the Whisper-large-v3 and Gemma 2b combination outperforming other alternatives.
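A heavily simplified sketch of how a Whisper encoder could be coupled to a causal LLM through a learned projection, assuming the Hugging Face Transformers library; the model names, the linear projector, and the way audio embeddings are fed to the LLM are illustrative assumptions and do not reproduce the paper's exact architecture.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel

class AudioLanguageSER(nn.Module):
    """Whisper encoder features projected into an LLM's embedding space (sketch)."""

    def __init__(self, audio_model="openai/whisper-large-v3",
                 llm_model="microsoft/phi-1_5"):
        super().__init__()
        self.audio_encoder = WhisperModel.from_pretrained(audio_model).encoder
        self.llm = AutoModelForCausalLM.from_pretrained(llm_model)
        # Map audio hidden states to the LLM embedding dimension.
        self.projector = nn.Linear(self.audio_encoder.config.d_model,
                                   self.llm.config.hidden_size)

    def forward(self, input_features, labels=None):
        # input_features: log-mel spectrograms produced by WhisperFeatureExtractor.
        audio_states = self.audio_encoder(input_features).last_hidden_state
        audio_embeds = self.projector(audio_states)
        # Feed the projected audio tokens to the LLM as input embeddings; a full
        # system would also append a text prompt asking for the emotion label.
        return self.llm(inputs_embeds=audio_embeds, labels=labels)
```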




THAURUS: An Innovative Multimodal Chatbot Based on the Next Generation of Conversational AI

Archivo Digital UPM
  • Estecha Garitagoitia, Marcos Santiago
  • Rodríguez Cantelar, Mario
  • Garrachón Ruiz, Alfredo
  • Fernández García, Claudia Garoé
  • Esteban Romero, Sergio
  • Conforto López, Cristina
  • Saiz Fernández, Alberto
  • Fernández Salvador, Luis Fernando
  • D'Haro Enríquez, Luis Fernando
The next generation of conversational AI has brought incredible capabilities such as high contextuality, naturalness, multimodality, and extended knowledge, but also important challenges such as high user expectations, high latencies, large computational requirements, as well as more subtle problems such as mismatch on existing databases for fine-tuning purposes, difficulties for pre-trained LLMs models to handle dialogue interactions, and the integration of multimodal capabilities.

This paper describes the architecture, methodology, and results of our THAURUS chatbot developed for the Alexa Prize Socialbot Grand Challenge (SGC5). Our proposal relies on several innovative ideas to take advantage of existing LLMs to create engaging user experiences that are capable of handling real users in a scalable way and without compromising the competition rules. Different SotA dialogue generators were fine-tuned and incorporated to provide variability and handle the wide range of conversation topics; we also developed mechanisms to control the quality of the responses (e.g., detecting and handling toxic interactions, keeping topic coherence, and increasing engagement by providing up-to-date information in a conversational style).

In addition, our system extends the capabilities of the Cobot architecture by incorporating modules to automatically generate images, provide voice cloning with fictional characters, serve contextual sounds for entities detected in the dialogue, improve capitalization and punctuation, and provide natural expressions of interest.

Finally, we also included a trained generative selector and a reference-free model for automatic evaluation of turns that could reduce latencies and complement the ranker’s capabilities to select the best generative answer.