GRESEL-UAM: NARRATIVAS FINANCIERAS Y LITERATURA

PID2023-151280OB-C21

Funding agency: Agencia Estatal de Investigación
Funding agency acronym: AEI
Programme: Programa Estatal para Impulsar la Investigación Científico-Técnica y su Transferencia
Subprogramme: Subprograma Estatal de Generación de Conocimiento
Call: Proyectos de I+D+I (Generación de Conocimiento y Retos Investigación)
Call year: 2023
Management unit: Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023
Beneficiary institution: Universidad Autónoma de Madrid
Persistent identifier: http://dx.doi.org/10.13039/501100011033

Publications

Total results (including duplicates): 3

Trafalgar Neo4j Database

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Torterolo Orta, Yanco Amor
  • Roseti, Sofía Micaela
  • Moreno-Sandoval, Antonio
<p>The dataset is part of the project GRESEL-UAM: AI Generation Results Enriched with Simplified Explanations Based on Linguistic Features (Resultados de Generación de IA Enriquecidos con Explicaciones Simplificadas Basadas en Características Lingüísticas).</p>
<p>This dataset is part of the publication titled "Assessing a Literary RAG System with a Human-Evaluated Synthetic QA Dataset Generated by an LLM: Experiments with Knowledge Graphs," to be presented in September 2025 in Zaragoza at the conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). The work has been accepted for publication in SEPLN's official journal, Procesamiento del Lenguaje Natural.</p>

<p>This dataset consists of the Trafalgar knowledge graph database, based on the novel by Benito Pérez Galdós and implemented in Neo4j. This database is used in the RAG experiments presented in the publication. As a Knowledge Graph (KG), it offers several advantages over conventional RAG approaches (which are explored in the paper). The database structures the text of the novel and links elements such as paragraphs and chapters, as well as named entities like character names, places, and ships. More information about its creation and structure can be found in the methodology section.</p>
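<p>The kind of structure described above can be sketched in a few lines of plain Python. This is only an illustrative toy: the node labels (Chapter, Paragraph, Ship) and relationship names (PART_OF, MENTIONS) are assumptions for the sake of the example, not the actual Neo4j schema used in the paper.</p>

```python
# Toy in-memory sketch of the kind of structure the Trafalgar KG encodes.
# Labels and relationship names are illustrative assumptions, not the real schema.
nodes = {
    "ch1": {"label": "Chapter", "number": 1},
    "p1": {"label": "Paragraph", "text": "..."},
    "santisima_trinidad": {"label": "Ship", "name": "Santísima Trinidad"},
}
edges = [
    ("p1", "PART_OF", "ch1"),                   # paragraph belongs to a chapter
    ("p1", "MENTIONS", "santisima_trinidad"),   # paragraph mentions a named entity
]

def mentions(paragraph_id):
    """Entities mentioned by a paragraph, following MENTIONS edges."""
    return [dst for src, rel, dst in edges
            if src == paragraph_id and rel == "MENTIONS"]

print(mentions("p1"))  # ['santisima_trinidad']
```

<p>In the actual database, such lookups are Cypher traversals over the Neo4j graph rather than list comprehensions, but the linking pattern is the same.</p>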

<p>This is a neo4j.dump file, which contains an export of the Trafalgar database. This file can be used to replicate the database used in the experiments described in the paper.</p>




The Financial Document Causality Detection Shared Task (FinCausal 2025): Dataset

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Carbajo-Coronado, Blanca
  • Moreno-Sandoval, Antonio
  • Torterolo Orta, Yanco Amor
  • Gozalo, Paula
<p>The Financial Document Causality Detection Shared Task (FinCausal 2025) aims to improve causality identification in the financial domain through textual data.
This shared task focuses on determining causality associated with both events and quantified facts. In this task, a cause can be the justification of a statement or the reason explaining an outcome. Therefore, it is a relation detection task.
The main difference compared to the 2023 edition is that the task is framed as a Question Answering (QA) problem. The question is posed in an abstractive manner, while the predicted answer must be extractive. Additionally, the Semantic Answer Similarity (SAS) metric has been introduced.</p>
<p>Participants, given the context and the abstractive question, must extract the literal answer from the context that responds to that question. The questions seek causal-type relationships, either causes or effects.</p>
<p>The task dataset has been extracted from a corpus of Spanish financial annual reports from 2014 to 2018. Participants are provided with a CSV file containing the following fields: ID; Text; Question; Answer.</p>
<p>The standard way to participate is to fine-tune a model on the data annotated by linguists (with reported Inter-Annotator Agreement, IAA), and then use the fine-tuned model to predict the "ANSWER" field in the test set.</p>
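<p>As a minimal sketch, the task file can be read with Python's standard csv module. The column names follow the description above (ID; Text; Question; Answer), but the delimiter is an assumption here and the inline sample row is invented for illustration; both should be checked against the actual file.</p>

```python
import csv
import io

# Hypothetical sample in the described ID; Text; Question; Answer layout.
# The ';' delimiter is an assumption, not a confirmed property of the file.
sample = (
    "ID;Text;Question;Answer\n"
    "1;Revenue fell due to lower sales.;Why did revenue fall?;lower sales\n"
)

def load_fincausal(fileobj, delimiter=";"):
    """Read the task CSV into a list of dicts keyed by the header fields."""
    return list(csv.DictReader(fileobj, delimiter=delimiter))

rows = load_fincausal(io.StringIO(sample))
print(rows[0]["Answer"])  # the extractive answer span for the causal question
```

<p>A fine-tuning pipeline would then pair each Text/Question with its Answer span as training examples.</p>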
<p>This publication refers to the dataset used in the competition.</p>
<p>This is a dataset from the FinCausal 2025 competition. It is designed for participants to fine-tune their models on it and complete the task with the highest possible similarity to the gold standard, according to the established metrics.</p>
<p>It consists of texts annotated by linguists: each item provides a context, an abstractive question, and the corresponding extractive answer that addresses the causal nature of the question.</p>
<p>There are two versions available: one in English and one in Spanish.</p>




Synthetic datasets generated by Large Language Models

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Torterolo Orta, Yanco Amor
  • Roseti, Sofía Micaela
  • Moreno-Sandoval, Antonio
<p>This dataset is the result of work done in the project GRESEL-UAM: AI Generation Results Enriched with Simplified Explanations Based on Linguistic Features (Resultados de Generación de IA Enriquecidos con Explicaciones Simplificadas Basadas en Características Lingüísticas).</p>
<p>This dataset is part of the publication titled "Assessing a Literary RAG System with a Human-Evaluated Synthetic QA Dataset Generated by an LLM: Experiments with Knowledge Graphs," to be presented in September 2025 in Zaragoza at the conference of the Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN). The work has been accepted for publication in SEPLN's official journal, Procesamiento del Lenguaje Natural.</p>
<p>This dataset consists of three synthetically generated datasets, produced through a process known as Synthetic Data Generation (SDG). We used three different LLMs: deepseek-r1:14b, llama3.1:8b-instruct-q8_0, and mistral:7b-instruct. Each was given a prompt instructing it to generate a question answering (QA) dataset based on context fragments from the novel Trafalgar by Benito Pérez Galdós.</p>
<p>These datasets were later used to evaluate a Retrieval-Augmented Generation (RAG) system.</p>
<p>Three CSV files are provided, each corresponding to the synthetic dataset generated by one of the models; fields are tab-separated. In total, the three files contain 359 items. The header includes the following fields: id, context, question, answer, and success.</p>
<p>The id column is simply an identifier number. The context column contains the text fragment from which the model generated the questions and answers. The question and answer fields contain the generated questions and answers, respectively. The success column indicates whether the model successfully generated the question and answer in the corresponding fields ("yes" or "no").</p>
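<p>A short sketch of how these tab-separated files can be read and filtered on the success flag, using only the standard library. The inline sample rows are invented for illustration; only the header fields and the tab delimiter come from the description above.</p>

```python
import csv
import io

# Inline sample mimicking the described layout: tab-separated with header
# id, context, question, answer, success. The row values are invented.
sample = (
    "id\tcontext\tquestion\tanswer\tsuccess\n"
    "1\tFragment about the battle.\tWho narrates the novel?\tGabriel\tyes\n"
    "2\tAnother fragment.\t\t\tno\n"
)

def successful_pairs(fileobj):
    """Keep only rows where the model produced a usable question/answer pair."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    return [row for row in reader if row["success"] == "yes"]

pairs = successful_pairs(io.StringIO(sample))
print(len(pairs))  # only the row flagged success == "yes" survives
```

<p>Filtering on the success column in this way yields the subset of QA pairs actually usable for evaluating the RAG system.</p>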