METODOS DE LA LINGUISTICA COMPUTACIONAL PARA LA LEGIBILIDAD Y SIMPLIFICACION AUTOMATICA EN NARRATIVA FINANCIERA (CLARA-FINT)

PID2020-116001RB-C31

Funding agency name: Agencia Estatal de Investigación
Funding agency acronym: AEI
Programme: Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Subprogramme: Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Call: Proyectos I+D
Call year: 2020
Management unit: Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020
Beneficiary centre: UNIVERSIDAD AUTONOMA DE MADRID
Persistent identifier: http://dx.doi.org/10.13039/501100011033

Publications

Total results (including duplicates): 11

A Discourse Marker Tagger for Spanish using Transformers, Etiquetador automático de Marcadores Discursivos mediante Transformers

RUA. Repositorio Institucional de la Universidad de Alicante
  • García Toro, Ana
  • Porta Zamorano, Jordi
  • Moreno Sandoval, Antonio
We present an automatic discourse marker (DM) tagger developed using manual annotation and machine learning. The tagger was developed on a dataset of financial letters, on which human annotators following a dedicated annotation guide reached an inter-annotator agreement (IAA) of 0.897. With the annotated dataset, a prototype was built by fine-tuning pre-trained Transformer models for the task, reaching an F1-score of 0.933. An evaluation of the results obtained by the tagger is included. The research has been carried out within the CLARA-FINT project (PID2020-116001RB-C31), funded by the Spanish Ministry of Science and Innovation.
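The abstract reports an F1-score without saying how a predicted marker is matched against a gold one. The following is a minimal sketch of one common convention, exact-match scoring over (start, end) character spans. The function name `span_f1` and the toy offsets are hypothetical and are not taken from the paper.

```python
def span_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (invented offsets): three gold markers, three predictions,
# two of which coincide exactly with a gold span.
gold_spans = [(0, 9), (40, 51), (88, 99)]
predicted_spans = [(0, 9), (40, 51), (120, 131)]
precision, recall, f1 = span_f1(gold_spans, predicted_spans)
```

Under this convention a prediction that overlaps a gold marker only partially counts as an error; a token-level or partial-match scheme would give different figures.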




Los marcadores del discurso en la narrativa financiera: análisis de las cartas a los accionistas, Discourse markers in the financial narrative: analysis of letters to shareholders

RUA. Repositorio Institucional de la Universidad de Alicante
  • García Toro, Ana
Discourse markers (DMs) are elements of textual cohesion that guide discourse inferences within and beyond the sentence. The inferences they signal include addition, counter-argumentation, reinforcement, attenuation and reformulation, all of which are constantly involved in argumentation and discourse construction. Letters to shareholders, written by the company's chief executive, summarise and evaluate the company's financial year: they set out the main figures or objectives achieved that year, with the aim of maintaining shareholders' confidence and/or attracting potential investors. This work combines a quantitative study, which obtains the frequencies of the markers; a qualitative study, which presents a case analysis with examples that tie the data back to the text; and a contrastive study between two financial corpora: texts from companies with profits and texts from companies with losses. The aim is to describe the distribution of discourse markers in a financial corpus in order to determine whether the use and frequency of some discourse particles in the letters are conditioned by the company's results. The main findings are that, in the case of profits, inferences of information addition and argumentative reinforcement predominate, whereas in the case of losses counter-argumentative markers and emotive modality operators are used. Moreover, an appearance of security and hope is sought through modal operators, and we find widespread use of the particle adicionalmente ("additionally"), potentially due to Anglo-Saxon influence. The originality of this work lies in the fact that it is carried out in the unexplored field of financial narrative in Spanish. In essence, our study combines tools from computational linguistics with an empirical corpus-linguistics methodology to get a little closer to the study of argumentation and the strategies used in financial discourse. This research has been funded by the Spanish Ministry of Science and Innovation within the framework of the CLARA-FINT project. Code: PID2020-116001RB-C31.




The financial document causality detection shared task (FinCausal 2023): Dataset

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Moreno-Sandoval, Antonio
  • Carbajo-Coronado, Blanca
  • Porta, Jordi
<p>The Financial Document Causality Detection Task (FinCausal 2023) aims at improving the detection of causality in the financial domain through its texts. Participants are asked to identify, in causal sentences, which elements of the sentence relate to the cause and which relate to the effect. LLI-UAM is the organizer of the Spanish subtask. The task dataset has been extracted from a corpus of Spanish financial annual reports from 2014 to 2018.
This shared task focuses on determining causality associated with either events or quantified facts. For this task, a cause can be the justification for a statement or the reason that explains a result. It is therefore a relationship detection task. The aim is to identify, in a paragraph, the causal elements and the consequential ones. Only one causal element and one effect are expected in each paragraph.</p>
<p>Participants are provided with a sample of paragraphs, labelled through inter-annotator agreement. This publication consists of the dataset of the shared task.</p>
<p>This is the dataset of the FinCausal 2023 competition. Participants use it to fine-tune their models so that their output is as similar as possible to the gold standard. It consists of paragraphs on financial topics annotated by linguists, who highlighted the cause and the effect present in each paragraph.</p>
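<p>As a rough illustration of the task described above, the sketch below represents one annotated paragraph with a single cause and a single effect, and compares a predicted span with the gold one by token overlap. The class name, field names, example sentence and the overlap measure are all assumptions made for illustration; the actual file layout and the official scoring of the shared task are not described here.</p>

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CausalExample:
    """One paragraph with exactly one cause span and one effect span."""
    text: str
    cause: str
    effect: str

    def is_consistent(self) -> bool:
        # Assumption: both spans are literal substrings of the paragraph.
        return self.cause in self.text and self.effect in self.text


def token_overlap_f1(predicted: str, gold: str) -> float:
    """F1 over whitespace tokens shared by the predicted and gold spans."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented example sentence, not taken from the dataset.
example = CausalExample(
    text="Las ventas cayeron un 5 % debido a la subida de los tipos de interés.",
    cause="la subida de los tipos de interés",
    effect="Las ventas cayeron un 5 %",
)
score = token_overlap_f1("subida de los tipos de interés", example.cause)
```

<p>Here the prediction misses only the article "la", so the overlap score is high but below 1.</p>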




SIMFIN: Simplificador y detector léxico financiero automático (aplicación web en Python)

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Torterolo Orta, Yanco Amor
  • Moreno-Sandoval, Antonio
<p>This project is a master's thesis (TFM) carried out during the internship of the student Yanco Amor Torterolo Orta at LLI-UAM. It belongs to the CLARA-FINT project, which focuses on plain language. More specifically, it explores various simplification techniques in the financial domain. This is motivated by the complexity of financial reality and the specialised semantic load of its texts. Simplification is therefore needed in order to convey the message to a wider audience.</p>
<p>SIMFIN is a program that aims to serve this purpose through lexical simplification of financial-domain terms (referred to as "terminological units", "UT", from the Spanish "unidades terminológicas"). It consists of a publicly accessible web page (http://leptis.lllf.uam.es/simfin) where the user enters a financial/economic text. The program then returns the same text with colour tagging that highlights the UTs. Some UTs (dark blue) show their simplified version in a pop-up window when clicked, which happens when the substitution is multiword; others (green) directly show the UT that has replaced them, which happens when direct substitution is viable because they consist of a single word. In addition, the program highlights in yellow both the UTs from the lists that do not yet have a substitution and possible UTs identified by their structure, and in turquoise the anglicisms and English words. The program includes a legend and a sample text that explain everything with a practical example.</p>
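<p>The colour scheme described above can be sketched as a small lookup function. This is not the code of app.py: the function name, the dictionary contents and the rule that the colour depends on the number of words in the substitution are assumptions made for illustration, and the detection of possible UTs by their structure is left out.</p>

```python
def classify_term(term, substitutions, detect_list, english_words):
    """Return (colour, replacement) for one candidate term.

    substitutions: dict mapping a UT to its simplified form (hypothetical data)
    detect_list:   set of UTs that have no substitution yet
    english_words: set of anglicisms and English words
    """
    key = term.lower()
    if key in substitutions:
        replacement = substitutions[key]
        # Assumption: a multiword substitution goes to a pop-up (dark blue),
        # a one-word substitution replaces the term directly (green).
        if len(replacement.split()) > 1:
            return "dark_blue", replacement
        return "green", replacement
    if key in detect_list:
        return "yellow", None
    if key in english_words:
        return "turquoise", None
    return None, None


# Toy data invented for this sketch.
substitutions = {"ebitda": "beneficio bruto de explotación", "pasivo": "deuda"}
detect_list = {"flujo de caja"}
english_words = {"rating"}

results = {
    term: classify_term(term, substitutions, detect_list, english_words)
    for term in ["EBITDA", "pasivo", "flujo de caja", "rating", "empresa"]
}
```

<p>A web front end would then map each colour to a CSS class and attach the replacement either as visible text or as pop-up content.</p>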

<p>The dataset:</p>
<p>On the one hand, there is the folder with all the files and structure needed to run the program. It contains the following files: app.py, es_dicc.db, static/icons/*.png, ut_detect.txt, en_dicc.db, sample.txt, templates/index.html, ut_sust.csv.</p>
<p>The program is written mostly in Python, but it has HTML and CSS elements that interact with Python through the Flask library. However, the files ut_detect.txt and ut_sust.csv are not included in this publication because they belong to another publication. The first is the list of UTs to be detected and the second is the list of UTs to be simplified.</p>
<p>On the other hand, access is provided to the master's thesis (TFM) that describes the creation of the program.</p>




Automatic discourse markers extractor

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Moreno-Sandoval, Antonio
  • Porta, Jordi
  • García Toro, Ana
<p>This work is framed in the Spanish national project CLARA-FINT. The aim of this task within the project was to create an automatic discourse marker extractor for Spanish. The first step was to apply linguistic annotation to texts containing such markers. The next step used these annotations to fine-tune a model for the discourse marker extraction task. This publication contains the fine-tuned model, i.e., the automatic extractor.</p>

<p>It is a roberta-bne model that was fine-tuned for the discourse marker extraction task. Texts annotated by linguists, in which discourse markers were highlighted within their context, were used as training data for fine-tuning.</p>
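<p>An extractor of this kind is typically used as a token classifier whose per-token labels are then merged into marker spans. The sketch below shows only that merging step, assuming a BIO scheme with the labels B-DM, I-DM and O. The label names, the function name and the example sentence are assumptions; the label set of the published model is not stated here.</p>

```python
def decode_bio(tokens, tags):
    """Merge per-token BIO tags into a list of discourse-marker strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-DM":
            # A new marker starts; close any marker still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-DM" and current:
            current.append(token)
        else:
            # "O", or a stray "I-DM" with no open marker, ends the span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented sentence with hand-written tags standing in for model output.
tokens = ["Sin", "embargo", ",", "los", "ingresos", "crecieron", "y", ",",
          "además", ",", "mejoró", "el", "margen", "."]
tags = ["B-DM", "I-DM", "O", "O", "O", "O", "O", "O",
        "B-DM", "O", "O", "O", "O", "O"]
markers = decode_bio(tokens, tags)
```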




Financial ES-EN parallel corpus from annual reports

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Moreno-Sandoval, Antonio
  • Torterolo Orta, Yanco Amor
  • Roseti, Sofía Micaela
  • Carbajo-Coronado, Blanca
  • Porta, Jordi
<p>The creation of this dataset is framed in the Spanish national project CLARA-FINT. It is a dataset of parallel bilingual texts (EN-ES) taken from the annual reports of the main Spanish listed companies. These reports are usually publicly available in the shareholders section of each company's website. Creating the dataset involved cleaning the reports and manually aligning them in both languages. It served as a gold standard to fine-tune a model for this automatic alignment task. Note that this publication only contains translation units in CSV format, with one column per language. A total of 5000 units were selected at random and shuffled.</p>
<p>This is a CSV file containing the Spanish text in the first column and the equivalent English text in the second. They are equivalent because the English version is a translation of the Spanish. As for the original files, both versions are official and available on each company's website. In short, the file is a translation memory with 5000 translation units that were randomly selected and shuffled.</p>
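<p>A file with this layout can be read with Python's standard csv module. The sketch below assumes a comma-delimited file with quoted fields and no header row, which the description does not confirm; the function name and the two sample rows are invented.</p>

```python
import csv
import io


def read_translation_units(handle):
    """Return a list of (spanish, english) pairs from a two-column CSV."""
    units = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        units.append((row[0].strip(), row[1].strip()))
    return units


# In-memory stand-in for the real file (invented rows).
sample = io.StringIO(
    '"El beneficio neto aumentó un 10 %.","Net profit rose by 10%."\n'
    '"La deuda se redujo, según lo previsto.","Debt fell, as planned."\n'
)
units = read_translation_units(sample)
```

<p>With the downloaded file, the same function can be called on a handle opened with open(path, newline="", encoding="utf-8"), where the path and the encoding are again assumptions to check against the actual file.</p>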




Automatic financial term extractor

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Moreno-Sandoval, Antonio
  • Porta, Jordi
  • Carbajo-Coronado, Blanca
<p>The creation of this dataset is framed in the Spanish national project CLARA-FINT. The aim of this task within the project was to create an automatic financial term extractor for Spanish. The first step was to apply linguistic annotation to texts, namely annual reports from the main Spanish listed companies in the IBEX 35 index. The next step used these annotations to fine-tune a model for the financial term extraction task. This dataset contains the fine-tuned model, i.e., the automatic extractor. It is described in the paper PORTA-ZAMORANO, J., CARBAJO-CORONADO, B., MORENO-SANDOVAL, A. (2024).</p>
<p>It is a bert-multilingual model that was fine-tuned for the financial term extraction task. Texts annotated by linguists, in which financial terms were highlighted within their context, were used as training data for fine-tuning.</p>




List of financial terms

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Carbajo-Coronado, Blanca
  • Moreno-Sandoval, Antonio
  • Porta, Jordi
<p>The creation of this dataset is framed in the Spanish national project CLARA-FINT. It is a dataset drawn from financial texts in the annual reports of the main Spanish listed companies. These reports are usually publicly available in the shareholders section of each company's website. The work explores the creation of a gold standard manually annotated by linguists and its subsequent use for fine-tuning a language model that extracts further terms automatically. These new terms are then validated by humans and incorporated into the definitive list of terms. There are 13,958 terms in total.</p>
<p>A list of terms in TXT format, with one term per line. It contains financial terms extracted from the annual reports of the main Spanish companies listed in the IBEX 35 index. First, terms are extracted manually through annotation, which results in a gold standard. Second, this gold standard is used to fine-tune a language model, which generates new terms that are added to the definitive list after human validation. The list contains 13,958 terms in total.</p>
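<p>A one-term-per-line list lends itself to simple dictionary lookup. The sketch below loads such a list into a set and finds listed terms in a text by greedy longest match over whitespace tokens. The function names, the four-token limit and the sample terms are assumptions for illustration, and punctuation handling is deliberately omitted.</p>

```python
def load_terms(lines):
    """Build a lower-cased set of terms from an iterable of text lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_terms(text, terms, max_len=4):
    """Greedy longest-match lookup of listed terms in whitespace tokens."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multiword terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in terms:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


# Invented lines standing in for the TXT file.
terms = load_terms(["flujo de caja\n", "dividendo\n", "\n", "caja\n"])
hits = find_terms("el flujo de caja permitió pagar un dividendo", terms)
```

<p>Because longer candidates are tried first, "flujo de caja" is reported as one term rather than as the shorter entry "caja".</p>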




Discourse markers: Annotation guidelines

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Moreno-Sandoval, Antonio
  • Porta, Jordi
  • García Toro, Ana
<p>This work is framed in the Spanish national project CLARA-FINT. The aim of this task within the project was to create an automatic discourse marker extractor for Spanish. The first step was to write these annotation guidelines so that linguistic annotation could be applied to texts containing such markers. The next step used these annotations to fine-tune a model for the discourse marker extraction task. This publication contains the annotation guidelines.</p>
<p>It is a document that contains the rules and concepts necessary for linguistic annotation of discourse markers (annotation guidelines).</p>




The financial narrative summarisation shared task (FNS 2022 & 2023): Datasets

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Moreno-Sandoval, Antonio
  • Carbajo-Coronado, Blanca
<p>Financial Narrative Processing (FNP) is a series of workshops organized by Lancaster University at international NLP conferences to address various aspects of the automatic processing of financial narratives, including automatic summarization. The LLI-UAM participated in 2022 and 2023 by creating Spanish-language datasets for the FNS shared task (a shared task evaluates AI systems on the same dataset in order to compare different approaches).</p>
<p>The dataset consists of complete annual reports from companies, chairmen's letters (which are considered summaries of the reports), and a version created by linguists that consists of a summary of the chairmen's letters in fewer than 1,000 words. Based on the dataset, participants train their models to generate summaries similar to the chairman's letter or the simplified version for new evaluation reports that were not shared during training. The evaluation is conducted using the ROUGE metric.</p>
<p>The dataset is composed of 262 financial reports taken from the FinT-esp corpus. The reports were originally in PDF format and were converted into plain text, removing tables, footnotes and headers and retaining only the narrative content. The length of the reports ranges from 40 to 400 pages, with an average of 36,285 words. A total of 262 chairman's letters were extracted, and an additional 262 summary documents were created, each containing fewer than 1,000 words. This publication covers the dataset from the 2022 and 2023 competitions.</p>
<p>These are txt files containing the full reports, their respective chairmen's letters, and the summaries of these letters. They belong to The Financial Narrative Summarisation Shared Task (2022 and 2023).</p>
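<p>The ROUGE evaluation mentioned above compares a generated summary with a reference by n-gram overlap. The sketch below is a simplified ROUGE-1 (unigram overlap, whitespace tokens, no stemming) and is not the official scorer; the ROUGE variants and settings used in the shared task are not stated here, and the function name and sentences are invented.</p>

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: unigram precision, recall and F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented system summary and reference summary.
scores = rouge1(
    "la empresa aumentó sus ingresos",
    "la empresa aumentó sus ingresos y redujo su deuda",
)
```

<p>In this example every word of the candidate appears in the reference, so precision is 1, while recall is lower because the candidate omits the second half of the reference.</p>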




Guía de anotación de terminología financiera (FINTERM)

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño
  • Carbajo-Coronado, Blanca
  • Moreno-Sandoval, Antonio
<p>The creation of these annotation guidelines is framed in the Spanish national project CLARA-FINT. The guidelines establish indications and rules for creating a dataset made up of financial texts from the annual reports of the main Spanish listed companies. These reports are usually publicly available in the shareholders section of each company's website. The work explores the creation, by linguists using these guidelines, of a manually annotated gold standard.</p>
<p>Annotation guidelines. It is a PDF file containing a description of the task with some theory, along with a number of rules, cases, exceptions, etc. It was written to serve as a reference while annotating financial terms. The criteria were established based on the task.</p>