COMPUTATIONAL LINGUISTICS METHODS FOR THE READABILITY AND AUTOMATIC SIMPLIFICATION OF MEDICAL DISCOURSE

PID2020-116001RA-C33

Funding agency name: Agencia Estatal de Investigación
Funding agency acronym: AEI
Programme: Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Subprogramme: Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Call: Proyectos I+D
Call year: 2020
Management unit: Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020
Beneficiary centre: AGENCIA ESTATAL CONSEJO SUPERIOR DE INVESTIGACIONES CIENTIFICAS (CSIC)
Persistent identifier: http://dx.doi.org/10.13039/501100011033

Publications

Total results (including duplicates): 11

Building a comparable corpus and a benchmark for Spanish medical text simplification

RUA. Repositorio Institucional de la Universidad de Alicante
  • Campillos Llanos, Leonardo
  • Terroba Reinares, Ana R.
  • Zakhir Puig, Sofía
  • Valverde, Ana
  • Capllonch-Carrión, Adrián
We report the collection of the CLARA-MeD comparable corpus, which comprises 24 298 pairs of professional and simplified texts in the medical domain for the Spanish language (>96M tokens). Text types include drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words), abstracts of systematic reviews (8138 pairs of texts, >9M words), cancer-related information summaries (201 pairs of texts, >3M tokens) and clinical trial announcements (5748 pairs of texts, 451 690 words). We also report the alignment of professional and simplified sentences, conducted manually by pairs of annotators. A subset of 3800 sentence pairs (149 862 tokens) was aligned, each pair by two experts, with an average inter-annotator agreement kappa score of 0.839 (0.076). The data are available to the community and provide a new benchmark for developing and evaluating automatic medical text simplification systems.
Project CLARA-MED (PID2020-116001RA-C33) funded by MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
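The agreement figure above is a kappa score. As a minimal sketch, assuming the two-annotator case is measured with Cohen's kappa (the record does not name the variant), the statistic can be computed as follows. The label lists are hypothetical and are not taken from the corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgements: does a candidate sentence pair align (1) or not (0)?
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when label distributions are skewed.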




CLARA-MeD corpus

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Zakhir Puig, Sofía
  • Valverde Mateos, Ana
  • Capllonch Carrión, Adrián
A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish: 1) drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) clinical trial announcements (5748 pairs of texts, 451 690 tokens).
The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens), released as a benchmark for medical text simplification. The latest download of files was in February 2022.
This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish and at conducting experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
Folders:
1) Comparable corpus: the data of each source can be found in the corresponding folder. Each folder contains two subfolders:
   - source: professional, specialized texts (".src" file extension)
   - target: simplified texts (".trg" file extension)
2) Aligned sentences: these can be found in the file "aligned.tsv" (in folder "aligned").
The folder structure is as follows:
   - aligned/
       - aligned.tsv
   - cima/
       - source/
       - target/
   - eudract/
       - source/
       - target/
   - nci/
       - source/
       - target/
Peer reviewed
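The source/target folder layout described in this record lends itself to a small loader. The sketch below is illustrative only: it assumes that a professional file "X.src" in source/ is paired with a simplified file "X.trg" in target/ that shares its base name, which the record does not state explicitly. The function name `iter_text_pairs` is hypothetical.

```python
from pathlib import Path

def iter_text_pairs(source_root):
    """Yield (professional_text, simplified_text) for one corpus source folder.

    ASSUMPTION (not stated in the record): "X.src" in source/ is paired with
    "X.trg" in target/, i.e. both files share the same base name.
    """
    root = Path(source_root)
    for src_path in sorted((root / "source").glob("*.src")):
        trg_path = root / "target" / (src_path.stem + ".trg")
        if trg_path.exists():  # skip files without a simplified counterpart
            yield (src_path.read_text(encoding="utf-8"),
                   trg_path.read_text(encoding="utf-8"))
```

A call such as `list(iter_text_pairs("eudract"))` would then return the text pairs of one source, provided the dataset has been unpacked into the working directory.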




Medical Lexicon for Spanish (MedLexSp) [DATASET]

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
File list:
- MedLexSp.dsv: a delimiter-separated value file with the following data fields: field 1, the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part-of-speech; field 5, the semantic type(s); and field 6, the semantic group.
- MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd).
- Lexical Record files (in subfolder "LR/"):
  · LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms.
  · LR_affix.dsv: equivalences between affixes/roots and their meanings.
  · LR_n_v.dsv: list of deverbal nouns.
  · LR_adj_n.dsv: list of adjectives derived from nouns.
- spaCy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
- Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt
MedLexSp is a unified medical lexicon for Medical Natural Language Processing in Spanish. It includes terms and inflected word forms with part-of-speech information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs): 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus.
Spain, Latin America and United States of America (data from MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms).
This dataset was collected with the goal of creating a lexical resource for medical text processing in the Spanish language, during the NLPMedTerm project, funded by the European Union’s Horizon 2020 research programme under the Marie Skłodowska-Curie grant agreement no. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: "Proyectos I+D+i Retos Investigación".
Peer reviewed
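The six fields listed for MedLexSp.dsv can be read into simple records. This is a hedged sketch: the record does not say which character delimits the fields or how the variant forms are separated inside field 3, so the defaults `"|"` and `";"` below are placeholders to be adjusted to the real file. `LexEntry` and `read_medlexsp` are hypothetical names, and the sample line is invented.

```python
import csv
import io
from typing import NamedTuple

class LexEntry(NamedTuple):
    cui: str
    lemma: str
    variants: list
    pos: str
    semantic_types: str
    semantic_group: str

def read_medlexsp(handle, delimiter="|", variant_sep=";"):
    """Parse the six documented fields of MedLexSp.dsv.

    ASSUMPTIONS: the field delimiter and the separator used inside the
    variant-forms field are not given in the record; both are placeholders.
    """
    entries = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != 6:
            continue  # skip malformed or empty lines
        cui, lemma, variants, pos, sem_types, sem_group = row
        entries.append(LexEntry(cui, lemma, variants.split(variant_sep),
                                pos, sem_types, sem_group))
    return entries

# Invented sample line for illustration only (not taken from the lexicon).
sample = io.StringIO("C0000000|término|término;términos|N|T000|GRP\n")
entries = read_medlexsp(sample)
# Map every inflected form to its lemma, e.g. for lookup-based lemmatization.
form_to_lemma = {form: e.lemma for e in entries for form in e.variants}
```

Indexing by inflected form, as in `form_to_lemma`, is one simple way to use such a lexicon for lemma lookup; the released spaCy and Stanza lemmatizers may work differently.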




MedLexSp – a medical lexicon for Spanish medical natural language processing

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
This article is licensed under CC BY 4.0.
[Background] Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, and linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, there is no such type of resource for Spanish.
[Construction and content] This article describes a unified medical lexicon for Medical Natural Language Processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System® (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp: first, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials; second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default spaCy and Stanza Python libraries.
[Conclusions] The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the spaCy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the spaCy and Stanza lemmatizers enriched with medical terms, are provided in a public repository.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has been done under the NLPMedTerm project, funded by the European Union’s Horizon 2020 research program under the Marie Skłodowska-Curie grant agreement no. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
Peer reviewed




CLARA-MeD simplified sentences

Digital.CSIC. Repositorio Institucional del CSIC
  • Bartolomé Rodríguez, Rocío
  • Terroba Reinares, Ana Rosa
  • Campillos-Llanos, Leonardo
A collection of 1200 pairs of technical and simplified sentences in two versions:

- Syntactically simplified sentences.
- Sentences with syntactic and lexical simplification.

This is a benchmark for medical text simplification in Spanish.
1) claramed_synt_simp_aligned.tsv: file with the 1200 sentences (original, syntactic simplification, and syntactic and lexical simplification).
2) CLARA-MeD_simplif_guideline.pdf: annotation guideline with linguistic criteria.
The TSV file has the following fields: 1) File ID: EudraCT code. 2) Source: specialized sentence. 3) Syntactic simplification: a simplified sentence with syntax-level operations. 4) Syntactic and lexical simplification: a fully simplified sentence.
[Description of methods used for collection/generation of data] The methods are explained in the following article: Leonardo Campillos-Llanos, Ana Rosa Terroba Reinares, Rocío Bartolomé Rodríguez (2022) "Enhancing the understanding of clinical trials with a sentence-level simplification dataset". Procesamiento del lenguaje natural, no. 72.
[Methods for processing the data] Manual revision of technical sentences and simplification according to the criteria defined in the companion guideline.
This dataset contains 1200 manually simplified sentences (144 019 tokens) from clinical trials in Spanish. A total of 1040 announcements from the European Clinical Trials Register (EudraCT) were analyzed to select sentences with ambiguities or exceeding 25 words. Simplification criteria were devised in an annotation guideline, which is released publicly along with the dataset.
This resource was collected in the CLARA-MeD project (PID2020-116001RA-C33), with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish and at conducting experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
Peer reviewed
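The four TSV fields and the 25-word selection criterion mentioned in this record can be illustrated with a short reader. This is a sketch under stated assumptions: whether the file starts with a header row is not documented here, so it is a parameter, and counting words by whitespace is our simplification of whatever counting method the authors used. The demo row is invented.

```python
import csv
import io

def read_rows(handle, has_header=False):
    """Return (file_id, source, syntactic, syntactic_lexical) tuples.

    ASSUMPTION: the record does not say whether the TSV has a header row,
    so this is left to the caller.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [tuple(r) for r in reader if len(r) == 4]
    return rows[1:] if has_header else rows

def exceeds_word_limit(sentence, limit=25):
    """Length criterion from the description, using whitespace tokens
    (the authors' exact word-counting method is an assumption)."""
    return len(sentence.split()) > limit

# Invented row for illustration; real rows come from claramed_synt_simp_aligned.tsv.
demo = io.StringIO("0000-000000-00\tfrase técnica\tfrase simple\tfrase muy simple\n")
rows = read_rows(demo)
```

With the real file, `open("claramed_synt_simp_aligned.tsv", encoding="utf-8")` would replace the in-memory `demo` handle.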




SimpMedLexSp (Simple Medical Lexicon for Spanish)

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Capllonch Carrión, Adrián
  • Terroba Reinares, Ana Rosa
  • Valverde Mateos, Ana
  • Hernando-Tundidor, Marisol
  • Mostazo-Fernández, Yara
Links to other publicly accessible locations of the data: https://github.com/lcampillos/CLARA-MeD (Accessed: 12/2/2024).
A medical lexicon of 14 013 pairs of technical word forms and the corresponding simplified synonym or definition. It is aimed at automatic text simplification in Spanish.
A subset of the lexicon (4642 term entries) was also normalized to Unified Medical Language System (UMLS) concept unique identifiers (CUIs). Note that the number of inflected forms was reduced after revision by experts, with regard to the version used in the published article (Campillos-Llanos et al. 2024).
This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish and at conducting experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
File List:
- simpmedlexsp.dsv: a delimiter-separated value file, with the following data fields:
  • Field 1: Unified Medical Language System (UMLS) CUI of the entity
  • Field 2: Term entry
  • Field 3: Simplified synonym or definition
  • Field 4: UMLS semantic type(s)
  • Field 5: Semantic group
- simpmedlexsp_forms.dsv: inflected forms (conjugated verbs and gender/plural variants) and terms not normalized to UMLS CUIs:
  • Field 1: Term entry
  • Field 2: Simplified synonym or definition
Peer reviewed
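A lexicon of term/simplified-synonym pairs like this one can drive a simple replacement step. The following is a minimal sketch and not the project's released code: the function name and the two toy entries are invented for illustration, and the real pairs would be loaded from the "Term entry" and "Simplified synonym or definition" fields.

```python
import re

def simplify_with_lexicon(sentence, lexicon):
    """Replace technical terms with their simplified synonym.

    Matching is case-insensitive and whole-word; longer terms are tried
    first so that multiword terms win over shorter terms they contain.
    """
    if not lexicon:
        return sentence
    lookup = {term.lower(): simple for term, simple in lexicon.items()}
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(map(re.escape, terms)) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(1).lower()], sentence)

# Invented entries (term -> simplified synonym), for illustration only.
toy_lexicon = {"cefalea": "dolor de cabeza",
               "hipertensión arterial": "tensión alta"}
result = simplify_with_lexicon("Paciente con cefalea e hipertensión arterial.",
                               toy_lexicon)
print(result)
```

Plain substitution keeps the meaning close to the source but ignores agreement and context (here the conjunction "e" is left in front of "tensión"), so the output of such a step may need a fluency pass.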




Enhancing the understanding of clinical trials with a sentence-level simplification dataset

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Bartolomé, Rocío
  • Terroba Reinares, Ana R.
This article is licensed under CC BY-NC-ND 4.0.
We introduce a dataset with 1200 manually simplified sentences (144 019 tokens) from clinical trials in Spanish. A total of 1040 announcements from the European Clinical Trials Register (EudraCT) were analyzed to select sentences with ambiguities or exceeding 25 words. Simplification criteria were devised in an annotation guideline, which is released publicly along with the dataset. We obtained two versions: syntactically simplified sentences, and sentences with syntactic and lexical simplification. We report a quantitative, a qualitative and a human evaluation, in which three independent evaluators assessed the grammaticality/fluency, semantic adequacy and overall simplification. Results show that the resource is suitable for advancing research on automatic simplification of medical texts.
Peer reviewed




Replace, Paraphrase or Fine-tune? Evaluating Automatic Simplification for Medical Texts in Spanish

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Bartolomé, Rocío
  • Valverde, Ana
  • González, Cristina
  • Capllonch Carrión, Adrián
  • Heras, Jónathan
This chapter is licensed under CC BY-NC 4.0.
Patients cannot always completely understand medical documents, given the myriad of technical terms they contain. Automatic text simplification techniques can help, but they must guarantee that the content is transmitted rigorously, without introducing incorrect information. In this work, we tested: 1) lexicon-based simplification approaches, using a Spanish lexicon of technical and lay terms collected for this task (SimpMedLexSp); 2) deep-learning (DL) based methods, with BART-based and prompt-learning-based models; and 3) a combination of both techniques. As a test set, we used 5000 parallel (technical and lay) sentence pairs: 3800 manually aligned sentences from the CLARA-MeD corpus, and 1200 sentences from clinical trials simplified by linguists. We conducted a quantitative evaluation with standard measures (BLEU, ROUGE and SARI) and a human evaluation, in which eleven subjects scored the simplification output of several methods. In our experiments, the lexicon improved the quantitative results when combined with the DL models. The sentences simplified using only the lexicon received the highest scores for semantic adequacy; however, their fluency needs to be improved. The prompt-based method obtained similar ratings in this aspect and in simplification. We make available the models and the data to reproduce our results.
This work was done in project CLARA-MeD (PID2020-116001RA-C33) funded by MCIN/AEI/10.13039/501100011033/, in call: "Proyectos I+D+i Retos Investigación"; and also partially supported by Grant PID2020-115225RBI00 funded by MCIN/AEI/10.13039/501100011033.
Peer reviewed
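To give an idea of what the overlap-based measures named in this abstract capture, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is only a sketch: it is not the evaluation code used in the chapter, it tokenizes on whitespace (an assumption), and real evaluations normally rely on established implementations of BLEU, ROUGE and SARI.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a system output and a reference sentence."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped matches: a word counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Scores of this kind reward lexical overlap with the reference, which is one reason the abstract pairs them with a human evaluation of adequacy and fluency.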




Cross-Linguistic Disease and Drug Detection in Cardiology Clinical Texts: Methods and Outcomes

Digital.CSIC. Repositorio Institucional del CSIC
  • Styll, Patrick
  • Campillos-Llanos, Leonardo
  • Kusa, Wojciech
  • Hanbury, Allan
This paper presents our approach to the MultiCardioNER lab at CLEF 2024, focusing on disease detection in Spanish texts and drug detection in Italian, Spanish, and English texts. We enhance model performance through several strategies: (1) fine-tuning on automatically translated TREC Clinical Trials admission notes using Masked Language Modeling (MLM); (2) data augmentation with translated MTSamples processed through a Spanish medical lexicon (MedLexSp) for accurate vocabulary matching; and (3) employing sliding windows with overlap to improve data capture. Additionally, we use transfer learning with a clinical trials corpus (CT-EMB-SP) to refine the outcomes. We further fine-tune several already established disease and drug extraction models to leverage their extensive vocabulary and compare their performance to models trained from scratch. Our methods and experiments demonstrate notable improvements in multilingual clinical NER, as evidenced by our track results.
Leonardo Campillos-Llanos’ work is conducted in the CLARA-MeD project (PID2020-116001RA-C33), funded by MICIU/AEI/10.13039/501100011033/, in call “Proyectos I+D+i Retos Investigación”.
Peer reviewed
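The sliding-window strategy mentioned in this abstract can be sketched as follows. The window size and overlap are illustrative defaults and are not the values used by the authors. The function name is hypothetical.

```python
def sliding_windows(tokens, size=128, overlap=32):
    """Split a token list into windows of `size` tokens that share
    `overlap` tokens with the previous window.

    Returns (start_offset, window_tokens) pairs so that predictions can be
    mapped back to positions in the full document.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows
```

The overlap means an entity cut by one window boundary appears whole in the neighbouring window, at the cost of processing some tokens twice.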




Entity normalization in a Spanish medical corpus using a UMLS‑based lexicon: findings and limitations

Digital.CSIC. Repositorio Institucional del CSIC
  • Báez, Pablo
  • Campillos-Llanos, Leonardo
  • Núñez, Fredy
  • Dunstan, Jocelyn
Entity normalization is a common strategy to resolve ambiguities by mapping all the synonym mentions to a single concept identifier in standard terminology. Normalizing medical entities is challenging, especially for languages other than English, where lexical variation is considerably under-represented. Here, we report a new linguistic resource for medical entity normalization in Spanish. We applied a UMLS-based medical lexicon (MedLexSp) to automatically normalize mentions from 2000 medical referrals of the Chilean Waiting List Corpus. Three medical students manually revised the automatic normalization. The inter-coder agreement was computed, and the distribution of concepts, errors, and linguistic sources of variation was analyzed. The automatic method normalized 52% of the mentions, compared to 91% after manual revision. The lowest agreement between automatic and automatic-manual normalization was observed for Finding, Disease, and Procedure entities. Errors in normalization were associated with ortho-typographic, semantic, and grammatical linguistic inadequacies, mainly of the hyponymy/hyperonymy, polysemy/metonymy, and acronym-abbreviation types. This new resource can enrich dictionaries and lexicons with new mentions to improve the functioning of modern entity normalization methods. The linguistic analysis offers insight into the sources of lexical variety in the Spanish clinical environment related to error generation using lexicon-based normalization methods. This article also introduces a workflow that can serve as a benchmark for comparison in studies replicating our analysis in Romance languages.
This study was supported by ANID Postdoctoral FONDECYT 3210395 and FONDECYT 11201250, Basal Funds for Center of Excellence FB210005 (CMM), Millennium Science Initiative Program ICN2021_004 (iHealth) and ICN17_002 (IMFD), and partly supported by the CLARA-MeD project (PID2020-116001RA-C33, funded by MICIU/AEI/10.13039/501100011033/ in call “Retos de investigación”).
Peer reviewed




CLARA-MeD Tool – A System to Help Patients Understand Clinical Trial Announcements and Consent Forms in Spanish

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Ortega Riba, Federico
  • Terroba Reinares, Ana Rosa
  • Valverde Mateos, Ana
  • Capllonch Carrión, Adrián
We present an NLP web-based tool to help users understand consent forms (CFs) and clinical trial announcements (CTAs) in Spanish. For complex word identification, we collected: 1) a lexicon of technical terms and simplified synonyms (14 465 entries); and 2) a glossary (70 547 terms) with explanations from sources such as UMLS, the NCI dictionary, Orphadata or the FDA. For development, we extracted entities from 60 CTAs, 60 CFs and 60 patient information documents (PIDs). To prepare definitions for new terms, we used ChatGPT and experts validated them (28.99% needed to be fixed). We tested the system on 15 new CTAs, 15 CFs, and 15 PIDs, and we achieved an average F1 score of 82.91% (strict match) and of 94.65% (relaxed). The tool is available at: http://claramed.csic.es/demo.
CLARA-MeD project (PID2020-116001RA-C33) funded by AEI/MICIU/10.13039/501100011033/.
Peer reviewed
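The difference between strict and relaxed F1 can be made concrete with a short sketch. The definitions used here are assumptions, since the record does not spell them out: spans are end-exclusive (start, end) character offsets, a strict match requires identical boundaries, and a relaxed match requires any overlap. The function name is hypothetical, and predictions are assumed to contain no duplicate spans.

```python
def f1_scores(gold, predicted):
    """Return (strict_f1, relaxed_f1) for lists of (start, end) spans."""
    def f1(tp_pred, tp_gold):
        precision = tp_pred / len(predicted) if predicted else 0.0
        recall = tp_gold / len(gold) if gold else 0.0
        total = precision + recall
        return 2 * precision * recall / total if total else 0.0

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    gold_set = set(gold)
    # Strict: the predicted span must equal a gold span exactly.
    strict_tp = sum(p in gold_set for p in predicted)
    # Relaxed: a prediction counts if it overlaps any gold span, and a gold
    # span counts as found if any prediction overlaps it.
    relaxed_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    relaxed_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    return f1(strict_tp, strict_tp), f1(relaxed_pred, relaxed_gold)

# Hypothetical spans: one exact hit, one partial overlap, one spurious span.
gold_spans = [(0, 5), (10, 20)]
predicted_spans = [(0, 5), (12, 20), (30, 35)]
strict, relaxed = f1_scores(gold_spans, predicted_spans)
```

In this toy case the partially overlapping span counts only under the relaxed criterion, which is why relaxed scores are never lower than strict ones.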