METODOS DE LA LINGUISTICA COMPUTACIONAL PARA LA LEGIBILIDAD Y SIMPLIFICACION AUTOMATICA DEL DISCURSO MEDICO

PID2020-116001RA-C33

Funding agency name: Agencia Estatal de Investigación
Funding agency acronym: AEI
Programme: Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Subprogramme: Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Call: Proyectos I+D
Call year: 2020
Management unit: Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020
Beneficiary centre: AGENCIA ESTATAL CONSEJO SUPERIOR DE INVESTIGACIONES CIENTIFICAS (CSIC)
Persistent identifier: http://dx.doi.org/10.13039/501100011033

Publications


Building a comparable corpus and a benchmark for Spanish medical text simplification

RUA. Repositorio Institucional de la Universidad de Alicante
  • Campillos Llanos, Leonardo
  • Terroba Reinares, Ana R.
  • Zakhir Puig, Sofía
  • Valverde, Ana
  • Capllonch-Carrión, Adrián
We report the collection of the CLARA-MeD comparable corpus, which is made up of 24 298 pairs of professional and simplified texts in the medical domain for the Spanish language (>96M tokens). Text types comprise drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words), abstracts of systematic reviews (8138 pairs of texts, >9M words), cancer-related information summaries (201 pairs of texts, >3M tokens) and clinical trial announcements (5748 pairs of texts, 451 690 words). We also report the alignment of professional and simplified sentences, conducted manually by pairs of annotators. A subset of 3800 sentence pairs (149 862 tokens) was aligned, each pair by 2 experts, with an average inter-annotator agreement kappa score of 0.839 (0.076). The data are available to the community and contribute a new benchmark for developing and evaluating automatic medical text simplification systems.
Project CLARA-MED (PID2020-116001RA-C33) funded by MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
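
For readers unfamiliar with the agreement figure above, the following is a minimal sketch of how a kappa score can be computed for two annotators. It assumes Cohen's kappa over categorical labels; the record does not state which kappa variant or which label set the authors used, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: this is only an illustration of the statistic, not the
    procedure used to obtain the score reported for CLARA-MeD.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): 1 = "sentences aligned", 0 = "not aligned".
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

In the toy example the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.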




CLARA-MeD corpus

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Zakhir Puig, Sofía
  • Valverde Mateos, Ana
  • Capllonch Carrión, Adrián
A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish: 1) Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) Cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) Clinical trial announcements (5748 pairs of texts, 451 690 tokens).
The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens), released as a benchmark for medical text simplification. The latest download of files was in February 2022.
This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish, and at conducting experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.

Folders:
1) Comparable corpus: the data of each source can be found in the corresponding folder. Each folder contains two subfolders:
   - source: professional, specialized texts (".src" file extension)
   - target: simplified texts (".trg" file extension)
2) Aligned sentences: these can be found in the file "aligned.tsv" (in folder "aligned").

The folder structure is as follows:
- aligned/
  - aligned.tsv
- cima/
  - source/
  - target/
- eudract/
  - source/
  - target/
- nci/
  - source/
  - target/

Peer reviewed
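
The source/target layout described above lends itself to a small pairing routine. The sketch below is only an illustration: it assumes that a professional text and its simplified counterpart share the same base file name (for example a hypothetical "doc1.src" and "doc1.trg"), which the record does not confirm. The function name is hypothetical.

```python
from pathlib import Path


def pair_corpus_files(source_root):
    """Pair ".src" files with ".trg" files inside one corpus folder.

    Assumption (not stated in the dataset record): paired files share
    the same base name in the "source" and "target" subfolders.
    """
    root = Path(source_root)
    # Index simplified texts by base name for constant-time lookup.
    targets = {path.stem: path for path in (root / "target").glob("*.trg")}
    pairs = []
    for src in sorted((root / "source").glob("*.src")):
        trg = targets.get(src.stem)
        if trg is not None:  # skip professional texts without a counterpart
            pairs.append((src, trg))
    return pairs


if __name__ == "__main__":
    # "cima" is one of the folders listed in the record; the path is relative
    # to wherever the dataset was unpacked. Nothing is printed if the folder
    # is absent.
    for src, trg in pair_corpus_files("cima"):
        print(src.name, "->", trg.name)
```

Unmatched files are skipped silently here; a stricter variant could collect and report them instead.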




Medical Lexicon for Spanish (MedLexSp)

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
File list:
- MedLexSp.dsv: a delimiter-separated value file, with the following data fields: field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part-of-speech; field 5, the semantic type(s); and field 6, the semantic group.
- MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd).
- Lexical Record files (in subfolder "LR/"):
  - LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms.
  - LR_affix.dsv: equivalences between affixes/roots and their meanings.
  - LR_n_v.dsv: list of deverbal nouns.
  - LR_adj_n.dsv: list of adjectives derived from nouns.
- spaCy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
- Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt

Companion code and files can be found in the GitHub repository: https://github.com/lcampillos/MedLexSp

MedLexSp is a unified medical lexicon for Medical Natural Language Processing in Spanish. It includes terms and inflected word forms with part-of-speech information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). In total, it contains 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus.

Spain, Latin America and United States of America (data from MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms).

This dataset was collected during the NLPMedTerm project and the CLARA-MeD project, with the goal of creating a lexical resource for medical text processing in the Spanish language. The NLPMedTerm project was funded by the European Union’s Horizon 2020 research programme under the Marie Skłodowska-Curie grant agreement no. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33) by MCIN/AEI/10.13039/501100011033/, in project call: "Proyectos I+D+i Retos Investigación".

Peer reviewed
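
The six-field layout of MedLexSp.dsv described above can be read with the standard library. This is a minimal sketch under stated assumptions: the record names the fields but not the delimiter, so the "|" field delimiter and the ";" separator between variant forms are guesses exposed as parameters, and the sample row is a made-up placeholder rather than real lexicon content.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class LexiconEntry:
    cui: str              # field 1: UMLS CUI of the entity
    lemma: str            # field 2: lemma
    variants: List[str]   # field 3: variant forms
    pos: str              # field 4: part-of-speech
    semantic_types: str   # field 5: semantic type(s)
    semantic_group: str   # field 6: semantic group


def read_medlexsp(handle, delimiter="|", variant_sep=";"):
    """Parse MedLexSp.dsv rows into LexiconEntry objects.

    Assumptions: `delimiter` and `variant_sep` are not documented in the
    record and may need adjusting after inspecting the actual file.
    """
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        if len(row) != 6:  # ignore blank or malformed lines
            continue
        cui, lemma, variants, pos, types, group = row
        forms = variants.split(variant_sep) if variants else []
        entries.append(LexiconEntry(cui, lemma, forms, pos, types, group))
    return entries


# Placeholder row (hypothetical values, not taken from the lexicon).
sample = io.StringIO("CUI_PLACEHOLDER|lemma|form_a;form_b|N|type|GROUP\n")
entries = read_medlexsp(sample)
print(entries[0].lemma, entries[0].variants)
```

With a real copy of the file, the same function can be called on `open("MedLexSp.dsv", encoding="utf-8", newline="")`.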




MedLexSp – a medical lexicon for Spanish medical natural language processing

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

[Background] Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, and linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, there is no resource of this type for Spanish.

[Construction and content] This article describes a unified medical lexicon for Medical Natural Language Processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System® (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp: first, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials; second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default spaCy and Stanza Python libraries.

[Conclusions] The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the spaCy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the spaCy and Stanza lemmatizers enriched with medical terms, are provided in a public repository.

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has been done under the NLPMedTerm project, funded by the European Union’s Horizon 2020 research program under the Marie Skłodowska-Curie grant agreement no. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.

Peer reviewed




CLARA-MeD simplified sentences

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Bartolomé Rodríguez, Rocío
A collection of 1200 pairs of technical and simplified sentences in two versions:

- Syntactically simplified sentences.
- Sentences with syntactic and lexical simplification.

This is a benchmark for medical text simplification in Spanish.

Files:
1) claramed_synt_simp_aligned.tsv: TSV file with the 1200 sentences (original, syntactic simplification, and syntactic and lexical simplification). It has the following fields:
   - Field 1, File ID: EudraCT code.
   - Field 2, Source: specialized sentence.
   - Field 3, Syntactic simplification: a simplified sentence with syntax-level operations.
   - Field 4, Syntactic and lexical simplification: a fully simplified sentence.
2) CLARA-MeD_simplif_guideline.pdf: annotation guideline with linguistic criteria for simplification.

[Description of methods used for collection/generation of data] The methods are explained in the following article: Leonardo Campillos-Llanos, Ana Rosa Terroba Reinares, Rocío Bartolomé Rodríguez (2022) "Enhancing the understanding of clinical trials with a sentence-level simplification dataset". Procesamiento del lenguaje natural, nº 72.

[Methods for processing the data] Manual revision of technical sentences and simplification according to the criteria defined in the companion guideline.

This dataset contains 1200 manually simplified sentences (144 019 tokens) from clinical trials in Spanish. A total of 1040 announcements from the European Clinical Trials Register (EudraCT) were analyzed to select sentences with ambiguities or exceeding 25 words. Simplification criteria were devised in an annotation guideline, which is released publicly along with the dataset.

This resource was collected in the CLARA-MeD project (PID2020-116001RA-C33), with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish, and at conducting experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.

Peer reviewed
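
As an illustration of the four-field TSV layout described for claramed_synt_simp_aligned.tsv, the sketch below reads each row into a named record. It assumes tab-separated rows without quoting and without a header row; the record confirms neither, so a header, if present, would need to be skipped. The class and function names and the sample row are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class SimplifiedSentence(NamedTuple):
    file_id: str            # field 1: EudraCT code
    source: str             # field 2: specialized sentence
    syntactic: str          # field 3: syntactic simplification
    syntactic_lexical: str  # field 4: syntactic and lexical simplification


def read_simplified(handle):
    """Read rows of the four-field TSV into SimplifiedSentence records.

    Assumption: no header row and no quoted fields; adjust if the actual
    file differs.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [SimplifiedSentence(*row) for row in reader if len(row) == 4]


# Placeholder row (hypothetical values, not taken from the dataset).
sample = io.StringIO("ID_PLACEHOLDER\toriginal\tsyntactic\tfull\n")
for record in read_simplified(sample):
    print(record.file_id, "|", record.syntactic_lexical)
```

Keeping both simplification levels in one record makes it straightforward to evaluate a system against either the syntactic-only or the fully simplified reference.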




SimpMedLexSp (Simple Medical Lexicon for Spanish)

Digital.CSIC. Repositorio Institucional del CSIC
  • Campillos-Llanos, Leonardo
  • Capllonch Carrión, Adrián
  • Terroba Reinares, Ana Rosa
  • Valverde Mateos, Ana
  • Hernando-Tundidor, Marisol
  • Mostazo-Fernández, Yara
Links to other publicly accessible locations of the data: https://github.com/lcampillos/CLARA-MeD (Accessed: 12/2/2024).

A medical lexicon of 14 013 pairs of technical word forms and the corresponding simplified synonym or definition. It is aimed at automatic text simplification in Spanish.
A subset of the lexicon (4642 term entries) was also normalized to Unified Medical Language System (UMLS) concept unique identifiers (CUIs). Note that the number of inflected forms was reduced after revision by experts, with regard to the version used in the published article (Campillos-Llanos et al. 2024).

This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish, and at conducting experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.

File list:
- simpmedlexsp.dsv: a delimiter-separated value file, with the following data fields:
  • Field 1: Unified Medical Language System (UMLS) CUI of the entity
  • Field 2: Term entry
  • Field 3: Simplified synonym or definition
  • Field 4: UMLS semantic type(s)
  • Field 5: Semantic group
- simpmedlexsp_forms.dsv: inflected forms (conjugated verbs and gender/plural variants) and terms not normalized to UMLS CUIs:
  • Field 1: Term entry
  • Field 2: Simplified synonym or definition

Peer reviewed
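
One way a two-field lexicon such as simpmedlexsp_forms.dsv could feed a lexical simplification step is plain dictionary substitution. The sketch below is a hypothetical illustration and not the project's method. It assumes a "|" delimiter, which the record does not state, and the example entry is invented for demonstration rather than taken from the lexicon.

```python
import re


def load_forms(lines, delimiter="|"):
    """Build a {technical form: simplified text} mapping from two-field rows.

    Assumption: `delimiter` is a guess; check the actual file first.
    """
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) == 2 and parts[0]:
            mapping[parts[0].lower()] = parts[1]
    return mapping


def simplify_terms(text, mapping):
    """Replace whole-word occurrences of technical forms, longest first."""
    if not mapping:
        return text
    # Longest terms first so multiword entries win over their substrings.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)


# Illustrative entry only (not taken from the dataset).
lexicon = load_forms(["cefalea|dolor de cabeza\n"])
print(simplify_terms("El paciente refiere cefalea.", lexicon))
```

Direct substitution only suits entries whose second field is a synonym. Entries that hold a definition would more naturally be added as a parenthetical gloss, and neither case handles gender or number agreement.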