METODOS DE LA LINGUISTICA COMPUTACIONAL PARA LA LEGIBILIDAD Y SIMPLIFICACION AUTOMATICA DEL DISCURSO MEDICO

PID2020-116001RA-C33

Nombre agencia financiadora Agencia Estatal de Investigación
Acrónimo agencia financiadora AEI
Programa Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Subprograma Programa Estatal de I+D+i Orientada a los Retos de la Sociedad
Convocatoria Proyectos I+D
Año convocatoria 2020
Unidad de gestión Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020
Centro beneficiario AGENCIA ESTATAL CONSEJO SUPERIOR DE INVESTIGACIONES CIENTIFICAS (CSIC)
Identificador persistente http://dx.doi.org/10.13039/501100011033

Publicaciones

Found(s) 3 result(s)
Found(s) 1 page(s)

CLARA-MeD corpus

  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Zakhir Puig, Sofía
  • Valverde Mateos, Ana
  • Capllonch Carrión, Adrián
A collection of 24.298 pairs of professional and simplified texts (>96 million tokens): 1) Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) Cancer-related information summaries (201 pairs of texts, >3M tokens); and 2) Clinical trials announcements (5748 pairs of texts, 451 690 tokens).
The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and laymen variants (149 862 tokens). This is a benchmark for medical text simplification. The latest download of files was in February 2022., A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish.
A parallel corpus with a subset of 3800 sentence pairs of professional and laymen variants (149 862 tokens) is released as a benchmark for medical text simplification.
This dataset was collected in the CLARA-MeD project, with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patient's informed decision making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish; and conducting experiments in automatic text simplification., This dataset was collected in the CLARA-MeD project (PID2020- 116001RA-C33), with funding from the Spanish government by MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”., Folders: 1) Comparable corpus: the data of each source can be found in the corresponding folder. Each folder contains two other subfolders: - source: professional, specialized texts (".src" file extension) - target: simplified texts (".trg" file extension). 2) Aligned sentences: these can be found file "aligned.tsv" (in folder "aligned"). The folder structure is as follows: - aligned/ - aligned.tsv - cima/ - source/ - target/ - eudract/ - source/ - target/ - nci/ - source/ - target/., Peer reviewed



Medical Lexicon for Spanish (MedLexSp)

  • Campillos-Llanos, Leonardo
- MedLexSp.dsv: a delimiter-separated value file, with the following data fields: Field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part-of-speech; field 5, the semantic types(s); and field 6, the semantic group.
- MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd).
- Lexical Record files: in subfolder "LR/":
· LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms.
· LR_affix.dsv: provides the equivalence between affixes/roots and their meanings.
· LR_n_v.dsv: list of deverbal nouns.
· LR_adj_n.dsv: list of adjectives derived from nouns.
- Spacy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
- Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt, File List:
1) MedLexSp.dsv;
2) MedLexSp.xml and lmf.dtd (Document Type Definition);
3) Lexical Record files: in subfolder "LR/":
3.1) LR_abr.dsv;
3.2) LR_affix.dsv;
3.3) LR_n_v.dsv;
3.4) LR_adj_n.dsv;
4) Spacy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
5) Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt
See more information about the format below.
Companion code and files can be found in the github repository: https://github.com/lcampillos/MedLexSp, MedLexSp is an unified medical lexicon for Medical Natural Language Processing in Spanish. It includes terms and inflected word forms with part-of-speech information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases vs 10, the Anatomical Therapeutical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus.
This dataset was collected during the NLPMedTerm project and the CLARA-MeD project, with the goal of creating a lexical resource for medical text processing in the Spanish language., MedLexSp is an unified medical lexicon for Medical Natural Language Processing in Spanish. It includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs)., Spain, Latin America and United States of America (data from MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms)., This dataset was collected in the NLPMedTerm project, funded by the European Union’s Horizon 2020 research programme under the Marie Skodowska-Curie grant agreement nº. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: "Proyectos I+D+i Retos Investigación"., Peer reviewed



Medical Lexicon for Spanish (MedLexSp)

  • Campillos-Llanos, Leonardo
- MedLexSp.dsv: a delimiter-separated value file, with the following data fields: Field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part-of-speech; field 5, the semantic types(s); and field 6, the semantic group. - MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd). - Lexical Record files: in subfolder "LR/": · LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms. · LR_affix.dsv: provides the equivalence between affixes/roots and their meanings. · LR_n_v.dsv: list of deverbal nouns. · LR_adj_n.dsv: list of adjectives derived from nouns. - Spacy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py - Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt, File List: 1) MedLexSp.dsv; 2) MedLexSp.xml and lmf.dtd (Document Type Definition); 3) Lexical Record files: in subfolder "LR/": 3.1) LR_abr.dsv; 3.2) LR_affix.dsv; 3.3) LR_n_v.dsv; 3.4) LR_adj_n.dsv; 4) Spacy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py 5) Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt See more information about the format below. Companion code and files can be found in the github repository: https://github.com/lcampillos/MedLexSp, MedLexSp is an unified medical lexicon for Medical Natural Language Processing in Spanish. It includes terms and inflected word forms with part-of-speech information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases vs 10, the Anatomical Therapeutical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. This dataset was collected during the NLPMedTerm project and the CLARA-MeD project, with the goal of creating a lexical resource for medical text processing in the Spanish language., MedLexSp is an unified medical lexicon for Medical Natural Language Processing in Spanish. It includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs)., Spain, Latin America and United States of America (data from MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms)., This dataset was collected in the NLPMedTerm project, funded by the European Union’s Horizon 2020 research programme under the Marie Skodowska-Curie grant agreement nº. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: "Proyectos I+D+i Retos Investigación"., Peer reviewed