Total results (including duplicates): 3
1 page found
Digital.CSIC. Repositorio Institucional del CSIC
oai:digital.csic.es:10261/269887
Dataset. 2022

CLARA-MED CORPUS

  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Zakhir Puig, Sofía
  • Valverde Mateos, Ana
  • Capllonch Carrión, Adrián
A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish:

1) Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words);
2) Cancer-related information summaries (201 pairs of texts, >3M tokens); and
3) Clinical trial announcements (5748 pairs of texts, 451 690 tokens).

The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens), released as a benchmark for medical text simplification. The latest download of files was in February 2022.

This dataset was collected in the CLARA-MeD project, with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish, and at conducting experiments in automatic text simplification.

Funding: CLARA-MeD project (PID2020-116001RA-C33), funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call "Proyectos I+D+i Retos Investigación".

Folders:

1) Comparable corpus: the data of each source can be found in the corresponding folder. Each folder contains two subfolders:
   - source: professional, specialized texts (".src" file extension)
   - target: simplified texts (".trg" file extension)
2) Aligned sentences: these can be found in the file "aligned.tsv" (in folder "aligned").

The folder structure is as follows:

   - aligned/
     - aligned.tsv
   - cima/
     - source/
     - target/
   - eudract/
     - source/
     - target/
   - nci/
     - source/
     - target/

Peer reviewed.
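The comparable-corpus layout described above (a "source" and a "target" subfolder per data source) can be traversed with a short script. The following is only a sketch: the function name `iter_text_pairs` is hypothetical, and it assumes that a ".src" file and its simplified ".trg" counterpart share the same file stem and are UTF-8 encoded, neither of which is stated in the record.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_text_pairs(corpus_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (professional, simplified) text pairs from one source folder.

    Assumptions (not confirmed by the dataset description):
    - "x.src" in source/ pairs with "x.trg" in target/ (same stem);
    - files are UTF-8 encoded.
    """
    root = Path(corpus_dir)
    for src_path in sorted((root / "source").glob("*.src")):
        trg_path = root / "target" / (src_path.stem + ".trg")
        if not trg_path.exists():
            continue  # skip texts without a simplified counterpart
        yield (
            src_path.read_text(encoding="utf-8"),
            trg_path.read_text(encoding="utf-8"),
        )
```

For example, `iter_text_pairs("eudract")` would iterate over the clinical trial announcements if the archive is unpacked in the working directory and the naming assumption holds.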

HANDLE: http://hdl.handle.net/10261/269887
DOI: https://doi.org/10.20350/digitalCSIC/14644

Digital.CSIC. Repositorio Institucional del CSIC
oai:digital.csic.es:10261/270429
Dataset. 2022

MEDICAL LEXICON FOR SPANISH (MEDLEXSP) [DATASET]

  • Campillos-Llanos, Leonardo
MedLexSp is a unified medical lexicon for Medical Natural Language Processing in Spanish. It includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). Terms and inflected word forms come with part-of-speech information and UMLS semantic types, groups and CUIs.

To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases, version 10 (ICD-10), the Anatomical Therapeutic Chemical (ATC) classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus.

Files:

- MedLexSp.dsv: a delimiter-separated value file with the following data fields: field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part-of-speech; field 5, the semantic type(s); and field 6, the semantic group.
- MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd).
- Lexical Record files, in subfolder "LR/":
  · LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms.
  · LR_affix.dsv: provides the equivalence between affixes/roots and their meanings.
  · LR_n_v.dsv: list of deverbal nouns.
  · LR_adj_n.dsv: list of adjectives derived from nouns.
- spaCy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
- Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt

The data come from Spain, Latin America and the United States of America (MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms).

This dataset was collected during the NLPMedTerm project and the CLARA-MeD project, with the goal of creating a lexical resource for medical text processing in the Spanish language.

Funding: NLPMedTerm project, funded by the European Union's Horizon 2020 research programme under the Marie Skłodowska-Curie grant agreement no. 713366 (InterTalentum UAM), and CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call "Proyectos I+D+i Retos Investigación".

Peer reviewed.
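The six-field layout of MedLexSp.dsv described above maps naturally onto a small record type. The sketch below is illustrative only: the names `LexiconEntry` and `parse_entry` are hypothetical, and the "|" delimiter is an assumption, since the record does not say which delimiter the file uses. Field 3 is kept as a raw string because the separator between variant forms is not documented here either.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    cui: str             # field 1: UMLS CUI of the entity
    lemma: str           # field 2: lemma
    variants: str        # field 3: variant forms (left unsplit)
    pos: str             # field 4: part-of-speech
    semantic_types: str  # field 5: semantic type(s)
    semantic_group: str  # field 6: semantic group


def parse_entry(line: str, delimiter: str = "|") -> LexiconEntry:
    """Split one line of the lexicon into its six fields.

    The default delimiter "|" is an assumption; pass the real one
    after inspecting the file.
    """
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return LexiconEntry(*fields)
```

Raising on a wrong field count, rather than padding or truncating, makes a wrong delimiter guess visible on the first line read.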

HANDLE: http://hdl.handle.net/10261/270429
DOI: https://doi.org/10.20350/digitalCSIC/14656

Digital.CSIC. Repositorio Institucional del CSIC
oai:digital.csic.es:10261/346579
Dataset. 2024

CLARA-MED SIMPLIFIED SENTENCES

  • Bartolomé Rodríguez, Rocío
  • Terroba Reinares, Ana Rosa
  • Campillos-Llanos, Leonardo
A collection of 1200 pairs of technical and simplified sentences in two versions:

- Syntactically simplified sentences.
- Sentences with syntactic and lexical simplification.

This is a benchmark for medical text simplification in Spanish. The dataset contains 1200 manually simplified sentences (144 019 tokens) from clinical trials in Spanish. A total of 1040 announcements from the European Clinical Trials Register (EudraCT) were analyzed to select sentences with ambiguities or exceeding 25 words. Simplification criteria were devised in an annotation guideline, which is released publicly along with the dataset.

Files:

1) claramed_synt_simp_aligned.tsv: file with the 1200 sentences (original, syntactic simplification, and syntactic and lexical simplification). It is a TSV file with the following fields:
   1) File ID: EudraCT code.
   2) Source: specialized sentence.
   3) Syntactic simplification: a simplified sentence with syntax-level operations.
   4) Syntactic and lexical simplification: a fully simplified sentence.
2) CLARA-MeD_simplif_guideline.pdf: annotation guideline with linguistic criteria.

Methods used for collection/generation of data: the methods are explained in the following article: Leonardo Campillos-Llanos, Ana Rosa Terroba Reinares, Rocío Bartolomé Rodríguez (2022) "Enhancing the understanding of clinical trials with a sentence-level simplification dataset". Procesamiento del lenguaje natural, nº 72.

Methods for processing the data: manual revision of technical sentences and simplification according to the criteria defined in the companion guideline.

This resource was collected in the CLARA-MeD project, with the goal of simplifying medical texts in the Spanish language and reducing the language barrier to patients' informed decision-making. In particular, the project aims at developing linguistic resources for automatic medical term simplification in Spanish, and at conducting experiments in automatic text simplification.

Funding: CLARA-MeD project (PID2020-116001RA-C33), funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in project call "Proyectos I+D+i Retos Investigación".

Peer reviewed.
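The four-field TSV layout described above can be read with Python's standard `csv` module. This is a minimal sketch under stated assumptions: the key names in `FIELDS` and the function name `read_simplifications` are hypothetical, the input is assumed to have no header row (skip the first line if the real file has one), and quoting is disabled on the assumption that sentences may contain literal quote characters.

```python
import csv
from typing import Dict, Iterable, List

# Hypothetical key names for the four documented fields, in order.
FIELDS = ["file_id", "source", "syntactic", "syntactic_lexical"]


def read_simplifications(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated lines into one dict per sentence.

    Rows that do not have exactly four fields are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(FIELDS, row)) for row in reader if len(row) == len(FIELDS)]
```

The function accepts any iterable of lines, so it can be called on a file object opened with `open(path, encoding="utf-8", newline="")` as well as on an in-memory list of strings.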

HANDLE: http://hdl.handle.net/10261/346579
DOI: https://doi.org/10.20350/digitalCSIC/16110
