Dataset.

CLARA-MeD corpus

Digital.CSIC. Repositorio Institucional del CSIC
oai:digital.csic.es:10261/269887
  • Campillos-Llanos, Leonardo
  • Terroba Reinares, Ana Rosa
  • Zakhir Puig, Sofía
  • Valverde Mateos, Ana
  • Capllonch Carrión, Adrián
A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish: 1) Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) Cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) Clinical trial announcements (5748 pairs of texts, 451 690 tokens). The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens), released as a benchmark for medical text simplification. The latest download of files was in February 2022.

This dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), whose goal is to simplify medical texts in Spanish and reduce the language barrier to patients' informed decision making. In particular, the project aims to develop linguistic resources for automatic medical term simplification in Spanish and to conduct experiments in automatic text simplification. The project is funded by the Spanish government through MCIN/AEI/10.13039/501100011033/, in the project call “Proyectos I+D+i Retos Investigación”.

Folders:
1) Comparable corpus: the data of each source can be found in the corresponding folder. Each folder contains two subfolders:
  - source: professional, specialized texts (".src" file extension)
  - target: simplified texts (".trg" file extension)
2) Aligned sentences: these can be found in the file "aligned.tsv" (in folder "aligned").

The folder structure is as follows:
  - aligned/
    - aligned.tsv
  - cima/
    - source/
    - target/
  - eudract/
    - source/
    - target/
  - nci/
    - source/
    - target/

Peer reviewed
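The folder layout described above could be read with a short Python sketch like the one below. It is only an illustration under two assumptions that the record does not confirm: that a ".src" file and its ".trg" counterpart share the same base name, and that "aligned.tsv" is plain tab-separated text with no quoting. The function names pair_documents and read_aligned are hypothetical helpers, not part of the dataset.

```python
import csv
from pathlib import Path


def pair_documents(source_root):
    """Pair professional (.src) and simplified (.trg) files of one source folder.

    Assumption (not stated in the record): paired files share a base name,
    e.g. source/doc1.src <-> target/doc1.trg.
    """
    root = Path(source_root)
    pairs = []
    for src in sorted((root / "source").glob("*.src")):
        trg = root / "target" / (src.stem + ".trg")
        if trg.exists():  # skip professional texts with no simplified match
            pairs.append((src, trg))
    return pairs


def read_aligned(tsv_path):
    """Return the rows of aligned.tsv as lists of fields.

    No column layout is assumed; inspect the first rows before relying
    on any particular field order.
    """
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

For example, pair_documents("cima") would list the matched file pairs under cima/, and read_aligned("aligned/aligned.tsv") would return the aligned sentence rows for inspection.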
 

HANDLE: http://hdl.handle.net/10261/269887
DOI: https://doi.org/10.20350/digitalCSIC/14644
Year: 2022
