Dataset.
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp)
Digital.CSIC. Repositorio Institucional del CSIC
oai:digital.csic.es:10261/372850
- Ortega Riba, Federico
- Campillos-Llanos, Leonardo
[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task". (Under review).
[Methods for processing the data] Manual annotation of complex words (CW) according to the criteria defined in the guidelines described in the companion article. The corpus is made up of 225 Spanish texts annotated with complex words (CW). It contains three text types: consent forms (75 texts), clinical trial announcements (75 texts) and patient information documents (75 texts).
This resource is intended for training and evaluating models and for running experiments on complex word identification in Spanish medical texts. The dataset was collected in the CLARA-MeD project (PID2020-116001RA-C33), funded by the Spanish government through MICIU/AEI/10.13039/501100011033/ in the project call "Proyectos I+D+i Retos Investigación".
The corpus contains three text types:
• Consent forms (75 texts)
• Clinical trial announcements (75 texts)
• Patient information leaflets (75 texts)
The files are provided as follows:
- ANN: BRAT annotation files (.ann) with their corresponding text files (.txt), organized into three folders (TRAIN, DEV, TEST), each with one subfolder per text type:
• TRAIN:
▫ ci: 51 consent forms ('consentimientos informados')
▫ eudract: 51 clinical trial announcements from REEC and EudraCT
▫ info: 51 patient-oriented information leaflets
• DEV:
▫ ci: 9 consent forms (the original record carries the note "FALTA UNO", i.e. "one missing")
▫ eudract: 9 clinical trial announcements
▫ info: 9 patient-oriented information leaflets
• TEST:
▫ ci: 15 consent forms
▫ eudract: 15 clinical trial announcements
▫ info: 15 patient-oriented information leaflets
- JSON files for transformer models, separated into TRAIN, DEV and TEST.
- CSV files with the processed data for TRAIN, DEV and TEST, which were used for the machine learning experiments. Each file has the following fields:
• Token
• Label: encodes the class (CW, 'complex word') and whether the token is at the Beginning (B) of the entity, Inside (I) it, or Outside (O) any entity.
Peer reviewed
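The Token and Label fields of the CSV files follow a B/I/O scheme, so consecutive tokens can be regrouped into complex-word spans. The following is a minimal sketch of that regrouping, not the authors' own processing code. It assumes the label strings are "B-CW", "I-CW" and "O", that the files are comma-separated with a header row named exactly Token and Label, and it uses an invented sample sentence in place of a real corpus file; all of these should be checked against the released data.

```python
import csv
import io


def extract_complex_words(rows):
    """Group BIO-labelled tokens into complex-word spans.

    `rows` is an iterable of (token, label) pairs. The label strings
    "B-CW", "I-CW" and "O" are an assumption; the released files may
    spell them differently.
    """
    spans, current = [], []
    for token, label in rows:
        if label == "B-CW":
            # A new span starts; flush the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I-CW" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # "O" closes any open span; an "I-CW" with no open span is ignored.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def read_rows(handle):
    """Read (token, label) pairs from a CSV with Token and Label columns."""
    reader = csv.DictReader(handle)
    return [(row["Token"], row["Label"]) for row in reader]


# Hypothetical sample, invented for illustration (not taken from the corpus).
SAMPLE_CSV = """Token,Label
El,O
paciente,O
recibirá,O
anestesia,B-CW
epidural,I-CW
y,O
analgésicos,B-CW
.,O
"""

rows = read_rows(io.StringIO(SAMPLE_CSV))
complex_words = extract_complex_words(rows)
print(complex_words)
```

To run the same function on a real file, `io.StringIO(SAMPLE_CSV)` can be swapped for `open(path, encoding="utf-8", newline="")`, where `path` points to one of the TRAIN, DEV or TEST CSV files.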
DOI: http://hdl.handle.net/10261/372850
HANDLE: http://hdl.handle.net/10261/372850
2 versions
oai:digital.csic.es:10261/372850
Dataset. 2024
CORPUS FOR COMPLEX WORD IDENTIFICATION IN MEDICAL SPANISH TEXTS (CWI-MED-SP)
oai:digital.csic.es:10261/373675
Dataset. 2024
CORPUS FOR COMPLEX WORD IDENTIFICATION IN MEDICAL SPANISH TEXTS (CWI-MED-SP)