Search

Total results (including duplicates): 7
Found 1 page(s)

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/1RRAWJ

Dataset. 2020

HESML V1R5 JAVA SOFTWARE LIBRARY OF ONTOLOGY-BASED SEMANTIC SIMILARITY MEASURES AND INFORMATION CONTENT MODELS

Lastra-Díaz, Juan J.
Lara-Clares, Alicia
Garcia-Serrano, Ana

This dataset introduces HESML V1R5, the fifth release of the Half-Edge Semantic Measures Library (HESML) detailed in [13]. HESML V1R5 is a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH, GO, and any other ontology based on the OBO file format. HESML V1R5 implements most of the ontology-based semantic similarity measures and IC models reported in the literature, and supports the evaluation of three pre-trained word embedding models. It also provides an XML-based input file format for specifying reproducible word/concept similarity experiments based on WordNet, SNOMED-CT, MeSH, or GO without any software coding. HESML V1R5 introduces the following novelties: (1) the parsing and in-memory representation of SNOMED-CT, MeSH, and any other ontology based on the OBO file format, such as the Gene Ontology (GO); (2) a new collection of efficient path-based similarity measures, obtained by reformulating previous path-based measures on top of the new Ancestors-based Shortest-Path Length (AncSPL) algorithm; and (3) a collection of groupwise similarity measures. The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the two main HESML papers as an attribution requirement. However, the HESML distribution also includes other datasets, databases, and data files whose use requires an attribution acknowledgement by any HESML user. We therefore urge HESML users to comply with the licensing terms of the other resources distributed with the library, as detailed in its companion release notes.
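As a rough illustration of what a path-based similarity measure and an intrinsic IC model compute, the following Python sketch works on a tiny invented taxonomy. It does not use the HESML API and it is not the AncSPL algorithm: the taxonomy, the function names, the restriction of paths to those passing through a common ancestor, the 1/(1 + length) similarity, and the Seco-style IC formula are all assumptions chosen for the example.

```python
import math

# Hypothetical toy "is-a" taxonomy (child -> list of parents), invented for
# illustration; it is not taken from WordNet, SNOMED-CT, MeSH or GO.
PARENTS = {
    "entity": [],
    "animal": ["entity"],
    "plant": ["entity"],
    "dog": ["animal"],
    "cat": ["animal"],
    "oak": ["plant"],
}


def upward_distances(concept):
    """Map each ancestor (and the concept itself) to its shortest upward distance."""
    dist = {concept: 0}
    frontier = [concept]
    while frontier:  # breadth-first walk towards the root
        next_frontier = []
        for node in frontier:
            for parent in PARENTS[node]:
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def shortest_path_length(a, b):
    """Length of the shortest path from a to b that goes through a common ancestor."""
    dist_a, dist_b = upward_distances(a), upward_distances(b)
    common = dist_a.keys() & dist_b.keys()
    return min(dist_a[c] + dist_b[c] for c in common)


def path_similarity(a, b):
    """Assumed similarity: inverse of (1 + path length), so identical concepts score 1."""
    return 1.0 / (1.0 + shortest_path_length(a, b))


def descendant_count(concept):
    """Number of concepts that have `concept` as a proper ancestor."""
    return sum(
        1
        for other in PARENTS
        if other != concept and concept in upward_distances(other)
    )


def intrinsic_ic(concept):
    """Seco-style intrinsic IC: 1 - log(descendants + 1) / log(total concepts)."""
    return 1.0 - math.log(descendant_count(concept) + 1) / math.log(len(PARENTS))


if __name__ == "__main__":
    print(path_similarity("dog", "cat"), path_similarity("dog", "oak"))
    print(intrinsic_ic("entity"), intrinsic_ic("animal"), intrinsic_ic("dog"))
```

In this sketch, sibling concepts end up closer than concepts in different branches, and leaf concepts carry more information content than the root.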

Project: UNED/BICI N7/

DOI: https://doi.org/10.21950/1RRAWJ

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/AQ1CVX

Dataset. 2018

WORD SIMILARITY BENCHMARKS OF RECENT WORD EMBEDDING MODELS AND ONTOLOGY-BASED SEMANTIC SIMILARITY MEASURES

Lastra-Díaz, Juan J.
Goikoetxea, Josu
Hadj Taieb, Mohamed Ali
Garcia-Serrano, Ana
Ben Aouicha, Mohamed
Agirre, Eneko

This dataset is a companion reproducibility package for the related paper submitted for publication. Its aim is to allow the exact replication of a very large experimental survey on word similarity comparing the families of ontology-based semantic similarity measures and word embedding models, as detailed in the ‘appendix-reproducible-experiments.pdf’ file. Our experiments are based on the evaluation of all methods with the HESML V1R4 semantic measures library and the recording of these experiments with ReproZip. HESML is a self-contained Java software library of semantic measures based on WordNet whose latest version, HESML V1R4, also supports the evaluation of pre-trained word embedding files. HESML is also a self-contained experimentation platform on word similarity that is especially well suited to running large experimental surveys, because it supports the automatic execution of reproducible word similarity experiment files in an XML-based file format (*.exp). ReproZip, in turn, is a virtualisation tool whose aim is to guarantee the exact replication of experimental results on a system different from the one originally used to create them. ReproZip captures all the program dependencies and is able to reproduce the packaged experiments on any host platform, regardless of the hardware and software configuration used in their creation. Thus, ReproZip guarantees the long-term reproducibility of the experiments introduced herein. Finally, another very valuable feature of ReproZip is that it allows the input files of any ReproZip package to be modified in order to evaluate a set of experiments using methods, configuration parameters, or datasets not originally considered. This dataset contains a ReproZip package to reproduce our experiments on any supported platform, as well as all pre-trained word embedding models and word similarity datasets used in our experiments.

In addition, this dataset contains all raw output files generated by our experiments, and an R script file to generate all processed output files corresponding to the data tables in our related paper. Finally, we provide a very detailed experimental setup in the aforementioned PDF file to allow all our experiments to be reproduced exactly.

DOI: https://doi.org/10.21950/AQ1CVX

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/AQLSMV

Dataset. 2022

HESML V2R1 JAVA SOFTWARE LIBRARY OF SEMANTIC SIMILARITY MEASURES FOR THE BIOMEDICAL DOMAIN

Lara-Clares, Alicia
Lastra-Díaz, Juan J.
Garcia-Serrano, Ana

This dataset introduces HESML V2R1, the sixth release of the Half-Edge Semantic Measures Library (HESML) detailed in [24]. HESML V2R1 is a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH, GO, and any other ontology based on the OBO file format. HESML V2R1 also implements most of the sentence similarity methods in the biomedical domain, together with a set of sentence pre-processing configurations and the integration of the three main biomedical NER tools: Metamap [3], MetamapLite [7], and cTAKES [31]. HESML V2R1 implements most of the ontology-based semantic similarity measures and IC models reported in the literature, and supports the evaluation of three pre-trained word embedding models for the general domain and 33 pre-trained embeddings and language models. It also provides an XML-based input file format for specifying reproducible word/concept similarity experiments based on WordNet, SNOMED-CT, MeSH, or GO without any software coding, and the software clients needed to run the sentence-based experiments in the biomedical domain.

HESML V2R1 introduces the following novelties: (1) the software implementation of a new package for the evaluation of sentence similarity methods; (2) the software implementation of most of the sentence similarity methods in the biomedical domain; (3) the implementation of a new package for sentence pre-processing, together with a set of sentence pre-processing configurations; (4) the integration of the three main biomedical NER tools, Metamap [3], MetamapLite [7], and cTAKES [31]; (5) the software implementation of a parser based on the averaging Simple Word EMbeddings (SWEM) models introduced by Shen et al. [32] for efficiently loading and evaluating FastText-based [4] and other word embedding models; (6) the integration of Python wrappers for the evaluation of BERT [8], Universal Sentence Encoder (USE) [5], and Flair [1] models; and finally, (7) the software implementation of a new string-based sentence similarity method, called LiBlock, based on the aggregation of the Li et al. [29] similarity and Block distance [9] measures, as well as eight new variants of the ontology-based methods proposed by Sogancioglu et al. [33], and a new pre-trained word embedding model based on FastText [4] and trained on the full text of the articles in the PMC-BioC corpus [6].

The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the two main HESML papers [24] as an attribution requirement. However, the HESML distribution also includes other datasets, databases, and data files whose use requires an attribution acknowledgement by any HESML user. We therefore urge HESML users to comply with the licensing terms of the other resources distributed with the library, as detailed in its companion release notes.

HESML V2R1 is a Java library developed with NetBeans 8 which compiles and runs on any Docker-compliant platform. This work was partially supported by the UNED predoctoral grant started in April 2019 (BICI N7, November 19th, 2018). This library will be available permanently and in perpetuity.
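The averaging SWEM idea mentioned in novelty (5) can be pictured as follows: a sentence vector is the mean of the vectors of its words, and two sentences are compared through the cosine of their vectors. This Python sketch is only an illustration under stated assumptions. It does not use the HESML API, the three-dimensional word vectors are invented, whitespace tokenisation stands in for the real pre-processing, and the use of cosine similarity as the comparison function is an assumption.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only;
# real pre-trained models such as FastText use hundreds of dimensions.
VECTORS = {
    "protein": [1.0, 0.0, 0.0],
    "binds": [0.0, 1.0, 0.0],
    "dna": [0.0, 0.0, 1.0],
    "gene": [0.0, 0.0, 1.0],
}


def sentence_vector(sentence):
    """Average the vectors of the known words; return None if no word is known."""
    tokens = [t for t in sentence.lower().split() if t in VECTORS]
    if not tokens:
        return None
    dim = len(VECTORS[tokens[0]])
    return [sum(VECTORS[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2):
    """Cosine similarity between the averaged word vectors of two sentences."""
    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    if v1 is None or v2 is None:
        return 0.0  # assumed fallback when a sentence has no known words
    return cosine(v1, v2)


if __name__ == "__main__":
    print(sentence_similarity("protein binds dna", "protein binds gene"))
    print(sentence_similarity("protein", "dna"))
```

Because the averaged vector ignores word order, this kind of model is cheap to evaluate, which is consistent with its use here for efficiently loading and evaluating large embedding files.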

DOI: https://doi.org/10.21950/AQLSMV

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/DYAZRE

Dataset. 2020

REPRODUCIBLE EXPERIMENTS ON THE MASTER THESIS: AN EXPERIMENTAL SURVEY OF NAMED ENTITY RECOGNITION METHODS IN THE BIOMEDICAL DOMAIN

Hennig, Sebastian
Garcia-Serrano, Ana

Semantic Textual Similarity (STS), also known as Semantic Short-text Similarity, is a research problem that aims to calculate the similarity between text units (phrases, sentences, paragraphs, or texts) by focusing on their semantic content. The importance of semantic similarity in Natural Language Processing has increased in recent years due to its relevance to many tasks and applications, such as Automatic Summarization, Machine Translation, Question Answering, and Semantic Indexing. UB-NER is a self-contained Java software library for benchmarking state-of-the-art STS measures in the biomedical domain. It allows users to define and execute a set of experiments combining different measures and preprocessing methods. This dataset contains the reproducibility framework and its dependencies, whose aim is to allow the exact replication of the unsupervised named entity recognition experiments in the biomedical domain, as detailed in the "ReproductionProtocol.pdf" file.

DOI: https://doi.org/10.21950/DYAZRE

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/EPNXTR

Dataset. 2021

REPRODUCIBLE EXPERIMENTS ON WORD AND SENTENCE SIMILARITY MEASURES FOR THE BIOMEDICAL DOMAIN

Lara-Clares, Alicia
Lastra-Díaz, Juan J.
Garcia-Serrano, Ana

This dataset introduces a set of reproducibility resources with the aim of allowing the exact replication of the experiments introduced by our main paper, which is a reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. This dataset provides a self-contained reproducibility platform which contains the Java source code and binaries of our main benchmark program, as well as a Docker image which allows the exact replication of our experiments on any software platform supported by Docker, such as all Linux-based operating systems, Windows, or macOS. Our benchmark program is distributed with the UMLS SNOMED-CT and MeSH ontologies by courtesy of the US National Library of Medicine (NLM), as well as all the software components needed, with the aim of making the setup process easier. Our Docker image provides an exact virtual replica of the machine on which we ran our experiments, thus removing the need to carry out any tedious setup process, such as the setup of the Python virtual environments and other software components.

The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the two main HESML papers [17] as an attribution requirement. However, the HESML distribution also includes other datasets, databases, and data files whose use requires an attribution acknowledgement by any HESML user. We therefore urge HESML users to comply with the licensing terms of the other resources distributed with the library, as detailed in its companion release notes.

DOI: https://doi.org/10.21950/EPNXTR

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/ML9OI9

Dataset. 2021

FORMAL CONCEPT ANALYSIS FOR TOPIC DETECTION: A CLUSTERING QUALITY EXPERIMENTAL ANALYSIS

Castellanos, Angel
Cigarrán, Juan
Garcia-Serrano, Ana

RepLab is a competitive evaluation exercise for Online Reputation Management systems organized as an activity of CLEF. RepLab 2013 focused on the task of monitoring the reputation of entities (companies, organizations, celebrities, etc.) on Twitter. The monitoring task for analysts consists of searching the stream of tweets for potential mentions of the entity, filtering those that do refer to the entity, detecting topics (i.e., clustering tweets by subject), and ranking them based on the degree to which they signal reputation alerts (i.e., issues that may have a substantial impact on the reputation of the entity). The RepLab 2013 task is defined, accordingly, as (multilingual) topic detection combined with priority ranking of the topics, as input for reputation monitoring experts. The detection of reputational polarity (does the tweet have negative/positive implications for the reputation of the entity?) is an essential step to assign priority, and was evaluated as a standalone subtask.

Application of Formal Concept Analysis (FCA), an exploratory technique for data analysis and organization. In particular, we propose an extension of the FCA-based methods for topic detection applied in the literature, by applying the stability concept to topic selection. The hypothesis is that FCA will enable a better organization of the data, and stability a better selection of topics based on this data organization, thus better fulfilling the task requirements by improving the quality and accuracy of the topic detection process.

FCA.tar.gz (about 3 MB): this file contains the FCA implementation as well as the input files for the execution. The dataset can be downloaded from the official RepLab webpage: http://nlp.uned.es/replab2013/.
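To make the FCA and stability notions concrete, the following Python sketch enumerates the formal concepts of a tiny invented tweet/term context and computes the intensional stability of a concept, taken here as the fraction of subsets of its extent whose shared attributes equal its intent. This is not the code shipped in FCA.tar.gz: the context, the function names, and the brute-force enumeration are assumptions for illustration, and the record does not state which stability variant the authors used.

```python
from itertools import combinations

# Hypothetical toy formal context: tweet id -> set of terms (invented data).
CONTEXT = {
    "t1": {"bank", "loan"},
    "t2": {"bank", "loan", "fraud"},
    "t3": {"bank", "fraud"},
}


def intent(objects):
    """Attributes shared by all given objects (every attribute for the empty set)."""
    shared = set().union(*CONTEXT.values())
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)


def extent(attributes):
    """Objects that have all the given attributes."""
    wanted = set(attributes)
    return frozenset(obj for obj, attrs in CONTEXT.items() if wanted <= attrs)


def subsets(items):
    """All subsets of a collection, as tuples (exponential; toy use only)."""
    ordered = sorted(items)
    for size in range(len(ordered) + 1):
        yield from combinations(ordered, size)


def concepts():
    """Enumerate all formal concepts (extent, intent) by closing every object subset."""
    found = set()
    for subset in subsets(CONTEXT):
        closed_intent = intent(subset)
        found.add((extent(closed_intent), closed_intent))
    return found


def stability(concept_extent, concept_intent):
    """Fraction of subsets of the extent whose intent equals the concept's intent."""
    hits = sum(1 for s in subsets(concept_extent) if intent(s) == concept_intent)
    return hits / 2 ** len(concept_extent)


if __name__ == "__main__":
    for ext, itt in sorted(concepts(), key=lambda c: len(c[0])):
        print(sorted(ext), sorted(itt), stability(ext, itt))
```

In a topic detection setting, each concept can be read as a candidate topic (a group of tweets and the terms they share), and stability gives one way to rank candidates by how little they depend on any single tweet.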

DOI: https://doi.org/10.21950/ML9OI9

e-cienciaDatos, Repositorio de Datos del Consorcio Madroño

doi:10.21950/OTDA4Z

Dataset. 2020

REPRODUCIBILITY DATASET FOR A BENCHMARK OF BIOMEDICAL SEMANTIC MEASURES LIBRARIES

Lastra-Díaz, Juan J.
Lara-Clares, Alicia
Garcia-Serrano, Ana

This dataset introduces a set of reproducibility resources with the aim of allowing the exact replication of the experiments introduced by our companion paper, which compares the performance of the three UMLS-based semantic similarity libraries reported in the literature: (1) UMLS::Similarity [20]; (2) the Semantic Measures Library (SML) [3]; and (3) the latest version of our Half-Edge Semantic Measures Library (HESML), introduced in our aforementioned companion paper. HESML V1R5 is the fifth release of HESML, detailed in [15], which is a linearly scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models for ontologies such as WordNet, SNOMED-CT, MeSH, and GO. This dataset provides a self-contained reproducibility platform which contains the Java source code and binaries of our main benchmark program, as well as a Docker image which allows the exact replication of our experiments on any software platform supported by Docker, such as all Linux-based operating systems, Windows, or macOS. Our benchmark program is distributed with the UMLS SNOMED-CT and MeSH ontologies by courtesy of the US National Library of Medicine (NLM), as well as all the software components needed, with the aim of making the setup process easier. Our Docker image provides an exact virtual replica of the machine on which we ran our experiments, thus removing the need to carry out any tedious setup process, such as the setup of the UMLS Metathesaurus on a MySQL database, the UMLS::Similarity library, and other software components.

DOI: https://doi.org/10.21950/OTDA4Z
