NERMuD 2023

Named-Entities Recognition on Multi-Domain Documents @ EVALITA 2023

NEW! 12/05/2023 - Test data available on the GitHub repo.

NEW! 30/04/2023 - Evaluation window disclosed: May 12th-19th.

NEW! 07/02/2023 - Train and development data are released on the KIND project page. If you want to receive updates about the task, just subscribe here.

Task description

NERMuD is a task presented at EVALITA 2023 consisting of the extraction and classification of named entities in a document, such as persons, organizations, and locations.

NERMuD 2023 will include two different sub-tasks:

- Domain-agnostic classification (DAC): participants will be asked to classify entities across all domains with a single general model.
- Domain-specific classification (DSC): participants will be asked to classify entities with a separate model for each domain.

The two classification sub-tasks can be addressed in several ways, for example with deep learning techniques or with the addition of external data such as gazetteers.
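As a toy illustration of the gazetteer idea (not an official baseline; the entity lists below are hand-made examples), a longest-match lookup producing IOB labels could look like this:

```python
# Toy gazetteer tagger: label a token sequence by longest match against
# small hand-made entity lists (illustrative only, not an official baseline).
GAZETTEER = {
    ("Umberto", "Guidoni"): "PER",
    ("Agenzia", "Spaziale", "Europea"): "ORG",
}

def tag_with_gazetteer(tokens):
    labels, i = ["O"] * len(tokens), 0
    while i < len(tokens):
        # try the longest gazetteer entry starting at position i
        for entry, ent in sorted(GAZETTEER.items(), key=lambda kv: -len(kv[0])):
            if tuple(tokens[i:i + len(entry)]) == entry:
                labels[i] = f"B-{ent}"
                for j in range(i + 1, i + len(entry)):
                    labels[j] = f"I-{ent}"
                i += len(entry) - 1      # skip past the matched entity
                break
        i += 1
    return labels

tokens = ["L'", "astronauta", "Umberto", "Guidoni", ",", "dell'",
          "Agenzia", "Spaziale", "Europea"]
print(list(zip(tokens, tag_with_gazetteer(tokens))))
```

In practice such lookups are combined with a trained sequence tagger rather than used alone.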

The task has already been addressed in almost all languages (Goyal et al., 2018), demonstrating wide interest in the topic. Extracting entities from text is a relevant task because it can play a role in document retrieval, summarisation, event detection, and similar applications. It is also important in its own right, for instance in processing large archival collections. Although named-entity recognition is often believed to be a solved task, some studies show that this is far from true: depending on the labels, languages, and topics, there is always room for improvement (Marrero et al., 2013).

Moreover, while named-entity extraction has been studied for a long time, usually only news (and, more recently, social media) have been considered when releasing datasets (I-CAB, Magnini et al., 2006) or proposing tasks (NEEL-IT 2016, Basile et al., 2016; NER 2011, Bartalesi Lenzi et al., 2011).

Data and annotation

The corpus that can be used for training is named KIND (Paccosi et al., 2022) and is already available on GitHub, as it was presented at LREC 2022.

The original dataset contains more than one million tokens with the annotation covering three classes: person, location, and organization. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses. This subset will be used as training data for the NERMuD 2023 task.

All the texts used for the annotation are publicly available, under a license that permits both research and commercial use. In particular, the texts used for NERMuD are taken from:

- Wikinews, for the news domain;
- fiction books in the public domain, taken from the Liber Liber website, for the literature domain;
- Alcide De Gasperi's public documents from the Alcide Digitale corpus, for the political speeches domain.

Since that dataset is already publicly released and available, a new set of test data will be annotated and shared following the same guidelines (available on the KIND website).

The dataset complies with ethical standards and has been collected in a manner which is consistent with the terms of use of any sources and the intellectual property and privacy rights of the original authors of the texts.

Annotation scheme

The data is annotated using the Inside-Outside-Beginning (IOB) tagging scheme, which marks each token of an entity as begin-of-entity (B-ent) or continuation-of-entity (I-ent), while tokens outside any entity are marked O. According to this format, the first token of a multi-token entity is tagged B-ent and the following ones I-ent.

Persons are marked with PER, organizations with ORG, and locations with LOC.

This is an example of annotation (fields are tab-separated, sentences are separated by empty lines):

L'	O
astronauta	O
Umberto	B-PER
Guidoni	I-PER
,	O
dell'	O
Agenzia	B-ORG
Spaziale	I-ORG
Europea	I-ORG
,	O
svela	O
ai	O
bambini	O
i	O
segreti	O
della	O
Stazione	B-LOC
Spaziale	I-LOC
Internazionale	I-LOC
.	O
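For orientation, here is a minimal sketch of how this format can be read, assuming a UTF-8 TSV file laid out exactly as above (the file name is hypothetical):

```python
# Minimal reader for the tab-separated IOB format shown above: one token and
# its label per line, sentences separated by empty lines.
def read_iob(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # an empty line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.split("\t")
            current.append((token, label))
    if current:                          # flush a final sentence without trailing newline
        sentences.append(current)
    return sentences

sentences = read_iob("train.tsv")        # hypothetical file name
print(sentences[0][:4])                  # [("L'", 'O'), ('astronauta', 'O'), ...]
```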

Dataset description

Wikinews

Wikinews is a multi-language free-content project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License. In building the dataset, we randomly chose 1,000 articles evenly distributed over the last 20 years, for a total of 308,622 tokens.

An estimated 150-200 new articles will be included in the test set.

Fiction books

Regarding fiction literature, KIND contains 86 chapters taken from 10 books by Italian authors who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.

In selecting works in the public domain, we favored the most recent texts available, so that models trained on this data can also perform well on novels written in recent years, whose language is more similar to that of contemporary fiction.

For the test set, we will add 10-15 new book chapters. To increase the difficulty of the task, we may include in the test set texts from authors not selected for training.

Political speeches

We annotated 158 documents (150,632 tokens) from Alcide Digitale, spanning 50 years of European history. The complete corpus contains a comprehensive collection of Alcide De Gasperi's public documents, 2,762 in total, written or transcribed between 1901 and 1954.

In the test set we will include at least 15-20 documents taken from the same corpus and annotated following the same guidelines.

Task details

Participants are required to submit their runs and to provide a technical report with a brief description of their approach (focusing on the adopted algorithms, models, and resources), a summary of their experiments, and an analysis of the obtained results.

Each participant can submit up to 3 runs for each sub-task.

Runs should be submitted as TSV files with tab-delimited fields, following the same format as the training dataset. No missing data is allowed: a label must be predicted for each token in the test set.
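Before submitting, a quick format check can save a rejected run. A sketch of such a check follows (file names are hypothetical, and the official instructions take precedence):

```python
from itertools import zip_longest

# Pre-submission sanity check: the run must contain one token<TAB>label line
# per test token, with the same tokens and the same sentence breaks.
def check_run(test_path, run_path):
    with open(test_path, encoding="utf-8") as t, open(run_path, encoding="utf-8") as r:
        for n, (test_line, run_line) in enumerate(zip_longest(t, r), start=1):
            assert test_line is not None and run_line is not None, \
                f"line {n}: files have different lengths"
            test_line, run_line = test_line.rstrip("\n"), run_line.rstrip("\n")
            if not test_line:            # sentence boundary in the test set
                assert not run_line, f"line {n}: expected an empty line"
                continue
            fields = run_line.split("\t")
            assert len(fields) == 2, f"line {n}: expected token<TAB>label"
            assert fields[0] == test_line.split("\t")[0], f"line {n}: token mismatch"

check_run("test.tsv", "run1.tsv")        # hypothetical file names
```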

Once the system has produced its results on the test set, participants have to follow these instructions to complete their submission:

Evaluation metric

Final results will be calculated in terms of macro-average F1. An evaluation script will be released soon.
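Until the official script is out, a rough stand-in can be computed with the third-party seqeval library, under the assumption that the metric is entity-level macro-averaged F1 over PER, ORG, and LOC:

```python
# Rough stand-in for the official scorer (assumption: entity-level F1,
# macro-averaged over the three classes; pip install seqeval).
from seqeval.metrics import f1_score

gold = [["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O"]]
pred = [["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O"]]

# Macro-average: F1 is computed per entity class, then averaged unweighted.
print(f1_score(gold, pred, average="macro"))
```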

Baseline

We provide two baselines, both of which were used in the evaluation of the KIND dataset.

Important dates

Contacts

References