Named-Entities Recognition on Multi-Domain Documents @ EVALITA 2023
NEW! 07/02/2023 - Training and development data are released on the KIND project page. If you want to keep in touch about the task, just subscribe here.
NERMuD is a task presented at EVALITA 2023 consisting of the extraction and classification of named entities in a document, such as persons, organizations, and locations.
NERMuD 2023 will include two different sub-tasks:
Domain-agnostic classification (DAC). Participants will be asked to select and classify entities among three categories (person, organization, location) in different types of texts (news, fiction, political speeches) using one single general model.
Domain-specific classification (DSC). Participants will be asked to deploy a different model for each of the above types, trying to increase the accuracy for each considered type.
The two classification tasks can be addressed in several ways, for example using deep learning techniques or adding external data such as gazetteers.
The task has already been addressed in almost all languages (Goyal et al., 2018), demonstrating a huge interest in the topic. Extracting entities from texts is a relevant task because it can play a role in document retrieval, summarisation, event detection, and so on. It is also an important task per se, since it can be used to process large archival collections. Although named entity recognition is often believed to be a solved task, some studies show that this is far from true, and that depending on the labels, languages, and topics there is always room for improvement (Marrero et al., 2013).
While the extraction of named entities has been studied for a long time, in recent years usually only news (and, more recently, social media) have been considered when releasing datasets (I-CAB, Magnini et al., 2006) or proposing tasks (NEEL-IT 2016, Basile et al., 2016; NER 2011, Bartalesi Lenzi et al., 2011).
Data and annotation
The corpus that can be used for training is named KIND (Paccosi et al., 2022) and is already available on GitHub, since it has been presented at LREC 2022.
The original dataset contains more than one million tokens with the annotation covering three classes: person, location, and organization. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses. This subset will be used as training data for the NERMuD 2023 task.
All the texts used for the annotation are publicly available, under a license that permits both research and commercial use. In particular, the texts used for NERMuD are taken from:
Wikinews (WN) as a source of news texts belonging to the last decades;
Some Italian fiction books (FIC) in the public domain;
Writings and speeches from the Italian politician Alcide De Gasperi (ADG).
Since the dataset is already publicly released and available, a new set of data will be annotated for the test phase and shared using the same guidelines (available on the KIND website).
The data is annotated using the Inside-Outside-Beginning (IOB) tagging scheme, which labels each token of an entity as beginning-of-entity (B-ent) or continuation-of-entity (I-ent), while tokens outside any entity are tagged O. According to this format, the first token of a "compound" entity is tagged B-ent and the following ones are tagged I-ent.
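As an illustration of how the scheme works, the helper below groups a sequence of IOB tags back into entity spans. It is a hypothetical sketch, not part of the official task tooling:

```python
def iob_to_spans(tags):
    """Group IOB tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:  # close the previously open entity
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            continue  # continuation of the current entity
        else:  # "O", or an I- tag that does not continue the open entity
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans
```

For instance, the tag sequence `["B-PER", "I-PER", "I-PER", "O", "B-LOC"]` yields a three-token PER span followed by a one-token LOC span.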
Persons are marked with PER, organizations with ORG, and locations with LOC.
This is an example of annotation (fields are tab separated, sentences are empty-line-separated):
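(The sentence below is invented for illustration and is not taken from the dataset.)

```
Giovanni	B-PER
Verga	I-PER
nacque	O
a	O
Catania	B-LOC
.	O
```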
Wikinews is a multi-language free-content project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License. In building the dataset, we randomly chose 1,000 articles evenly distributed over the last 20 years, for a total of 308,622 tokens.
An estimated 150-200 new articles will be included in the test set.
Regarding fiction literature, KIND contains 86 chapters taken from 10 books written by Italian authors, who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.
In selecting works in the public domain, we favored texts that are as recent as possible, so that a model trained on this data can also work well on contemporary novels, whose language is more likely to be similar to that of the training texts.
For the test set, we will add 10-15 new book chapters. To increase the difficulty of the task, we may include in the test set texts from authors not selected for training purposes.
We annotated 158 documents (150,632 tokens) from Alcide Digitale, spanning 50 years of European history. The complete corpus contains a comprehensive collection of Alcide De Gasperi's public documents, 2,762 in total, written or transcribed between 1901 and 1954.
In the test set we will include at least 15-20 documents taken from the same corpus and annotated following the same guidelines.
Participants are required to submit their runs and to provide a technical report that should include a brief description of their approach, focusing on the adopted algorithms, models and resources, a summary of their experiments, and an analysis of the obtained results.
Each participant can submit up to 3 runs for each sub-task.
Runs should be submitted as TSV files with tab-delimited fields, following the same format as the training dataset. No missing data are allowed: a label must be predicted for each token in the test set.
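Before submitting, it can be useful to check that a run respects this format. The checker below is a hypothetical sketch (not the official validator), assuming the three KIND entity classes in IOB notation:

```python
# Hypothetical run-file checker: verifies that every non-empty line has a
# token and a predicted label, tab-separated, and that the label belongs
# to the IOB tag set used by KIND (PER, ORG, LOC).
VALID_LABELS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"}

def check_run(lines):
    """Return a list of (line_number, message) problems found in a run."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if line.strip() == "":
            continue  # empty lines separate sentences
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((n, "expected exactly two tab-separated fields"))
        elif fields[1] not in VALID_LABELS:
            problems.append((n, f"unknown label {fields[1]!r}"))
    return problems
```

An empty returned list means the run is structurally well-formed; it does not, of course, say anything about prediction quality.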
Once the system has produced the results for the task over the test set, participants have to follow these instructions to complete their submission:
choose a team name (teamName in the following);
name the runs with the following filename format: subtask_teamName_runID. For example: DAC_fbk_1 would be the first run of a team called fbk for the domain-agnostic classification sub-task whereas DSC_fbk_2 would be the second run of a team called fbk for the domain-specific classification sub-task;
send the runs to the following email address: email@example.com, using the subject “NERMuD 2023 submission - teamName”.
Final results will be calculated in terms of macro-average F1. An evaluation script will be released soon.
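Since the official script is not yet available, the sketch below shows one plausible reading of macro-average F1 over token labels; the released script may differ (for example, by scoring entity spans rather than individual tokens):

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 per label, then average over labels.
    A sketch for illustration; the official evaluation script may differ."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was not correct
            fn[g] += 1  # gold label g that was missed
    scores = []
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Macro averaging weights every label equally, so rare classes count as much as frequent ones, unlike a micro (token-weighted) average.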
We propose two baselines, both used for the evaluation of the KIND dataset.
Conditional Random Fields (CRF) model, using the implementation included in Stanford CoreNLP (Manning et al., 2014), already used in its NER module (Finkel et al., 2005).
BERT (Devlin et al., 2019) NER classification partially inspired by a blog post by Tobias Sterbak, using BertForTokenClassification from Hugging Face. A working implementation can be found on GitHub.
Training and development data release: February 7th, 2023
Registration closes: April 30th, 2023
Evaluation windows: TBA (around May, 2023)
Assessment returned to participants: May 30th, 2023
Submission deadline for technical report: June 14th, 2023
Review deadline: July 10th, 2023
Camera-ready version of the report: July 25th, 2023
Final workshop in Parma: September 7th-8th, 2023
Teresa Paccosi (Fondazione Bruno Kessler) firstname.lastname@example.org
Alessio Palmero Aprosio (Fondazione Bruno Kessler) email@example.com
Archana Goyal, Vishal Gupta, Manish Kumar. Recent Named Entity Recognition and Classification techniques: A systematic review. Computer Science Review, Volume 29, 2018, Pages 21-43, ISSN 1574-0137.
Bartalesi Lenzi, Valentina, Speranza, Manuela, Sprugnoli, Rachele. Named Entity Recognition on Transcribed Broadcast News at EVALITA 2011. Springer Berlin Heidelberg 2013.
Basile, Pierpaolo, A. Caputo, Anna Lisa Gentile and Giuseppe Rizzo. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. CLiC-it/EVALITA (2016).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
Bernardo Magnini, Emanuele Pianta, Christian Girardi, Matteo Negri, Lorenza Romano, Manuela Speranza, Valentina Bartalesi Lenzi and Rachele Sprugnoli. I-CAB: the Italian Content Annotation Bank. Proceedings of LREC 2006 - 5th Conference on Language Resources and Evaluation, 22-28/5/2006, Genova (IT)
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, Juan Miguel Gómez-Berbís. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces, Volume 35, Issue 5, 2013, Pages 482-489, ISSN 0920-5489.
Teresa Paccosi and Alessio Palmero Aprosio. 2022. KIND: an Italian Multi-Domain Dataset for Named Entity Recognition. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022).