2nd Call for Participation - VarDial Evaluation Campaign 2021

Within the scope of the eighth VarDial workshop, co-located with EACL 2021, we 
are organizing an evaluation campaign on similar languages, varieties and 
dialects with four shared tasks.

URL: https://sites.google.com/view/vardial2021/evaluation-campaign

To participate and receive the training data, please fill in the registration 
form available on the workshop website. The training sets are already available.

The tasks we are organizing this year are the following (please check the 
website for more information):

- DLI - Dravidian Language Identification: Dravidian languages are a language 
family spoken mainly in the south of India. The four major literary Dravidian 
languages are Tamil (ISO 639-3: tam), Telugu (ISO 639-3: tel), Malayalam (ISO 
639-3: mal), and Kannada (ISO 639-3: kan). Tamil, Malayalam, and Kannada are 
closely related, belonging to the South Dravidian subgroup. The DLI shared task 
provides participants with a collection of 16,672 YouTube comments as a training 
set. The comments contain code-mixed sentences that mix English with one of the 
South Dravidian languages (Tamil, Malayalam, or Kannada). All comments were 
written in Roman script (a non-native script). The task is to identify the 
language of each comment.

- RDI - Romanian Dialect Identification: In this second iteration of the 
Romanian Dialect Identification (RDI) shared task, we provide participants with 
an augmented version of the MOROCO data set for training, which contains 
Moldavian (MD) and Romanian (RO) samples of text collected from the news 
domain. A new test set has been collected which will allow participants to 
improve the results they obtained in VarDial 2020. The task is a binary 
classification by dialect, in which a classification model is required to 
discriminate between the Moldavian (MD) and the Romanian (RO) dialects. The 
task is closed; participants are therefore not allowed to use external data to 
train their models. The test set will contain newly collected text samples not 
previously included in MOROCO. The test samples will come from a different 
domain, so methods have to take the cross-domain nature of the task into 
account. RDI participants may, however, use other external resources in their 
systems, e.g. unlabelled corpora, lexicons, and pre-trained embeddings.

- SMG - Social Media Variety Geolocation: In this second iteration of the SMG 
task, we again focus on a geolocation (rather than identification) task: given 
a text, the participants have to predict its geographic location in terms of 
latitude/longitude coordinates. Using data from the social media platforms 
Twitter and Jodel, we provide extended datasets for the same three subtasks as 
in 2020: 1. Standard German Jodels; 2. Swiss German Jodels; 3. BCMS Tweets. 
All three subtasks will use the same data format and evaluation methodology, 
and participants are encouraged to submit their systems for all subtasks.

- ULI - Uralic Language Identification: This task focuses on discriminating 
between the languages in the Uralic group as defined by the ISO 639-3 standard. 
This is an open, public leaderboard competition, following VarDial 2020, in 
which participants can submit at any point until the final submission date. The 
task covers 29 relevant languages, some of which are extremely closely related 
and similar, such as Kven Finnish (fkv) and Tornedalen Finnish (fit). These 
languages are spoken from Scandinavia, Estonia, and Finland all the way to 
Russian Siberia.

Best,
Marcos
