[Moses-support] CfP: WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment

Philipp Koehn Sat, 28 Mar 2020 13:20:00 -0700

Call for Participation

WMT 2020 Shared Task
Parallel Corpus Filtering and Alignment for Low-Resource Conditions


http://www.statmt.org/wmt20/parallel-corpus-filtering.html

We announce and call for participation in the WMT 2020 shared task on
assessing the quality of sentence pairs in a parallel corpus.

   - In the WMT18 shared task on parallel corpus filtering
   <http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we posed
   the challenge of a noisy web-crawled parallel corpus for German-English and
   asked participants to score each sentence pair. These quality scores were
   used to select subsets of the corpus, consisting of the highest-scoring
   sentence pairs, train statistical and neural machine translation systems on
   them, and evaluate these on a set of test sets.
   - In the WMT19 shared task on parallel corpus filtering for low-resource
   conditions <http://www.statmt.org/wmt19/parallel-corpus-filtering.html>,
   we followed the same protocol, but this time for Nepali-English and
   Sinhala-English. For low-resource language pairs like these, both existing
   clean parallel corpora and the to-be-scored noisy web-crawled data come in
   smaller amounts and lower quality.

This year, we pose two different language pairs, Khmer-English and
Pashto-English. In addition to the task of computing quality scores for the
purpose of filtering, we also allow for the re-alignment of sentence pairs
from document pairs.

DETAILS

We provide a very noisy 58.3 million-word
(English token count) Khmer-English corpus and an 11.6 million-word
Pashto-English corpus. These corpora were partly crawled from the web as
part of the Paracrawl <http://paracrawl.eu/> project, and partly extracted
from the CommonCrawl <https://commoncrawl.org/> data set. We ask
participants to provide scores for each sentence pair in each of the noisy
parallel sets. The scores will be used to subsample sentence pairs that
amount to 5 million English words. The quality of the resulting subsets is
determined by the quality of a neural machine translation system (fairseq)
trained on this data. The quality of the machine translation system is
measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia
translations <https://github.com/facebookresearch/flores> for Khmer-English
and Pashto-English.
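
To illustrate how per-pair scores translate into a subsampled training set,
here is a minimal sketch of score-based selection. It is not the official
subsampling script: the function name subsample_by_score, the whitespace
word count, and the rule of stopping as soon as the budget is reached are
all assumptions made for illustration.

```python
from typing import List, Sequence


def subsample_by_score(
    english_sentences: Sequence[str],
    scores: Sequence[float],
    word_budget: int = 5_000_000,
) -> List[int]:
    """Return indices of the highest-scoring pairs up to a word budget.

    Assumptions (not taken from the task description): English words are
    counted by whitespace splitting, and selection stops once the running
    total reaches the budget, so the last pair may overshoot it slightly.
    """
    if len(english_sentences) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Rank pairs from highest to lowest score; ties keep corpus order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    selected: List[int] = []
    total_words = 0
    for i in ranked:
        if total_words >= word_budget:
            break
        selected.append(i)
        total_words += len(english_sentences[i].split())
    return selected
```

With a budget of 5 million words, only the relative ranking of the scores
matters in this sketch, not their absolute values.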

We also provide clean parallel and monolingual training data for the two
language pairs. This existing data comes from a variety of sources and is
of mixed quality and relevance.

Note that the task addresses the challenge of *data quality* and *not
domain-relatedness* of the data for a particular use case. While we provide
a development and development test set that are also drawn from Wikipedia
articles, these may be very different from the final official test set in
terms of topics.

The provided raw parallel corpora are the outcome of a processing pipeline
that aimed for high recall at the cost of precision, so they are very
noisy. They exhibit noise of all kinds (wrong language in source and
target, sentence pairs that are not translations of each other, bad
language, incomplete or bad translations, etc.).

This year, we also provide the document pairs from which the sentence pairs
were extracted (using Hunalign <http://mokk.bme.hu/en/resources/hunalign/>
 and LASER <https://github.com/facebookresearch/LASER>). You may align
sentences yourself from these document pairs, thus producing your own set
of sentence pairs. If you opt to do this, you have to submit all aligned
sentence pairs and their quality scores.

IMPORTANT DATES

Release of raw parallel data             March 28, 2020
Submission deadline for subsampled sets  July 1, 2020
System descriptions due                  July 15, 2020
Announcement of results                  June 29, 2020
Paper notification                       August 17, 2020
Camera-ready for system descriptions

ORGANIZERS

Philipp Koehn, Johns Hopkins University
Francisco (Paco) Guzmán, Facebook
Vishrav Chaudhary, Facebook
Ahmed Kishky, Facebook
Naman Goyal, Facebook
Peng-Jen Chen, Facebook

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
