Second Call for Participation
WMT 2020 Shared Task
Parallel Corpus Filtering and Alignment for Low-Resource Conditions
SEE UPDATES BELOW

http://www.statmt.org/wmt20/parallel-corpus-filtering.html

We announce and call for participation in the WMT 2020 shared task on
assessing the quality of sentence pairs in a parallel corpus.

   - In the WMT18 shared task on parallel corpus filtering
   <http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we posed
   the challenge of a noisy web-crawled parallel corpus for German-English and
   asked participants to score each sentence pair. These quality scores were
   used to select subsets of the corpus consisting of the highest-scoring
   sentence pairs, to train statistical and neural machine translation systems
   on them, and to evaluate these systems on several test sets.
   - In the WMT19 shared task on parallel corpus filtering for low resource
   conditions <http://www.statmt.org/wmt19/parallel-corpus-filtering.html>,
   we followed the same protocol, but this time for Nepali-English and
   Sinhala-English. For low-resource language pairs like these, both the
   existing clean parallel corpora and the noisy web-crawled data to be scored
   come in smaller amounts and at lower quality.
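
The selection step in this protocol can be sketched roughly as follows. This is only an illustration, not the organizers' scoring pipeline: the function name select_subset, the in-memory list of (source, English) pairs, and the idea of capping the subset by a count of English words are all assumptions of this sketch.

```python
def select_subset(pairs, scores, max_english_words):
    """Return indices of the highest-scoring pairs that fit a word budget.

    pairs: list of (source_sentence, english_sentence) tuples (assumed layout).
    scores: one quality score per pair; higher means better.
    max_english_words: hypothetical cap on the subset size, counted in
        whitespace-separated English tokens.
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Rank pair indices from highest to lowest score.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected = []
    total_words = 0
    for i in order:
        n_words = len(pairs[i][1].split())
        # Stop as soon as the next pair would exceed the budget.
        if total_words + n_words > max_english_words:
            break
        selected.append(i)
        total_words += n_words

    # Report the selection in original corpus order.
    return sorted(selected)
```

A participant's system only has to produce the scores; a sketch like this shows why the ranking of the scores matters and their absolute values do not.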

This year, we pose the task for two different language pairs, Khmer-English
and Pashto-English. In addition to the task of computing quality scores for
the purpose of filtering, we also allow for the re-alignment of sentence
pairs from document pairs.

IMPORTANT UPDATES
In sync with changes in deadlines for EMNLP and WMT, we have also pushed
back all deadlines by one month:
Submission deadline for subsampled sets    August 1, 2020
System descriptions due                    August 15, 2020
Announcement of results                    August 29, 2020
Paper notification                         September 29, 2020
Camera-ready for system descriptions       October 10, 2020

MISSING SENTENCE PAIRS IN DOCUMENT PAIRS
Unfortunately, for some of the sentence pairs included in the
sentence-aligned set, the corresponding document pairs are missing. So, if
you are running your own document alignment, we have provided sets of
sentence pairs to add to the sentence pairs that you extract yourself from
the document pairs (see Section "RAW CORPUS DOWNLOAD / Document Pairs" on
the web page).
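
Combining the two sources could look like the following minimal sketch. The function name merge_pairs and the in-memory representation as (source, English) tuples are assumptions made for illustration; the file format of the provided sets is described on the web page, not here. Dropping exact duplicates is likewise a choice of this sketch, not a stated requirement.

```python
def merge_pairs(extracted, provided):
    """Append the provided sentence pairs to the self-extracted ones.

    Both arguments are iterables of (source_sentence, english_sentence)
    tuples (assumed layout). Exact duplicates are kept only once, and the
    first occurrence determines the position in the output.
    """
    seen = set()
    merged = []
    for pair in list(extracted) + list(provided):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged
```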
BASELINE RESULTS
We noticed that with different GPU hardware, different scores (±1 BLEU
point) are obtained on the same sets. There is also some variance with
different seeds. While you may observe different numbers from the ones
listed on the web page, all final scoring will be done on identical
hardware for all participants to ensure fair assessments.
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
