EMNLP 2015 Workshop on Discourse in Machine Translation (DiscoMT'15)
(http://www.idiap.ch/workshop/DiscoMT)
17 September 2015 -- Lisbon, Portugal
Final call for papers - Submission deadline: 28 June 2015
It is well-known that texts have properties that go beyond those
Now that I think of it, truecasing should not change file sizes. After
all, it only substitutes single letters with their lowercase versions, so
the file should stay the same size, unless Samoan has some weird UTF-8
letters that have different byte sizes between capitalized and
uncapitalized
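One way to check the byte-size question is to compare the UTF-8 length of each letter with that of its lowercase form. The sketch below is only an illustration: the helper names are made up for this example, and the letter inventory (the Latin letters plus the five macron vowels) is an assumption about what the Samoan corpus contains, not something taken from the actual data. The character "Ⱥ" is included only as a known case where lowercasing does change the byte count.

```python
# Hypothetical helpers; not part of any truecasing toolkit.

def utf8_case_sizes(ch):
    """Return (bytes of ch, bytes of ch.lower()) in UTF-8."""
    return len(ch.encode("utf-8")), len(ch.lower().encode("utf-8"))


def size_changing_letters(letters):
    """Return the letters whose UTF-8 size differs after lowercasing."""
    return [ch for ch in letters if len(set(utf8_case_sizes(ch))) > 1]


# Assumed uppercase inventory: plain Latin letters plus macron vowels.
SAMOAN_UPPER = "AEIOUFGLMNPSTVHKR" + "ĀĒĪŌŪ"

if __name__ == "__main__":
    print("Samoan letters that change size:", size_changing_letters(SAMOAN_UPPER))
    # U+023A lowercases to U+2C65, which needs one more byte in UTF-8.
    print("Sizes for 'Ⱥ' (upper, lower):", utf8_case_sizes("Ⱥ"))
```

Under that assumed inventory no letter changes size, so any size difference between the files would have to come from somewhere else.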
I checked some of my experiments and I get nearly identical BLEU
scores when using the standard weights; differences are in the second
decimal place, if at all. These results now seem more likely,
though there is still variance.
I am still wondering why truecasing would produce
I think you are good now. That's what I am getting for a 500-sentence
test set, trained on 10,000 sentences. Similar to your results. For a
larger test set (4000 sentences) and the same training data there is
nearly no variance, 12.89 vs. 12.91. So now you need to scale up and tune.
BLEU =
Nice, thanks. Yeah, the truecased files I checked had about 18 or so
differences where one file would capitalise the first letter and the other
file wouldn't. I am going to try and compile more data. But I think I will
only manage to get about 10k to 15k parallel segments altogether. Took me
quite a