Re: [Wikimedia-l] machine translation

Amir E. Aharoni Tue, 02 May 2017 12:26:33 -0700

2017-05-02 21:47 GMT+03:00 John Erling Blad <[email protected]>:

> Yandex as a general translation engine to be able to read some alien
> language is quite good, but as an engine to produce written text it is not
> very good at all.



... Nor is it supposed to be.

A translator is a person. Machine translation software is not a person,
it's software. It's a tool that is supposed to help a human translator
produce a good written text more quickly. If it doesn't make this work
faster, it can and should be disabled. If no translator


> In fact it often creates quite horrible Norwegian, even
> for closely related languages. One quite common problem is reordering of
> words into meaningless constructs, another problem is reordering lexical
> gender in weird ways. The English preposition "a" is often translated as
> "en" in a prepositional phrase, and then the gender is added to the
> following phrase. That gives a translation of "Oppland is a county in…"
> into something like "Oppland er en fylket i…" This should be "Oppland er
> et fylke i…".
>

I suggest making a page with a list of such examples, so that the machine
translation developers could read it.


> (I just checked and it seems like Yandex messes up a lot less now than
> previously, but it is still pretty bad.)
>

I guess that this is something that Yandex developers will be happy to hear
:)

More seriously, it's quite possible that they already used some of the
translations made by the Norwegian Wikipedia community. In addition to
being published as an article, each translated paragraph is saved into
parallel corpora, and machine translation developers read the edited text
and use it to improve their software. This is completely open and usable by
all machine translation developers, not only by Yandex.



> The numerical threshold does not work. The reason is simple: the number of
> fixes depends on which language constructs fail, and that is simply not a
> constant for small text fragments. Perhaps we could flag specific
> language constructs that are known to give a high percentage of failures,
> and require the translator to check those sentences. One such language
> construct is disagreement between the preposition and the gender of the
> following term in a prepositional phrase.
>

The question is how we would do it with our software. I simply cannot
imagine doing it with the current MediaWiki platform, unless we develop a
sophisticated NLP engine, although it's possible I'm exaggerating or
forgetting something.
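
To make the scale of the problem concrete, here is a minimal sketch of the
kind of check described in the quoted text, applied to the "Oppland er en
fylket" example. It is only an illustration: the tiny LEXICON table and the
flag_article_mismatches function are hypothetical, it is not connected to
MediaWiki in any way, and it only looks at an article immediately followed by
a noun it already knows. A real check would need a full morphological lexicon
and actual parsing, which is the hard part.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: noun form -> (gender, definiteness).
# A real system would need a complete morphological dictionary.
LEXICON: Dict[str, Tuple[str, str]] = {
    "fylke": ("neuter", "indefinite"),
    "fylket": ("neuter", "definite"),
    "by": ("masculine", "indefinite"),
    "byen": ("masculine", "definite"),
}

# Norwegian indefinite articles and the gender each one signals.
ARTICLES: Dict[str, str] = {"en": "masculine", "ei": "feminine", "et": "neuter"}


def flag_article_mismatches(sentence: str) -> List[Tuple[str, str]]:
    """Return (article, noun) pairs where the article disagrees with the noun.

    Only adjacent article + noun pairs are checked, and only nouns that
    appear in LEXICON; everything else is silently skipped.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    problems = []
    for article, noun in zip(tokens, tokens[1:]):
        if article not in ARTICLES or noun not in LEXICON:
            continue
        gender, definiteness = LEXICON[noun]
        # An indefinite article needs a noun of the same gender
        # in its indefinite form.
        if ARTICLES[article] != gender or definiteness == "definite":
            problems.append((article, noun))
    return problems


print(flag_article_mismatches("Oppland er en fylket i Norge"))
print(flag_article_mismatches("Oppland er et fylke i Norge"))
```

The first call flags the pair ("en", "fylket"); the second flags nothing.
Anything with an adjective between the article and the noun, or with a noun
missing from the table, passes unnoticed.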


> A language model could be a statistical model for the language itself, not
> for the translation into that language. We don't want a perfect language
> model, but a sufficient language model to mark weird constructs. A very
> simple solution could simply be to mark tri-grams that do not already
> exist in the text base for the destination as possible errors. It is not
> necessary to do a live check, but at least do it before the page can be
> saved.
>

See above—we don't have support for plugging something like that into our
workflow.

Perhaps one day some AI/machine-learning system like ORES would be able to
do it. Maybe it could be an extension to ORES itself.
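
For reference, the tri-gram idea from the quoted text can be sketched in a few
lines, independent of any MediaWiki integration. This is a rough illustration
under stated assumptions: the function names are made up for this example, the
"text base" is just a list of sentences in memory, and tokenization is a naive
lowercase word split. A real text base would be the whole destination-language
wiki, and rare but correct tri-grams would be flagged too.

```python
import re
from typing import Iterable, List, Set, Tuple

Trigram = Tuple[str, str, str]


def tokenize(text: str) -> List[str]:
    # Naive tokenization: lowercase, keep runs of word characters only.
    return re.findall(r"\w+", text.lower())


def trigrams(tokens: List[str]) -> List[Trigram]:
    # All consecutive three-word windows, in order.
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_known_trigrams(text_base: Iterable[str]) -> Set[Trigram]:
    # Collect every tri-gram that occurs in the destination-language text base.
    known: Set[Trigram] = set()
    for sentence in text_base:
        known.update(trigrams(tokenize(sentence)))
    return known


def flag_unseen_trigrams(text: str, known: Set[Trigram]) -> List[Trigram]:
    # Tri-grams of the translated text that never occur in the text base.
    return [t for t in trigrams(tokenize(text)) if t not in known]


# Hypothetical two-sentence text base, for illustration only.
text_base = [
    "Oppland er et fylke i Norge",
    "Hedmark er et fylke i Norge",
]
known = build_known_trigrams(text_base)

print(flag_unseen_trigrams("Oppland er en fylket i Norge", known))
print(flag_unseen_trigrams("Oppland er et fylke i Norge", known))
```

With this toy text base, every tri-gram of the bad sentence is flagged and the
correct sentence produces an empty list. Running such a check once before
saving, as the quoted text suggests, would avoid the cost of a live check.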


> Note the difference in what Yandex does and what we want to achieve; Yandex
> translates a text between two different languages, without any clear reason
> why. It is not too important if there are weird constructs in the text, as
> long as it is usable in "some" context. We translate a text for the purpose
> of republishing it. The text should be usable and easily readable in that
> language.
>

This is a well-known problem in machine translation: domain.

Professional industrial translation powerhouses use internally-customized
machine translation engines that specialize in particular domains, such as
medicine, law, or news. In theory, it would make a lot of sense to have a
customized machine translation engine for encyclopedic articles, or maybe
even for several different styles of encyclopedic articles (biography,
science, history, etc.). For now what we have is a very general-purpose
consumer-oriented engine. I hope it changes in the future.
