Re: [Wikimedia-l] machine translation

Wojciech Pędzich Wed, 03 May 2017 01:32:49 -0700

It does depend a lot on the engagement level of the human behind the keyboard. When I deal with machine-translated text, I simply wonder whether the person behind the keyboard made the effort to actually read the piece.

Now, whether this would work if limited to namespaces outside "main": I do not want to demonise the issue, but if the person submitting the text for machine translation does not read it, what will stop them from a quick ctrl+c / ctrl+v? Just asking.


Wojciech

On 2017-05-03 at 09:33, Yaroslav Blanter wrote:

Creating machine translations only in the draft space (or in the user space
in projects that do not have a draft space) could help.

Cheers
Yaroslav

On Tue, May 2, 2017 at 10:16 PM, Pharos <[email protected]>
wrote:

I think it all depends on the level of engagement of the human translator.

When the tool is used in the right way, it is a fantastic tool.

Maybe we can find better methods to nudge people toward taking their time
and really doing work on their translations.

Thanks,
Pharos

On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <
[email protected]> wrote:

Content translation with Yandex is also a problem in Bengali Wikipedia.
Some users have developed a habit of creating meaningless machine-translated
articles with this extension to increase their edit count and article
count. This has increased the workload of admins, who have to find and
delete those articles.

Yandex is not ready for many languages and it is better to shut it off. We
don't need it in Bengali.

Regards

On May 3, 2017 12:17 AM, "John Erling Blad" <[email protected]> wrote:

Actually this _is_ about turning ContentTranslation off; that is what
several users in the community want. They block people using the extension
and delete the translated articles. Use of ContentTranslation has become a
rather contentious case.

Yandex as a general translation engine for reading some alien language is
quite good, but as an engine for producing written text it is not very good
at all. In fact it often creates quite horrible Norwegian, even for closely
related languages. One quite common problem is reordering of words into
meaningless constructs; another problem is reordering lexical gender in
weird ways. The English article "a" is often translated as "en" in a noun
phrase, and then the gender is added to the following word. That gives a
translation of "Oppland is a county in…" into something like "Oppland er en
fylket i…" This should be "Oppland er et fylke i…".

(I just checked and it seems like Yandex messes up a lot less now than
previously, but it is still pretty bad.)

Apertium works because the languages are closely related; Yandex does not
work because it is used between very different languages. People try to use
Yandex and get disappointed, and falsely conclude that all language
translations are equally weird. They are not, but Yandex translations are
weird.

The numerical threshold does not work. The reason is simple: the number of
fixes depends on which language constructs fail, and that is simply not a
constant for small text fragments. Perhaps we could flag specific language
constructs that are known to give a high percentage of failures, and require
the translator to check those sentences. One such language construct is a
discrepancy between the article and the gender of the following term in a
noun phrase. If they do not agree, then the sentence must be checked. It is
not always wrong to write "en jenta" in Norwegian, but it is likely to be
wrong.
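
The agreement check described above could be sketched roughly as follows.
This is only an illustration under stated assumptions: the tiny lexicon, the
gender codes, and the function name flag_gender_mismatches are all
hypothetical, and a real check would need a full Norwegian lexicon. It flags
sentences for human review; it does not decide that they are wrong.

```python
# Toy lexicon (hypothetical, for illustration only): noun form -> gender.
NOUN_GENDER = {
    "fylke": "n", "fylket": "n",   # neuter
    "jente": "f", "jenta": "f",    # feminine
}

# Indefinite articles and the gender each one signals.
ARTICLE_GENDER = {"en": "m", "ei": "f", "et": "n"}


def flag_gender_mismatches(sentence):
    """Return (article, noun) pairs whose genders disagree.

    A flagged pair is only a candidate for review, not a confirmed error.
    """
    words = [w.strip('.,;:!?"()…').lower() for w in sentence.split()]
    flagged = []
    for article, noun in zip(words, words[1:]):
        if article in ARTICLE_GENDER and noun in NOUN_GENDER:
            if ARTICLE_GENDER[article] != NOUN_GENDER[noun]:
                flagged.append((article, noun))
    return flagged


if __name__ == "__main__":
    print(flag_gender_mismatches("Oppland er en fylket i Norge"))
    print(flag_gender_mismatches("Oppland er et fylke i Norge"))
```

Because "en jenta" is not always wrong, the sketch only marks such pairs so
that the translator has to look at the sentence before publishing.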

A language model could be a statistical model for the language itself, not
for the translation into that language. We don't want a perfect language
model, but a sufficient language model to mark weird constructs. A very
simple solution could simply be to mark tri-grams that do not already exist
in the text base for the destination language as possible errors. It is not
necessary to do a live check, but at least do it before the page can be
saved.
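
A minimal sketch of the tri-gram idea, assuming the "text base" is simply a
list of existing article texts in the destination language. The function
names (trigrams, build_known_trigrams, unseen_trigrams) and the word-level
tokenisation are assumptions made for this example, not part of
ContentTranslation.

```python
import re


def trigrams(text):
    """Split text into lowercase words and return its word tri-grams."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def build_known_trigrams(corpus_texts):
    """Collect every tri-gram seen in the destination-language text base."""
    known = set()
    for text in corpus_texts:
        known.update(trigrams(text))
    return known


def unseen_trigrams(translation, known):
    """Return tri-grams of the translation that never occur in the text base.

    These are marked as possible errors, not as confirmed ones.
    """
    return [t for t in trigrams(translation) if t not in known]


if __name__ == "__main__":
    corpus = ["Oppland er et fylke i Norge", "Hedmark er et fylke i Norge"]
    known = build_known_trigrams(corpus)
    for gram in unseen_trigrams("Oppland er en fylket i Norge", known):
        print("possible error:", " ".join(gram))
```

With a text base this small almost everything is unseen; the approach only
becomes meaningful when the set is built from a large dump, and the check
can run once at save time rather than live.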

Note the difference between what Yandex does and what we want to achieve:
Yandex translates a text between two different languages, without any clear
reason why. It is not too important if there are weird constructs in the
text, as long as it is usable in "some" context. We translate a text for the
purpose of republishing it. The text should be usable and easily readable in
that language.



On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni <
[email protected]> wrote:

2017-05-02 18:20 GMT+03:00 John Erling Blad <[email protected]>:

Brute force solution: turn ContentTranslation off. Really stupid solution.


... Then I guess you don't mind that I'm changing the thread name :)

The next solution; turn the Yandex engine off. That would solve a
part of the problem. Kind of lousy solution though.

What about adding a language model that warns when the language constructs
get too weird? It is like a "test" for the translation. The CT is used for
creating a translation, but the language model is used for verifying whether
the translation is good enough. If it does not validate against the language
model it should simply not be published to the main namespace. It will
still be possible to create a draft, but then the user is completely aware
that the translation isn't good enough.

Such a language model should be available as a test for any article, as it
can be used as a quality measure for the article. It is really a quantity
measure for the well-spokenness of the article, but that isn't quite so
intuitive.

So, I'll allow myself to guess that you are talking about one particular
language, probably Norwegian.

Several technical facts:

1. In the past there were several cases in which translators into different
languages reported common translation mistakes to me. I passed them on to
Yandex developers, with whom I communicate quite regularly. They
acknowledged receiving all of them. I am aware of at least one such common
mistake that was fixed; possibly there were more. If you can give me a list
of such mistakes for Norwegian, I'll be very happy to pass them on. I
absolutely cannot promise that they will be fixed upstream, but it's
possible.

2. In Norwegian, Apertium is used for translating between the two varieties
of Norwegian itself (Bokmål and Nynorsk), and from other Scandinavian
languages. That's probably why it works so well: they are similar in
grammar, vocabulary, and narrative style (I'll pass it on to Apertium
developers; I'm sure they'll be happy to hear it). Unfortunately, machine
translation from English is not available in Apertium. Apertium works best
with very similar languages, and English has two characteristics, which are
unfortunate when combined: it is both the most popular source for
translation into almost all other languages (including Norwegian), and it
is not _very_ similar to any other languages (except maybe Scots). Machine
translation from English into Norwegian is only possible with Yandex at the
moment. More engines may be added in the future, but at the moment that's
all we have. That's why disabling Yandex completely would indeed be a lousy
solution: A lot of people say that without machine translation integration
Content Translation is useless. Not all users think like that, but many do.

3. We can define a numerical threshold of acceptable percentage of machine
translation post-editing. Currently it's 75%. It's a tad embarrassing that
it's hard-coded at the moment, but it can very easily be made into a
variable per language. If the translator tries to publish a page in which
less than that is modified, a warning will be shown.

4. I'm not sure what you mean by "language model". If it's any kind of a
linguistic engine, then it's definitely not within the resources that the
Language team itself can currently dedicate. However, if somebody who knows
Norwegian and some programming will write a script that analyzes common bad
constructs in a Wikipedia dump, this will be very useful. This would
basically be an upgraded version of suggestion #1 above. (In my spare time
as a volunteer I'm doing something comparable for Hebrew, although not for
translation, but for improving how MediaWiki link trails work.)
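
The per-language threshold from point 3 could look roughly like the sketch
below. Everything except the 75% figure is an assumption: the names
DEFAULT_THRESHOLD, THRESHOLDS, modified_fraction and should_warn are
hypothetical, the per-language override is an invented example value, and
measuring modification with a word-level difflib comparison is a stand-in,
not how ContentTranslation actually measures it.

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD = 0.75        # the 75% figure quoted in point 3
THRESHOLDS = {"nb": 0.80}       # hypothetical per-language override


def modified_fraction(machine_text, edited_text):
    """Estimate how much of the machine output was changed (0.0 to 1.0).

    Assumption: word-level similarity from difflib stands in for whatever
    measure the extension really uses.
    """
    a, b = machine_text.split(), edited_text.split()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def should_warn(machine_text, edited_text, lang):
    """Warn when less than the language's threshold has been modified."""
    threshold = THRESHOLDS.get(lang, DEFAULT_THRESHOLD)
    return modified_fraction(machine_text, edited_text) < threshold


if __name__ == "__main__":
    machine = "Oppland er en fylket i Norge"
    print(should_warn(machine, machine, "nb"))                # unedited text
    print(should_warn(machine, "Helt ny tekst her", "nb"))    # rewritten text
```

Keeping the threshold in a lookup table keyed by language code is one simple
way to make the value configurable per wiki without touching the check.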
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: [email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
<mailto:[email protected]?subject=unsubscribe>
