Re: [Wikimedia-l] machine translation

John Erling Blad Wed, 03 May 2017 04:23:42 -0700

Agree! I also wonder if translators adapt to specific errors if they are
repeated too often. I wonder if it works like priming the brain to a
specific pattern.


On Wed, May 3, 2017 at 1:15 PM, Lodewijk <[email protected]>
wrote:

> Reading this, I get a strong impression the problem may very well be in
> setting expectations for the users of this translation tool. If they expect
> the automated translation to be rather good, they may get fed up more
> easily than when they consider it primarily a glorified dictionary.
>
> Lodewijk
>
> On Wed, May 3, 2017 at 1:06 PM, David Cuenca Tudela <[email protected]>
> wrote:
>
> > Perhaps it would be a good idea to compare the translated text to the
> > text that the user wants to save.
> >
> > If they are more than 95% the same, that means that the user didn't take
> > the effort to correct the text.
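
A minimal sketch of the comparison suggested here, assuming a word-level
similarity ratio from Python's standard difflib is an acceptable stand-in
for "percent the same". The function names and the 0.95 default are
illustrative only and are not taken from ContentTranslation.

```python
from difflib import SequenceMatcher


def similarity(machine_text: str, saved_text: str) -> float:
    """Return a ratio in [0, 1] comparing the two texts word by word."""
    return SequenceMatcher(None, machine_text.split(), saved_text.split()).ratio()


def looks_unedited(machine_text: str, saved_text: str, threshold: float = 0.95) -> bool:
    """True when the saved text is more than `threshold` similar to the raw output.

    The threshold is the 95% figure from the suggestion above; it is a
    hypothetical default, not a tuned value.
    """
    return similarity(machine_text, saved_text) > threshold


if __name__ == "__main__":
    raw = "Oppland er en fylket i Norge"
    fixed = "Oppland er et fylke i Norge"
    print(looks_unedited(raw, raw))    # saved without any change
    print(looks_unedited(raw, fixed))  # two of six words corrected
```

On a short sentence a two-word correction already moves the ratio a long
way, so a check like this is probably more meaningful on whole articles
than on small fragments.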
> >
> > Cheers,
> > Micru
> >
> > On Wed, May 3, 2017 at 10:31 AM, Wojciech Pędzich <[email protected]>
> > wrote:
> >
> > > It does depend a lot on the engagement level of the human behind the
> > > keyboard. When I deal with machine-translated text, I simply wonder
> > > whether the person behind the keyboard made the effort to actually
> > > read the piece.
> > >
> > > Now whether this would work if limited to namespaces outside "main" - I
> > > do not want to demonise the issue, but if the person submitting the
> > > text for machine translation does not read it, what will stop them
> > > from a quick ctrl+c / ctrl+v? Just asking.
> > >
> > > Wojciech
> > >
> > > On 2017-05-03 at 09:33, Yaroslav Blanter wrote:
> > >
> > >> Creating machine translations only in the draft space (or in the user
> > >> space in the projects which do not have draft) could help.
> > >>
> > >> Cheers
> > >> Yaroslav
> > >>
> > >> On Tue, May 2, 2017 at 10:16 PM, Pharos <[email protected]>
> > >> wrote:
> > >>
> > >>> I think it all depends on the level of engagement of the human
> > >>> translator.
> > >>>
> > >>> When the tool is used in the right way, it is a fantastic tool.
> > >>>
> > >>> Maybe we can find better methods to nudge people toward taking their
> > >>> time and really doing work on their translations.
> > >>>
> > >>> Thanks,
> > >>> Pharos
> > >>>
> > >>> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <
> > >>> [email protected]> wrote:
> > >>>
> > >>>> Content translation with Yandex is also a problem in Bengali
> > >>>> Wikipedia. Some users have developed a tendency to create
> > >>>> meaningless machine-translated articles with this extension to
> > >>>> increase their edit count and article count. This has increased the
> > >>>> workload of admins, who have to find and delete those articles.
> > >>>>
> > >>>> Yandex is not ready for many languages and it is better to shut it
> > >>>> off. We don't need it in Bengali.
> > >>>>
> > >>>> Regards
> > >>>> On May 3, 2017 12:17 AM, "John Erling Blad" <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Actually this _is_ about turning ContentTranslation off, that is
> > >>>>> what several users in the community want. They block people using
> > >>>>> the extension and delete the translated articles. Use of
> > >>>>> ContentTranslation has become a rather contentious case.
> > >>>>>
> > >>>>> Yandex as a general translation engine to be able to read some
> > >>>>> alien language is quite good, but as an engine to produce written
> > >>>>> text it is not very good at all. In fact it often creates quite
> > >>>>> horrible Norwegian, even for closely related languages. One quite
> > >>>>> common problem is reordering of words into meaningless constructs,
> > >>>>> another problem is reordering lexical gender in weird ways. The
> > >>>>> English indefinite article "a" is often translated as "en" in a
> > >>>>> noun phrase, and then the gender is added to the following word.
> > >>>>> That gives a translation of "Oppland is a county in…" into
> > >>>>> something like "Oppland er en fylket i…" This should be "Oppland
> > >>>>> er et fylke i…".
> > >>>>>
> > >>>>> (I just checked and it seems like Yandex messes up a lot less now
> > >>>>> than previously, but it is still pretty bad.)
> > >>>>>
> > >>>>> Apertium works because the languages are closely related, Yandex
> > >>>>> does not work because it is used between very different languages.
> > >>>>> People try to use Yandex and get disappointed, and falsely conclude
> > >>>>> that all language translations are equally weird. They are not, but
> > >>>>> Yandex translations are weird.
> > >>>>>
> > >>>>> The numerical threshold does not work. The reason is simple: the
> > >>>>> number of fixes depends on which language constructs fail, and that
> > >>>>> is simply not a constant for small text fragments. Perhaps we could
> > >>>>> flag specific language constructs that are known to give a high
> > >>>>> percentage of failures, and require the translator to check those
> > >>>>> sentences. One such construct is a discrepancy between the article
> > >>>>> and the gender of the following noun. If they do not agree, then
> > >>>>> the sentence must be checked. It is not always wrong to write "en
> > >>>>> jenta" in Norwegian, but it is likely to be wrong.
> > >>>>>
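
A rough sketch of the "en fylket" / "en jenta" check described above,
assuming a small word list with gender and definiteness is available. The
lexicon below is a made-up toy with six entries and the function name is
hypothetical; a real check would need a proper Norwegian morphological
lexicon. It only flags a pair for human review and does not claim the
pair is wrong.

```python
# Indefinite articles in Bokmål and the noun genders each may precede.
# "en" is accepted before feminine nouns as well, since Bokmål allows that.
ARTICLE_GENDERS = {"en": {"m", "f"}, "ei": {"f"}, "et": {"n"}}

# Hypothetical toy lexicon: word form -> (gender, is_definite_form).
LEXICON = {
    "fylke": ("n", False),
    "fylket": ("n", True),
    "jente": ("f", False),
    "jenta": ("f", True),
    "by": ("m", False),
    "byen": ("m", True),
}


def pairs_to_check(sentence):
    """Return (article, noun) pairs a translator should look at again."""
    words = sentence.lower().replace(".", " ").replace(",", " ").split()
    flagged = []
    for article, noun in zip(words, words[1:]):
        if article not in ARTICLE_GENDERS or noun not in LEXICON:
            continue  # nothing known about this pair, so say nothing
        gender, definite = LEXICON[noun]
        # Suspicious: wrong gender, or indefinite article before a definite form.
        if definite or gender not in ARTICLE_GENDERS[article]:
            flagged.append((article, noun))
    return flagged


if __name__ == "__main__":
    print(pairs_to_check("Oppland er en fylket i Norge"))
    print(pairs_to_check("Oppland er et fylke i Norge"))
```

Words missing from the lexicon are silently skipped, so the check stays
quiet rather than guessing.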
> > >>>>> A language model could be a statistical model for the language
> > >>>>> itself, not for the translation into that language. We don't want a
> > >>>>> perfect language model, but a sufficient language model to mark
> > >>>>> weird constructs. A very simple solution could simply be to mark
> > >>>>> trigrams that do not already exist in the text base for the
> > >>>>> destination as possible errors. It is not necessary to do a live
> > >>>>> check, but at least do it before the page can be saved.
> > >>>>>
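
The trigram idea above can be sketched in a few lines, assuming the
destination wiki's existing text is available as a list of sentences.
Splitting on whitespace and lowercasing is a simplification, and all names
here are illustrative; nothing below comes from ContentTranslation itself.

```python
def trigrams(text):
    """Split text into lowercase words and return every run of three words."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def build_index(corpus_sentences):
    """Collect every trigram that already occurs in the destination text base."""
    seen = set()
    for sentence in corpus_sentences:
        seen.update(trigrams(sentence))
    return seen


def unseen_trigrams(candidate, seen):
    """Return the candidate's trigrams that never occur in the text base."""
    return [t for t in trigrams(candidate) if t not in seen]


if __name__ == "__main__":
    # Stand-in for a text base extracted from the destination wiki.
    corpus = ["Oppland er et fylke i Norge", "Hedmark er et fylke i Norge"]
    seen = build_index(corpus)
    for t in unseen_trigrams("Oppland er en fylket i Norge", seen):
        print("possible error:", " ".join(t))
```

With a text base this small almost any new sentence would be flagged, so
the result is best treated as a list of places to look at before saving,
not as a verdict.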
> > >>>>> Note the difference in what Yandex does and what we want to
> > >>>>> achieve; Yandex translates a text between two different languages,
> > >>>>> without any clear reason why. It is not too important if there are
> > >>>>> weird constructs in the text, as long as it is usable in "some"
> > >>>>> context. We translate a text for the purpose of republishing it.
> > >>>>> The text should be usable and easily readable in that language.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni <
> > >>>>> [email protected]> wrote:
> > >>>>>
> > >>>>>> 2017-05-02 18:20 GMT+03:00 John Erling Blad <[email protected]>:
> > >>>>>>
> > >>>>>>> Brute force solution; turn the ContentTranslation off. Really
> > >>>>>>> stupid solution.
> > >>>>>>
> > >>>>>> ... Then I guess you don't mind that I'm changing the thread name :)
> > >>>>>>
> > >>>>>>
> > >>>>>> The next solution; turn the Yandex engine off. That would solve a
> > >>>>>>> part of the problem. Kind of lousy solution though.
> > >>>>>>>
> > >>>>>>> What about adding a language model that warns when the language
> > >>>>>>> constructs get too weird? It is like a "test" for the
> > >>>>>>> translation. The CT is used for creating a translation, but the
> > >>>>>>> language model is used for verifying if the translation is good
> > >>>>>>> enough. If it does not validate against the language model it
> > >>>>>>> should simply not be published to the main name space. It will
> > >>>>>>> still be possible to create a draft, but then the user is
> > >>>>>>> completely aware that the translation isn't good enough.
> > >>>>>>>
> > >>>>>>> Such a language model should be available as a test for any
> > >>>>>>> article, as it can be used as a quality measure for the article.
> > >>>>>>> It is really a quantity measure for the well-spokenness of the
> > >>>>>>> article, but that isn't quite so intuitive.
> > >>>>>>>
> > >>>>>> So, I'll allow myself to guess that you are talking about one
> > >>>>>> particular language, probably Norwegian.
> > >>>>>>
> > >>>>>> Several technical facts:
> > >>>>>>
> > >>>>>> 1. In the past there were several cases in which translators to
> > >>>>>> different languages reported common translation mistakes to me. I
> > >>>>>> passed them on to Yandex developers, with whom I communicate quite
> > >>>>>> regularly. They acknowledged receiving all of them. I am aware of
> > >>>>>> at least one such common mistake that was fixed; possibly there
> > >>>>>> were more. If you can give me a list of such mistakes for
> > >>>>>> Norwegian, I'll be very happy to pass them on. I absolutely cannot
> > >>>>>> promise that they will be fixed upstream, but it's possible.
> > >>>>>>
> > >>>>>> 2. In Norwegian, Apertium is used for translating between the two
> > >>>>>> varieties of Norwegian itself (Bokmål and Nynorsk), and from other
> > >>>>>> Scandinavian languages. That's probably why it works so well: they
> > >>>>>> are similar in grammar, vocabulary, and narrative style (I'll pass
> > >>>>>> it on to Apertium developers; I'm sure they'll be happy to hear
> > >>>>>> it). Unfortunately, machine translation from English is not
> > >>>>>> available in Apertium. Apertium works best with very similar
> > >>>>>> languages, and English has two characteristics which are
> > >>>>>> unfortunate when combined: it is both the most popular source for
> > >>>>>> translation into almost all other languages (including Norwegian),
> > >>>>>> and it is not _very_ similar to any other languages (except maybe
> > >>>>>> Scots). Machine translation from English into Norwegian is only
> > >>>>>> possible with Yandex at the moment. More engines may be added in
> > >>>>>> the future, but at the moment that's all we have. That's why
> > >>>>>> disabling Yandex completely would indeed be a lousy solution: a lot
> > >>>>>> of people say that without machine translation integration Content
> > >>>>>> Translation is useless. Not all users think like that, but many do.
> > >>>>>
> > >>>>>> 3. We can define a numerical threshold of acceptable percentage of
> > >>>>>> machine translation post-editing. Currently it's 75%. It's a tad
> > >>>>>> embarrassing, but it's hard-coded at the moment; it can very easily
> > >>>>>> be made into a variable per language. If the translator tries to
> > >>>>>> publish a page in which less than that is modified, a warning will
> > >>>>>> be shown.
> > >>>>>>
> > >>>>>> 4. I'm not sure what you mean by "language model". If it's any kind
> > >>>>>> of a linguistic engine, then it's definitely not within the
> > >>>>>> resources that the Language team itself can currently dedicate.
> > >>>>>> However, if somebody who knows Norwegian and some programming will
> > >>>>>> write a script that analyzes common bad constructs in a Wikipedia
> > >>>>>> dump, this will be very useful. This would basically be an upgraded
> > >>>>>> version of suggestion #1 above. (In my spare time as a volunteer
> > >>>>>> I'm doing something comparable for Hebrew, although not for
> > >>>>>> translation, but for improving how MediaWiki link trails work.)
> > >>>>>> _______________________________________________
> > >>>>>> Wikimedia-l mailing list, guidelines at:
> > >>>>>> https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> > >>>>>> https://meta.wikimedia.org/wiki/Wikimedia-l
> > >>>>>> New messages to: [email protected]
> > >>>>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > >>>>>> <mailto:[email protected]?subject=unsubscribe>
> > >>>>>>
> > >
> > >
> > >
> >
> >
> >
> > --
> > Etiamsi omnes, ego non
