[Wikimedia-l] Tech Talk: Wikimedia Foundation Technology and Product Q&A Session #2

2017-05-02 Thread Srishti Sethi
Hello everyone,

Please join us for the Wikimedia Foundation Technology and Product Q&A
Session #2 by Victoria Coleman (CTO) and Toby Negrin (Interim VP of
Product) on May 9, 2017, at 17:00 UTC via YouTube Live.

Link to live YouTube stream: https://www.youtube.com/watch?v=Q4kfgU9SZcg

IRC channel for questions/discussion: #wikimedia-office

More details:

This talk is a follow-up to the Wikimedia Developer Summit session
and will address the next set of questions gathered via a voting survey
for the summit:


   - For WMF dev teams, what is the right balance between pushing their own
   work versus seeking and supporting volunteer contributors?

   - Do we have a plan to bring our developer documentation to the level of
   a top Internet website or a major free software project?

   - How can volunteers bring ideas to and influence the WMF annual plans
   and quarterly goals? (Currently, when plans are published, it is too
   late.)

   - What vision do you see for MediaWiki and volunteer developers five
   years from now?

Looking forward to your presence!

Best,
Srishti

--
Srishti Sethi
Developer Advocate
Technical Collaboration team
Wikimedia Foundation

https://www.mediawiki.org/wiki/User:SSethi_(WMF)
___
Wikimedia-l mailing list, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 


Re: [Wikimedia-l] Turn the extension for ContentTranslation off?

2017-05-02 Thread Nick Wilson (Quiddity)
Note, this thread was forked into a different subject/title.
See - "[Wikimedia-l] machine translation" - for newer discussion and detail.

​
___


Re: [Wikimedia-l] machine translation

2017-05-02 Thread John Erling Blad
Actually this _is_ about turning ContentTranslation off; that is what
several users in the community want. They block people using the extension
and delete the translated articles. Use of ContentTranslation has become a
rather contentious issue.

Yandex as a general translation engine, for reading some foreign language,
is quite good, but as an engine to produce written text it is not very good
at all. In fact it often creates quite horrible Norwegian, even for closely
related languages. One quite common problem is reordering of words into
meaningless constructs; another problem is mangling lexical gender in weird
ways. The English article "a" is often translated as "en" in a prepositional
phrase, and then the gender is added to the following phrase. That gives a
translation of "Oppland is a county in…" into something like "Oppland er en
fylket i…"; this should be "Oppland er et fylke i…".

(I just checked and it seems like Yandex messes up a lot less now than
previously, but it is still pretty bad.)

Apertium works because the languages are closely related; Yandex does not
work because it is used between very different languages. People try
Yandex, get disappointed, and wrongly conclude that all machine
translations are equally weird. They are not, but Yandex translations are
weird.

The numerical threshold does not work. The reason is simple: the number of
fixes depends on which language constructs fail, and that is simply not
constant for small text fragments. Perhaps we could flag specific language
constructs that are known to give a high percentage of failures, and
require the translator to check those sentences. One such construct is a
mismatch between the article and the gender of the following term in a
prepositional phrase. If they do not agree, the sentence must be checked.
It is not always wrong to write "en jenta" in Norwegian, but it is likely
to be wrong.

A language model could be a statistical model of the language itself, not
of the translation into that language. We don't need a perfect language
model, but a language model sufficient to mark weird constructs. A very
simple solution could be to mark trigrams that do not already exist in the
text base for the destination language as possible errors. It is not
necessary to do a live check, but it should at least be done before the
page can be saved.
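The trigram idea above could be sketched as follows (hypothetical code, not an existing feature; a real deployment would build the known-trigram set from a full destination-language dump rather than the toy corpus here):

```python
def trigrams(words):
    # Consecutive three-word windows of a token list.
    return list(zip(words, words[1:], words[2:]))

def build_known_trigrams(corpus_sentences):
    # Collect every trigram seen in the destination-language corpus.
    known = set()
    for sentence in corpus_sentences:
        known.update(trigrams(sentence.lower().split()))
    return known

def possible_errors(text, known):
    # Trigrams never seen in the corpus are marked as possible errors,
    # to be reviewed before the page can be saved.
    return [t for t in trigrams(text.lower().split()) if t not in known]

corpus = ["Oppland er et fylke i Norge"]
known = build_known_trigrams(corpus)
# All four trigrams of the bad sentence are unseen, so all are flagged.
print(possible_errors("Oppland er en fylket i Norge", known))
```

With a realistic corpus, only genuinely rare constructs would be flagged, which matches the "check before saving" idea rather than live validation.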

Note the difference between what Yandex does and what we want to achieve:
Yandex translates a text between two different languages, without any
clear purpose. It is not too important if there are weird constructs in
the text, as long as it is usable in "some" context. We translate a text
for the purpose of republishing it. The text should be usable and easily
readable in that language.




[Wikimedia-l] Recognition of the Commons Photographers User Group

2017-05-02 Thread Kirill Lokshin
Hi everyone!

I'm very happy to announce that the Affiliations Committee has recognized the
Commons Photographers User Group [1] as a Wikimedia User Group. The group
is an international cooperative of photography enthusiasts who publish
their images under free licenses, with the goal of documenting the world
visually and letting others benefit from their work through Wikipedia and
other projects.

Please join me in congratulating the members of this new user group!

Regards,
Kirill Lokshin
Chair, Affiliations Committee

[1] https://meta.wikimedia.org/wiki/Commons_Photographers_User_Group
___


Re: [Wikimedia-l] machine translation

2017-05-02 Thread Amir E. Aharoni
2017-05-02 21:47 GMT+03:00 John Erling Blad :

> Yandex as a general translation engine, for reading some foreign
> language, is quite good, but as an engine to produce written text it is
> not very good at all.


... Nor is it supposed to be.

A translator is a person. Machine translation software is not a person,
it's software. It's a tool that is supposed to help a human translator
produce a good written text more quickly. If it doesn't make this work
faster, it can and should be disabled. If no translator


> In fact it often creates quite horrible Norwegian, even for closely
> related languages. One quite common problem is reordering of words into
> meaningless constructs; another problem is mangling lexical gender in
> weird ways. The English article "a" is often translated as "en" in a
> prepositional phrase, and then the gender is added to the following
> phrase. That gives a translation of "Oppland is a county in…" into
> something like "Oppland er en fylket i…"; this should be "Oppland er
> et fylke i…".
>

I suggest making a page with a list of such examples, so that the machine
translation developers could read it.


> (I just checked and it seems like Yandex messes up a lot less now than
> previously, but it is still pretty bad.)
>

I guess that this is something that Yandex developers will be happy to hear
:)

More seriously, it's quite possible that they have already used some of the
translations made by the Norwegian Wikipedia community. In addition to
being published as an article, each translated paragraph is saved into a
parallel corpus, and machine translation developers read the edited text
and use it to improve their software. This is completely open and usable by
all machine translation developers, not only Yandex.



> The numerical threshold does not work. The reason is simple: the number
> of fixes depends on which language constructs fail, and that is simply
> not constant for small text fragments. Perhaps we could flag specific
> language constructs that are known to give a high percentage of
> failures, and require the translator to check those sentences. One such
> construct is a mismatch between the article and the gender of the
> following term in a prepositional phrase.
>

The question is how we would do it with our software. I simply cannot
imagine doing it on the current MediaWiki platform unless we develop a
sophisticated NLP engine, although it's possible I'm exaggerating or
forgetting something.


> A language model could be a statistical model of the language itself,
> not of the translation into that language. We don't need a perfect
> language model, but a language model sufficient to mark weird
> constructs. A very simple solution could be to mark trigrams that do not
> already exist in the text base for the destination language as possible
> errors. It is not necessary to do a live check, but it should at least
> be done before the page can be saved.
>

See above—we don't have support for plugging something like that into our
workflow.

Perhaps one day some AI/machine-learning system like ORES would be able to
do it. Maybe it could be an extension to ORES itself.


> Note the difference between what Yandex does and what we want to
> achieve: Yandex translates a text between two different languages,
> without any clear purpose. It is not too important if there are weird
> constructs in the text, as long as it is usable in "some" context. We
> translate a text for the purpose of republishing it. The text should be
> usable and easily readable in that language.
>

This is a well-known problem in machine translation: domain.

Professional industrial translation powerhouses use internally customized
machine translation engines that specialize in particular domains, such as
medicine, law, or news. In theory, it would make a lot of sense to have a
customized machine translation engine for encyclopedic articles, or maybe
even for several different styles of encyclopedic articles (biography,
science, history, etc.). For now what we have is a very general-purpose
consumer-oriented engine. I hope it changes in the future.
___


[Wikimedia-l] New Free Research Accounts via the Wikipedia Library

2017-05-02 Thread Jake Orlowitz
Hi!
The Wikipedia Library has new, free research access available:

* Bloomsbury (Who's Who, Drama Online, Berg Fashion Library, and
Whitaker's):


* American Psychiatric Association (Psychiatry books and journals):


* Gaudeamus (Finnish publisher specializing in humanities and social
sciences):


* Ympäristö-lehti (The Finnish Environment Institute's Ympäristö-lehti
magazine):
<http://fi.wikipedia.org/wiki/Wikipedia:Wikipedian_Lähdekirjasto/Ympäristö-lehti>

And expanded resources from:

* Gale - Biography in Context, new database


* Adam Matthew - All 53 collections now available


Many other partnerships with accounts are listed at:


Do better research and help expand the use of high-quality references
across Wikipedia projects: sign up today!

-The Wikipedia Library Team
wikipedialibr...@wikimedia.org

___


[Wikimedia-l] machine translation

2017-05-02 Thread Amir E. Aharoni
2017-05-02 18:20 GMT+03:00 John Erling Blad :

> Brute-force solution: turn ContentTranslation off. Really stupid
> solution.


... Then I guess you don't mind that I'm changing the thread name :)


> The next solution: turn the Yandex engine off. That would solve part of
> the problem. Kind of a lousy solution though.
>

> What about adding a language model that warns when the language
> constructs get too weird? It is like a "test" for the translation. CT is
> used for creating a translation, but the language model is used for
> verifying whether the translation is good enough. If it does not
> validate against the language model, it should simply not be published
> to the main namespace. It will still be possible to create a draft, but
> then the user is fully aware that the translation isn't good enough.
>
> Such a language model should be available as a test for any article, as
> it can be used as a quality measure for the article. It is really a
> quantitative measure of how well-written the article is, but that isn't
> quite so intuitive.
>

So, I'll allow myself to guess that you are talking about one particular
language, probably Norwegian.

Several technical facts:

1. In the past there were several cases in which translators into different
languages reported common translation mistakes to me. I passed them on
to the Yandex developers, with whom I communicate quite regularly. They
acknowledged receiving all of them. I am aware of at least one such common
mistake that was fixed; possibly there were more. If you can give me a list
of such mistakes for Norwegian, I'll be very happy to pass them on. I
absolutely cannot promise that they will be fixed upstream, but it's
possible.

2. For Norwegian, Apertium is used for translating between the two varieties
of Norwegian itself (Bokmål and Nynorsk), and from other Scandinavian
languages. That's probably why it works so well—they are similar in
grammar, vocabulary, and narrative style (I'll pass it on to the Apertium
developers—I'm sure they'll be happy to hear it). Unfortunately, machine
translation from English is not available in Apertium. Apertium works best
with very similar languages, and English has two characteristics which are
unfortunate when combined: it is both the most popular source for
translation into almost all other languages (including Norwegian), and it
is not _very_ similar to any other language (except maybe Scots). Machine
translation from English into Norwegian is only possible with Yandex at the
moment. More engines may be added in the future, but at the moment that's
all we have. That's why disabling Yandex completely would indeed be a lousy
solution: a lot of people say that without machine translation integration
Content Translation is useless. Not all users think like that, but many do.

3. We can define a numerical threshold for the acceptable percentage of
machine translation post-editing. Currently it's 75%. It's a tad
embarrassing that it's hard-coded at the moment, but it can very easily be
made into a per-language variable. If the translator tries to publish a
page in which less than that is modified, a warning will be shown.

4. I'm not sure what you mean by "language model". If it's any kind of
linguistic engine, then it's definitely not within the resources that the
Language team itself can currently dedicate. However, if somebody who knows
Norwegian and some programming writes a script that analyzes common bad
constructs in a Wikipedia dump, that would be very useful. It would
basically be an upgraded version of suggestion #1 above. (In my spare time
as a volunteer I'm doing something comparable for Hebrew, although not for
translation, but for improving how MediaWiki link trails work.)
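The threshold check in point 3 could be sketched like this (an illustration only — the actual Content Translation implementation and its metric are not shown here; difflib similarity stands in as a crude proxy for the real modification measure, and the 75% figure is the one quoted above):

```python
import difflib

THRESHOLD = 0.75  # assumed maximum share of unmodified machine translation

def warn_on_publish(machine_text, published_text, threshold=THRESHOLD):
    # Similarity between the raw MT output and the published text, used
    # here as a rough stand-in for how much of the MT was left unedited.
    kept = difflib.SequenceMatcher(None, machine_text, published_text).ratio()
    return kept > threshold  # True -> show a warning before publishing

mt = "Oppland er en fylket i Norge."
lightly_edited = "Oppland er et fylke i Norge."
print(warn_on_publish(mt, lightly_edited))  # True: barely post-edited

rewritten = "Fylket Oppland ligger sentralt i Norge, nord for Oslo."
print(warn_on_publish(mt, rewritten))  # False: substantially rewritten
```

Making `threshold` a parameter is what "a variable per language" would amount to: each wiki could tune how much unedited machine translation it tolerates.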
___


Re: [Wikimedia-l] Turn the extension for ContentTranslation off?

2017-05-02 Thread Lodewijk
Hi John,

Could you provide a bit more context? From which language are you drawing
these experiences? Did you consider filing a Phabricator request for the
technical component that could be improved (if so, could you link to it)?
Could you also provide some links to the discussions that are causing the
internal fighting you refer to?

I'd be curious to understand better what you're talking about before taking
a position. Thanks!

Best,
Lodewijk

2017-05-02 17:20 GMT+02:00 John Erling Blad :

> Yes, I wonder if the extension for content translation should be turned
> off. Not because it is really bad, but because it allows creating
> translations that aren't quite good enough, and those translations create
> fierce internal fighting between contributors.
>
> Some people use CT and make fairly good translations. Some are even
> excellent, especially some of those based on machine translations through
> the Apertium engine. Some are done manually and are usually fairly good,
> but those done with the Yandex engine are usually very poor. Sometimes it
> seems like the Yandex engine produces so many weird constructs that the
> translators simply give up, but sometimes it also seems like the most
> common errors simply pass through. I guess people simply get used to
> seeing those errors and no longer view them as "errors".
>
> Brute-force solution: turn ContentTranslation off. Really stupid
> solution. The next solution: turn the Yandex engine off. That would solve
> part of the problem. Kind of a lousy solution though.
>
> What about adding a language model that warns when the language
> constructs get too weird? It is like a "test" for the translation. CT is
> used for creating a translation, but the language model is used for
> verifying whether the translation is good enough. If it does not validate
> against the language model, it should simply not be published to the main
> namespace. It will still be possible to create a draft, but then the user
> is fully aware that the translation isn't good enough.
>
> Such a language model should be available as a test for any article, as
> it can be used as a quality measure for the article. It is really a
> quantitative measure of how well-written the article is, but that isn't
> quite so intuitive.
>
> The measure could simply be to color-code the language constructs by how
> common they are, with a white background for common constructs and yellow
> for really awful ones.
>
> It could also use hints from other measurements, like readability,
> confusion, and perplexity. Perhaps even such things as punctuation and
> markup.
>
> I believe users will get the idea pretty fast: only publish texts that
> are "white". It is a bit like tests for developers; they don't publish
> code that goes "red".
___


Re: [Wikimedia-l] machine translation

2017-05-02 Thread Pharos
I think it all depends on the level of engagement of the human translator.

When the tool is used in the right way, it is a fantastic tool.

Maybe we can find better methods to nudge people toward taking their time
and really working on their translations.

Thanks,
Pharos

On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <
bodhisattwa.rg...@gmail.com> wrote:

> Content translation with Yandex is also a problem on the Bengali
> Wikipedia. Some users have developed a tendency to create meaningless
> machine-translated articles with this extension to increase their edit
> and article counts. This has increased the workload of admins, who have
> to find and delete those articles.
>
> Yandex is not ready for many languages, and it would be better to shut it
> off. We don't need it in Bengali.
>
> Regards

Re: [Wikimedia-l] machine translation

2017-05-02 Thread Bodhisattwa Mandal
Content translation with Yandex is also a problem on the Bengali Wikipedia.
Some users have developed a tendency to create meaningless machine-translated
articles with this extension to increase their edit and article counts. This
has increased the workload of admins, who have to find and delete those
articles.

Yandex is not ready for many languages, and it would be better to shut it
off. We don't need it in Bengali.

Regards

[Wikimedia-l] What's making you happy this week? (Week of 30 April 2017)

2017-05-02 Thread Pine W
I'm happy to see the development of the Commons Photographers User Group.

Personal background story (feel free to skip reading this):

The first DSLR I touched was easy to use with the automatic settings for
indoor photography in good lighting. Based on this limited experience, I
concluded that photography with a DSLR was easy. Some time later I bought
my first DSLR and quickly got lost. The menus were not intuitive to me
as a DSLR newbie, there were new terms like "aperture" and "f-stop", the
manual was written for someone who already had good technical knowledge of
how cameras work, and my lens wouldn't focus like I wanted. Wikipedia has
some helpful articles about photography concepts, but what would have
helped me a lot is spending time with an experienced photographer. After a
few years of trial and error, and asking questions of more knowledgeable
people, I'm happy with my skill level as a photography hobbyist in a
variety of situations. I hope that the new Commons Photographers user group
will facilitate knowledge exchange, improve camaraderie, and consider ways
to improve access to equipment -- especially for photographers in
situations where resources are scarce and the potential for valuable
open-source contributions is very high.

What's making you happy this week?

Pine


Re: [Wikimedia-l] [Wikitech-l] Tech Talk: A Gentle Introduction to Wikidata for Absolute Beginners [including non-techies!]

2017-05-02 Thread Asaf Bartov
Hello again.

I'd like to let you know that thanks to Victor Grigas, the Commons video of
this talk (link below) now has English subtitles, synced to the talk.  This
should make it easier to *translate the subtitles *to make this video
useful for fellow Wikimedians in other languages.

It's a very long tutorial, so this would be a significant effort, perhaps
best undertaken by a group, or done piece by piece.  If you do complete the
translation in any language, I'd love to hear about it.

Cheers,

   Asaf

On Thu, Feb 9, 2017 at 10:32 PM Asaf Bartov  wrote:

> Here's the (3-hour) footage of the detailed Wikidata tutorial delivered
> today:
>
> on Commons:
> https://commons.wikimedia.org/wiki/File:A_Gentle_Introduction_to_Wikidata_for_Absolute_Beginners_(including_non-techies!).webm
>
> on YouTube: https://www.youtube.com/watch?v=eVrAx3AmUvA
>
> the slides:
> https://commons.wikimedia.org/wiki/File:Wikidata_-_A_Gentle_Introduction_for_Complete_Beginners_(WMF_February_2017).pdf
>
> It covers what Wikidata is (00:00), how to contribute new data to Wikidata
> (1:09:34), how to create an entirely new item on Wikidata (1:27:07), how to
> embed data from Wikidata into pages on other wikis (1:52:54), tools like
> the Wikidata Game (1:39:20), Article Placeholder (2:01:01), Reasonator
> (2:54:15) and Mix-and-match (2:57:05), and how to query Wikidata (including
> SPARQL examples) (starting 2:05:05).
>
> Share and enjoy. :)
>
>A.
>
> On Fri, Feb 3, 2017 at 4:35 PM Rachel Farrand 
> wrote:
>
>> Please join for the following talk:
>>
>> *Tech Talk**:* A Gentle Introduction to Wikidata for Absolute Beginners
>> [including non-techies!]
>> *Presenter:* Asaf Bartov
>> *Date:* February 09, 2017
>> *Time: *19:00 UTC
>> <
>> https://www.timeanddate.com/worldclock/fixedtime.html?msg=Tech+Talk%3A+A+Gentle+Introduction+to+Wikidata+for+Absolute+Beginners+%5Bincluding+non-techies%21%5D+=20170209T19=1440=3
>> >
>> Link to live YouTube stream 
>> *IRC channel for questions/discussion:* #wikimedia-office
>>
>> *Summary: *This talk will introduce you to the Wikimedia Movement's latest
>> major wiki project: Wikidata. We will cover what Wikidata is, how to
>> contribute, how to embed Wikidata into articles on other wikis, tools like
>> the Wikidata Game, and how to query Wikidata (including SPARQL examples).
>
>
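The tutorial's querying section (starting 2:05:05) uses SPARQL against the public Wikidata Query Service. As a small taste of what that looks like in practice, here is a minimal sketch; the endpoint URL is the real WQS SPARQL endpoint, but the specific query (instances of "cat", Q146) and all function names are my own illustration, not material from the talk:

```python
# Sketch: fetch a SPARQL result set from the Wikidata Query Service.
# Assumptions flagged in the lead-in: the query and helpers are illustrative.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Ten items that are instances of (P31) "cat" (Q146), with English labels.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

def build_request(query: str) -> urllib.request.Request:
    """Build a GET request asking the endpoint for JSON-formatted results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    # WQS asks clients to send a descriptive User-Agent.
    return urllib.request.Request(
        url, headers={"User-Agent": "wikidata-tutorial-example/0.1"})

def run(query: str) -> list:
    """Execute the query and return the result bindings."""
    with urllib.request.urlopen(build_request(query)) as resp:
        data = json.load(resp)
    return data["results"]["bindings"]

if __name__ == "__main__":
    for row in run(QUERY):
        print(row["item"]["value"], row["itemLabel"]["value"])
```

The network call is kept behind the `__main__` guard, so importing the module only builds the request without hitting the service.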


Re: [Wikimedia-l] Turn the extension for ContentTranslation off?

2017-05-02 Thread James Heilman
Many of the volunteers I work with really like the "Content Translation"
tool. Machine translation only exists in some languages. For many it is not
an option at all.

Yes, peer review processes are needed, and yes, doing follow-up and inviting
new people to our movement is a lot of work (I do a fair bit of it on EN WP
with respect to educational efforts). Doing so, however, is important for
our long-term existence.

James

On Tue, May 2, 2017 at 9:20 AM, John Erling Blad  wrote:

> Yes, I wonder if the extension for content translation should be turned
> off. Not because it is really bad, but because it allows creating
> translations that aren't quite good enough, and those translations create
> fierce internal fighting between contributors.
>
> Some people use CT and make fairly good translations. Some are even
> excellent, especially some of those based on machine translations through
> the Apertium engine. Some are done manually and are usually fairly good,
> but those done with the Yandex engine are usually very poor. Sometimes it
> seems like the Yandex engine produces so many weird constructs that the
> translators simply give up, but sometimes it also seems like the most
> common errors simply pass through. I guess people simply get used to
> seeing those errors and no longer view them as "errors".
>
> Brute-force solution: turn ContentTranslation off. A really stupid
> solution. The next solution: turn the Yandex engine off. That would solve
> part of the problem, but it is still a kind of lousy solution.
>
> What about adding a language model that warns when the language constructs
> get too weird? It would act as a "test" for the translation: CT is used
> for creating a translation, while the language model verifies whether the
> translation is good enough. If it does not validate against the language
> model, it should simply not be published to the main namespace. It would
> still be possible to create a draft, but then the user is fully aware
> that the translation isn't good enough.
>
> Such a language model should be available as a test for any article, as it
> can be used as a quality measure for the article. It is really a
> quantitative measure of the article's fluency, but that isn't quite so
> intuitive.
>
> The measure could simply color-code the language constructs according to
> how common they are, with common constructs on a white background and
> really awful constructs on yellow.
>
> It could also use hints from other measures, such as readability,
> confusion, and perplexity. Perhaps even things like punctuation and
> markup.
>
> I believe users will get the idea pretty fast: only publish texts that are
> "white". It is a bit like tests for developers; they don't publish code
> that goes "red".




-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine


[Wikimedia-l] Turn the extension for ContentTranslation off?

2017-05-02 Thread John Erling Blad
Yes, I wonder if the extension for content translation should be turned
off. Not because it is really bad, but because it allows creating
translations that aren't quite good enough, and those translations create
fierce internal fighting between contributors.

Some people use CT and make fairly good translations. Some are even
excellent, especially some of those based on machine translations through
the Apertium engine. Some are done manually and are usually fairly good,
but those done with the Yandex engine are usually very poor. Sometimes it
seems like the Yandex engine produces so many weird constructs that the
translators simply give up, but sometimes it also seems like the most
common errors simply pass through. I guess people simply get used to seeing
those errors and no longer view them as "errors".

Brute-force solution: turn ContentTranslation off. A really stupid
solution. The next solution: turn the Yandex engine off. That would solve
part of the problem, but it is still a kind of lousy solution.

What about adding a language model that warns when the language constructs
get too weird? It would act as a "test" for the translation: CT is used for
creating a translation, while the language model verifies whether the
translation is good enough. If it does not validate against the language
model, it should simply not be published to the main namespace. It would
still be possible to create a draft, but then the user is fully aware
that the translation isn't good enough.

Such a language model should be available as a test for any article, as it
can be used as a quality measure for the article. It is really a
quantitative measure of the article's fluency, but that isn't quite so
intuitive.

The measure could simply color-code the language constructs according to
how common they are, with common constructs on a white background and
really awful constructs on yellow.

It could also use hints from other measures, such as readability,
confusion, and perplexity. Perhaps even things like punctuation and
markup.

I believe users will get the idea pretty fast: only publish texts that are
"white". It is a bit like tests for developers; they don't publish code
that goes "red".
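To make the proposal above concrete, here is a toy sketch of the kind of check being described: a tiny add-one-smoothed bigram language model that scores a sentence's perplexity and flags rare constructs "yellow" versus common ones "white". This is entirely my own illustration of the idea (corpus, class names, and the 0.15 threshold are all arbitrary), not anything from the ContentTranslation codebase:

```python
# Toy language-model check: perplexity scoring plus white/yellow flagging
# of bigram constructs. Illustrative only; a real system would use a large
# reference corpus and a calibrated threshold.
import math
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

class BigramModel:
    def __init__(self, corpus_sentences):
        self.unigrams = Counter()
        self.pairs = Counter()
        for sent in corpus_sentences:
            toks = sent.lower().split()
            self.unigrams.update(toks)
            self.pairs.update(bigrams(toks))
        self.vocab = max(len(self.unigrams), 1)

    def prob(self, a, b):
        # Add-one (Laplace) smoothing: unseen constructs get a small,
        # non-zero probability instead of zero.
        return (self.pairs[(a, b)] + 1) / (self.unigrams[a] + self.vocab)

    def perplexity(self, sentence):
        ps = bigrams(sentence.lower().split())
        if not ps:
            return float("inf")
        log_p = sum(math.log(self.prob(a, b)) for a, b in ps)
        return math.exp(-log_p / len(ps))

    def flag(self, sentence, threshold=0.15):
        # Threshold is arbitrary for this toy corpus; it separates
        # constructs seen in the corpus from unseen ones here.
        return [((a, b), "white" if self.prob(a, b) >= threshold else "yellow")
                for a, b in bigrams(sentence.lower().split())]

model = BigramModel(["the cat sat on the mat", "the dog sat on the rug"])
fluent = model.perplexity("the cat sat on the rug")    # all bigrams seen
garbled = model.perplexity("rug the on sat cat the")   # mostly unseen
```

A publish gate in the spirit of the proposal would then be a single comparison: allow publication to the main namespace only if the perplexity is below some limit, or if no construct is flagged "yellow".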