Re: [Wikimedia-l] The case for supporting open source machine translation

2013-05-22 Thread Federico Leva (Nemo)

Erik Moeller, 24/04/2013 08:29:

[...]
Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide a high-quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.


Some info on the state of the art: 
http://laxstrom.name/blag/2013/05/22/on-course-to-machine-translation/


Nemo



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-30 Thread Chris Tophe
2013/4/29 Mathieu Stumpf 

> On 2013-04-26 20:27, Milos Rancic wrote:
>
>> OmegaWiki is a masterpiece from the perspective of one [computational]
>> linguist. Erik made the structure so well that it's the best starting
>> point for creating a contemporary multilingual dictionary. I haven't seen
>> anything better in concept. (And, yes, when I was thinking about
>> creating such software on my own, I was always at the dead end of
>> "but, OmegaWiki is already that".)
>>
>
> Where can I find documentation about this structure, please?



Here (planned structure):
http://meta.wikimedia.org/wiki/OmegaWiki_data_design

and also there (current structure):
http://www.omegawiki.org/Help:OmegaWiki_database_layout

And a gentle reminder that comments are requested ;-)
http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Mathieu Stumpf

On 2013-04-26 20:27, Milos Rancic wrote:
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein wrote:

Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- it's discouraging to anyone else who might otherwise try to
build a brilliant free structured dictionary that we are *so close* to
getting it right.


OmegaWiki is a masterpiece from the perspective of one [computational]
linguist. Erik made the structure so well that it's the best starting
point for creating a contemporary multilingual dictionary. I haven't seen
anything better in concept. (And, yes, when I was thinking about
creating such software on my own, I was always at the dead end of
"but, OmegaWiki is already that".)


Where can I find documentation about this structure, please?


--
Association Culture-Libre
http://www.culture-libre.org/



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Mathieu Stumpf

On 2013-04-26 19:57, Samuel Klein wrote:
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann wrote:

* Erik Moeller wrote:

Are there open source MT efforts that are close enough to merit
scrutiny?


Wiktionary. If you want to help free software efforts in the area of
machine translation, then what they seem to need most is high quality
data about words, word forms, and so on, in a readily machine-usable
form, and freely licensed.


Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- it's discouraging to anyone else who might otherwise try to
build a brilliant free structured dictionary that we are *so close* to
getting it right.



If you have suggestions on the Wiktionaries' future, please consider sharing 
them on https://meta.wikimedia.org/wiki/Wiktionary_future


--
Association Culture-Libre
http://www.culture-libre.org/



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Mathieu Stumpf

On 2013-04-26 17:00, Gerard Meijssen wrote:

Hoi,
When we invest in MT it is to convey knowledge, information and primarily
Wikipedia articles. They do not have the same problems poetry has. With
explanatory articles on a subject there is a web of associated concepts.
These concepts are likely to occur in any language if the subject exists in
that other language.

Consequently MT can work for Wikipedia and provide quite a solid
interpretation of what the article is about. This is helped when the
associated concepts are recognised as such and when the translations for
these concepts are used in the MT.
Thanks,
  GerardM


I think that poetry is just an easy-to-grasp example of the more
general problem of lexical/meaning entanglement, which will appear at
some point. Different cultures will have different conceptualizations of
what one may perceive. So this is not just a matter of concept sets, but
rather of concept network dynamics: how concepts interact within a
representation of the world. And interaction means combinatorial problems,
which require enormous resources.


That said, I agree that having MT help "adapt" articles from one
language/culture to another would be useful.





On 26 April 2013 10:38, Mathieu Stumpf wrote:



On 2013-04-25 20:56, Theo10011 wrote:

 As far as Linguistic typology goes, it's far too unique and too varied to
have a language independent form develop as easily. Perhaps it also depends
on the perspective. For example, the majority of people commenting here
(Americans, Europeans) might have exposure to a limited set of a linguistic
branch. Machine-translations as someone pointed out, are still not
preferred in some languages, even with years of research and potentially
unlimited resources at Google's disposal, they still come out sounding
clunky in some ways. And perhaps they will never get to the level of
absolute, where they are truly language independent.



To my mind, there's no such thing as "absolute" meaning. It's all about
interpretation in a given context by a given interpreter. I mean, I do
think that MT could probably be as good as a professional translator. But
even professional translators can't make "perfect translations". I already
gave the example of poetry, but you may also take the example of humour,
which asks for some cultural background; otherwise you have to explain why
it's funny, and you know that if you have to explain a joke, it's not a joke.


 If you read some of
the discussions in linguistic relativity (Sapir-Whorf hypothesis), there is
research to suggest that a language a person is born with dictates their
thought processes and their view of the world - there might not be
absolutes when it comes to linguistic cognition. There is something
inherently unique in the cognitive patterns of different languages.



That's just how the learning process works: you can't "understand" something
you haven't experienced. Reading an algorithm won't give you the insight
you'll get when you work through it mentally (with the help of pencil and
paper), and a textual description of "making love" won't give you the
feeling it provides.



 Which brings me to the point, why not English? Your idea seems plausible
enough even if you remove the abstract idea of complete language
universality, without venturing into the science-fiction labyrinth of
man-machine collaboration.



English has many so-called "non-neutral" problems. As far as I know, if
the goal is to use a syntactically unambiguous human language, Lojban is the
best current candidate. English as an international language is a very
harmful situation. Believe it or not, I sometimes have to translate into
English sentences which were written in French, because the writer was
thinking in an English idiom that he poorly translated into
French, his native language, in which he doesn't know the idiomatic
expression. Even worse, I have read people using concepts under an
English expression because they never matched it with the French expression
that they know. And the other way around, I'm not sure that having millions of
people speaking broken English is a wonderful situation for this language.


Search "why not english as international language" if you need more
documentation.


--
Association Culture-Libre
http://www.culture-libre.org/






--
Association Culture-Libre
http://www.culture-libre.org/


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-28 Thread Nikola Smolenski

On 26/04/13 19:38, Bjoern Hoehrmann wrote:

* Andrea Zanni wrote:

At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.


Try also Distributed Proofreaders. It is my impression that Wikisource's 
proofreading standards are not always up to par.



As part of our project (
http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate for studying the reinsertion of proofread text
into DjVu files [1], but so far hasn't found an interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.



[1] We thought about this both for OCR enhancement purposes and for updating
files on Commons and the Internet Archive (which is off topic here).


I built various tools that could be fairly easily adapted for this; my
http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
notes are available. One of the tools, for instance, is a diff tool; see the
image at .


This is a very interesting approach :)



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-27 Thread Jane Darnell
Just the thought of synchronizing Wikis makes me shudder. I think this
was also the reason that no Wikipedian editors were attracted to the
CoSyne project, though as it was explained to me the idea was that
only sections of a "source" Wikipedia article would be translated that
did not exist yet in the target article. This may be useful in the
case of a large article being the source, and a stub being the target,
but in the case where the source and the target are about equal size,
it could lead to a major mess.

In the example of the Wikipedia article on Haarlem, I noticed many of
the things lacking in the English version are things more relevant to
local people reading the Dutch version, such as local mass transit
information. The other way around, the things in the English version
that are lacking in the Dutch version are items that seem obvious to
locals.

2013/4/27, Ryu Cheol :
> Thanks to Jane for introducing CoSyne. But I feel that not all the wikis want
> to be synchronized with certain other wikis. Rather than having identical
> articles, I hope they would have their own articles. I hope I could have two
> more tabs, to the right of 'Article' and 'Talk' on the English Wikipedia, for
> the Korean language. The two tabs would be 'Article in Korean' and 'Talk in
> Korean'. The translations would carry the same information as the originals,
> and any editing on an article or a talk page in the translation pages would
> go back to the originals. In this case they would need to be synchronized
> precisely.
>
> I mean that these would be done within the scope of the English Wikipedia,
> not related to the Korean Wikipedia. But the Korean Wikipedia, linked on the
> left side of a page, would eventually benefit from the translations in the
> English Wikipedia, when a Korean Wikipedia editor finds that a good part of
> an English Wikipedia article could be inserted into the Korean Wikipedia.
>
> You can see the merits of an exact Korean translation of the English
> Wikipedia, or of a scheme of exact translations of the big Wikipedias. It
> will help you reach more potential contributors. It will lower the language
> barrier for those who want to contribute to a Wikipedia whose language they
> do not speak very well. Also, it could provide better-aligned corpora, and
> it could track how human translators or reviewers improve the translations.
>
> Cheol
>
> On 2013-04-26, at 9:04 PM, Jane Darnell wrote:
>
>> We already have the translation options on the left side of the screen
>> in any Wikipedia article.
>> This choice is generally a smattering of languages, and a long term
>> goal for many small-language Wikipedias is to be able to translate an
>> article from related languages (say from Dutch into Frisian, where the
>> Frisian Wikipedia has no article at all on the title subject) and the
>> even longer-term goal is to translate into some other
>> really-really-really foreign language.
>>
>> Wouldn't it be easier however, to start with a project that uses
>> translatewiki and the related-language pairs? Usually there is a big
>> difference in numbers of articles (like between the Dutch Wikipedia
>> and the Frisian Wikipedia). Presumably the demand is larger on the
>> destination wikipedia (because there are fewer articles in those
>> languages), and the potential number of human translators is larger
>> (because most editors active in the smaller Wikipedia are versed in
>> both languages).
>>
>> The Dutch Wikimedia chapter took part in a European multilingual
>> synchronization tool project called CoSyne:
>> http://cosyne.eu/index.php/Main_Page
>>
>> It was not a success, because it was hard to figure out how this would
>> be beneficial to Wikipedians actually joining the project. Some
>> funding that was granted to the chapter to work on the project will be
>> returned, because it was never spent.
>>
>> In order to tackle this problem on a large scale, it needs to be
>> broken down into words, sentences, paragraphs and perhaps other
>> structures (category trees?). I think CoSyne was trying to do this. I
>> think it would be easier to keep the effort in one-way-traffic, so try
>> to offer machine translation from Dutch to Frisian and not the other
>> way around, and then as you go, define concepts that work both ways,
>> so that eventually it would be possible to translate from Frisian
>> into Dutch.
>>
>> 2013/4/26, Mathieu Stumpf :
>>> On 2013-04-25 20:56, Theo10011 wrote:
 As far as Linguistic typology goes, it's far too unique and too
 varied to
 have a language independent form develop as easily. Perhaps it also
 depends
 on the perspective. For example, the majority of people commenting
 here
 (Americans, Europeans) might have exposure to a limited set of a
 linguistic
 branch. Machine-translations as someone pointed out, are still not
 preferred in some languages, even with years of research and
 potentially
 unlimited resources at Google's disposal, they still come out
 sounding
 clunky in some ways. And perhaps they will never

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Ryu Cheol
Thanks to Jane for introducing CoSyne. But I feel that not all the wikis want to 
be synchronized with certain other wikis. Rather than having identical articles, 
I hope they would have their own articles. I hope I could have two more tabs, to 
the right of 'Article' and 'Talk' on the English Wikipedia, for the Korean 
language. The two tabs would be 'Article in Korean' and 'Talk in Korean'. The 
translations would carry the same information as the originals, and any editing 
on an article or a talk page in the translation pages would go back to the 
originals. In this case they would need to be synchronized precisely.

I mean that these would be done within the scope of the English Wikipedia, not 
related to the Korean Wikipedia. But the Korean Wikipedia, linked on the left 
side of a page, would eventually benefit from the translations in the English 
Wikipedia, when a Korean Wikipedia editor finds that a good part of an English 
Wikipedia article could be inserted into the Korean Wikipedia.

You can see the merits of an exact Korean translation of the English Wikipedia, 
or of a scheme of exact translations of the big Wikipedias. It will help you 
reach more potential contributors. It will lower the language barrier for those 
who want to contribute to a Wikipedia whose language they do not speak very 
well. Also, it could provide better-aligned corpora, and it could track how 
human translators or reviewers improve the translations. 

Cheol

On 2013-04-26, at 9:04 PM, Jane Darnell wrote:

> We already have the translation options on the left side of the screen
> in any Wikipedia article.
> This choice is generally a smattering of languages, and a long term
> goal for many small-language Wikipedias is to be able to translate an
> article from related languages (say from Dutch into Frisian, where the
> Frisian Wikipedia has no article at all on the title subject) and the
> even longer-term goal is to translate into some other
> really-really-really foreign language.
> 
> Wouldn't it be easier however, to start with a project that uses
> translatewiki and the related-language pairs? Usually there is a big
> difference in numbers of articles (like between the Dutch Wikipedia
> and the Frisian Wikipedia). Presumably the demand is larger on the
> destination wikipedia (because there are fewer articles in those
> languages), and the potential number of human translators is larger
> (because most editors active in the smaller Wikipedia are versed in
> both languages).
> 
> The Dutch Wikimedia chapter took part in a European multilingual
> synchronization tool project called CoSyne:
> http://cosyne.eu/index.php/Main_Page
> 
> It was not a success, because it was hard to figure out how this would
> be beneficial to Wikipedians actually joining the project. Some
> funding that was granted to the chapter to work on the project will be
> returned, because it was never spent.
> 
> In order to tackle this problem on a large scale, it needs to be
> broken down into words, sentences, paragraphs and perhaps other
> structures (category trees?). I think CoSyne was trying to do this. I
> think it would be easier to keep the effort in one-way-traffic, so try
> to offer machine translation from Dutch to Frisian and not the other
> way around, and then as you go, define concepts that work both ways,
> so that eventually it would be possible to translate from Frisian
> into Dutch.
> 
> 2013/4/26, Mathieu Stumpf :
>> On 2013-04-25 20:56, Theo10011 wrote:
>>> As far as Linguistic typology goes, it's far too unique and too
>>> varied to
>>> have a language independent form develop as easily. Perhaps it also
>>> depends
>>> on the perspective. For example, the majority of people commenting
>>> here
>>> (Americans, Europeans) might have exposure to a limited set of a
>>> linguistic
>>> branch. Machine-translations as someone pointed out, are still not
>>> preferred in some languages, even with years of research and
>>> potentially
>>> unlimited resources at Google's disposal, they still come out
>>> sounding
>>> clunky in some ways. And perhaps they will never get to the level of
>>> absolute, where they are truly language independent.
>> 
>> To my mind, there's no such thing as "absolute" meaning. It's all about
>> interpretation in a given context by a given interpreter. I mean, I do
>> think that MT could probably be as good as a professional translator.
>> But even professional translators can't make "perfect translations". I
>> already gave the example of poetry, but you may also take the example of
>> humour, which asks for some cultural background; otherwise you have to
>> explain why it's funny, and you know that if you have to explain a joke,
>> it's not a joke.
>> 
>>> If you read some of
>>> the discussions in linguistic relativity (Sapir-Whorf hypothesis),
>>> there is
>>> research to suggest that a language a person is born with dictates
>>> their
>>> thought processes and their view of the world - there might not be
>>> absolutes when it comes to linguistic cognition. There is something
>>> inherently uniqu

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread rupert THURNER
On Wed, Apr 24, 2013 at 10:49 AM, Mathias Schindler wrote:
> On Wed, Apr 24, 2013 at 8:29 AM, Erik Moeller  wrote:
>
>
>> Could open source MT be such a strategic investment? I don't know, but
>> I'd like to at least raise the question. I think the alternative will
>> be, for the foreseeable future, to accept that this piece of
>> technology will be proprietary, and to rely on goodwill for any
>> integration that concerns Wikimedia. Not the worst outcome, but also
>> not the best one.
>>
>> Are there open source MT efforts that are close enough to merit
>> scrutiny?
>
> http://www.statmt.org/moses/ is live and kicking. Someone with a
> background in computational linguistics should have a close look at them.
>
> I would like to mention however that there are a couple of cases in
...
> In any case, I would love to see WMF engage in the topic of machine 
> translation.

Thanks a lot Erik and Mathias for this constructive input! I'd love to
see that too, and, from a volunteer standpoint, not only does financing
further development seem appealing, but training (e.g.
http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining)
also seems to be something bite-sized which might fit the wiki model and
the Wikimedia volunteer community structure quite well.

 rupert.



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Milos Rancic
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein  wrote:
> Yes.  Finding a way to capture and integrate the work OmegaWiki has
> done into a new Wikidata-powered Wiktionary would be a useful start.
> And we've already sort of claimed the space (though we are neglecting
> it) -- it's discouraging to anyone else who might otherwise try to
> build a brilliant free structured dictionary that we are *so close* to
> getting it right.

OmegaWiki is a masterpiece from the perspective of one [computational]
linguist. Erik made the structure so well that it's the best starting
point for creating a contemporary multilingual dictionary. I haven't seen
anything better in concept. (And, yes, when I was thinking about
creating such software on my own, I was always at the dead end of
"but, OmegaWiki is already that".)

On the other hand, the OmegaWiki software is from the previous decade and
requires major fixes. And, obviously, the WMF should do that.



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Milos Rancic
On Thu, Apr 25, 2013 at 4:26 PM, Denny Vrandečić wrote:
> Not just bootstrapping the content. By having the primary content be saved
> in a language independent form, and always translating it on the fly, it
> would not merely bootstrap content in different languages, but it would
> mean that editors from different languages would be working on the same
> content. The texts in the different languages are not translations of each
> other; they are all created from the same source. There would be no
> primacy of, say, English.

What we can do is make the Simple English Wikipedia more useful, write
rewrite rules from the Simple English language to a Controlled English
language, and allow filling the content of the smaller Wikipedias from
the Simple English Wikipedia. That's the only way to get anything more
useful than Google Translate output.

There are serious problems in relation to the "translation of
translation" process, and that kind of complexity is not within the range
of contemporary science. (Basically, even good machine translation is
not within the range of contemporary science. Statistical approaches are
useful for getting a basic understanding, but very bad for writing an
encyclopedia or anything else which requires correct output in the
target language.)

On the much simpler scale of conversion engines, we can see that even 1%
of errors (or manual interventions) is a serious issue for text
integrity, while translations of translations create many more
errors, whether or not there are human interventions. And
that's not acceptable for the average editor of the project in the target
language.

That said, we'd need serious linguistic work for every language added to
the system.

On the other hand, I support Erik's intention to make a free software
tool for machine translation. But note that it's just the second step
(Wikidata was the first one) on a long road.



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Samuel Klein
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann  wrote:
> * Erik Moeller wrote:
>>Are there open source MT efforts that are close enough to merit
>>scrutiny?
>
> Wiktionary. If you want to help free software efforts in the area of
> machine translation, then what they seem to need most is high quality
> data about words, word forms, and so on, in a readily machine-usable
> form, and freely licensed.

Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- it's discouraging to anyone else who might otherwise try to
build a brilliant free structured dictionary that we are *so close* to
getting it right.

><   [ Andrea's ideas about using Wikisource to improve OCR tools ]
>
> I built various tools that could be fairly easily adapted for this, my
> http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
> notes are available. One of the tools for instance is a diff tool, see
> image at .

I hope the related GSOC project gets support.  Getting mentoring from
Tesseract team members seems like a handy way to keep the projects
connected.


Tim Starling writes:
> We could basically clone the frontend component of Google Translate,
> and use Moses as a backend. The work would be mostly JavaScript...
> the next job would be to develop a corpus sharing site, hosting any
> available freely-licensed output of the frontend tool.

This would be most useful.  There are often short quick translation
projects that I would like to do through this sort of TM-capturing
interface, for which the translatewiki prep process is rather time
consuming.

We can set up a corpus sharing site now, with translatewiki - there is
already a lot of material there that could be part of it.  Different
corpora (say, encyclopedic articles v. dictionary pages v. quotes)
would need to be tagged for context.  And we could start letting
people upload their own freely licensed corpora to include as well.
We would probably want a vetting process before giving users the
import tool; or a quarantine until we had better ways to let editors
revert / bulk-modify entire imports.

SJ



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Bjoern Hoehrmann
* Andrea Zanni wrote:
>At the moment, Wikisource could be an interesting corpus and laboratory for
>improving and enhancing OCR,
>as the OCR-generated text is always proofread and corrected by humans.
>As part of our project (
>http://wikisource.org/wiki/Wikisource_vision_development), Micru was
>looking for a GSoC candidate for studying the reinsertion of proofread text
>into DjVu files [1], but so far hasn't found an interested student. We
>have some contacts with people at Google working on Tesseract, and they
>were available for mentoring.

>[1] We thought about this both for OCR enhancement purposes and for updating
>files on Commons and the Internet Archive (which is off topic here).

I built various tools that could be fairly easily adapted for this; my
http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
notes are available. One of the tools, for instance, is a diff tool; see the
image at .
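
For anyone curious what such a comparison involves, here is a tiny
illustrative sketch (not the tool mentioned above) that aligns raw OCR output
with proofread text using Python's standard difflib; the sample strings are
invented.

    # Minimal illustration: report the corrections a proofreader made to OCR text.
    import difflib

    ocr_text = "Tlie quick brown fox iumps over the lazy dog."
    proofread_text = "The quick brown fox jumps over the lazy dog."

    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Each opcode describes how to turn the OCR span into the proofread span.
            print(f"{tag:7s} OCR[{i1}:{i2}]={ocr_text[i1:i2]!r} -> "
                  f"proofread[{j1}:{j2}]={proofread_text[j1:j2]!r}")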
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Bjoern Hoehrmann
* Erik Moeller wrote:
>Are there open source MT efforts that are close enough to merit
>scrutiny? In order to be able to provide a high-quality result, you
>would need not only a motivated, well-intentioned group of people, but
>some of the smartest people in the field working on it.  I doubt we
>could more than kickstart an effort, but perhaps financial backing at
>significant scale could at least help a non-profit, open source effort
>to develop enough critical mass to go somewhere.

Wiktionary. If you want to help free software efforts in the area of
machine translation, then what they seem to need most is high quality
data about words, word forms, and so on, in a readily machine-usable
form, and freely licensed. Wiktionary does collect and mark-up this
data, but there is no easy way to get it out of Wiktionary. Fix that,
and people will build machine translation and other tools with it.
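
As a rough sketch of what "getting it out" could look like, the snippet below
scans an English Wiktionary XML dump for {{t|...}} / {{t+|...}} translation
templates. It is only an illustration under simplifying assumptions: real
Wiktionary markup is much messier than this regex assumes, the dump namespace
varies by export version, and the file name is hypothetical.

    # Rough sketch, not a finished extractor.
    import re
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # varies by dump version
    T_TEMPLATE = re.compile(r"\{\{t\+?\|([a-z-]+)\|([^|}]+)")

    def extract_translations(dump_path):
        """Yield (English headword, language code, translation) triples."""
        for _, elem in ET.iterparse(dump_path):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title") or ""
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                if ":" not in title:  # skip non-article namespaces
                    for lang, word in T_TEMPLATE.findall(text):
                        yield title, lang, word.strip()
                elem.clear()  # keep memory bounded on a multi-gigabyte dump

    if __name__ == "__main__":
        # Hypothetical dump file name.
        for headword, lang, word in extract_translations("enwiktionary-pages-articles.xml"):
            print(f"{headword}\t{lang}\t{word}")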
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Gerard Meijssen
Hoi,
When we invest in MT it is to convey knowledge, information and primarily
Wikipedia articles. They do not have the same problems poetry has. With
explanatory articles on a subject there is a web of associated concepts.
These concepts are likely to occur in any language if the subject exists in
that other language.

Consequently MT can work for Wikipedia and provide quite a solid
interpretation of what the article is about. This is helped when the
associated concepts are recognised as such and when the translations for
these concepts are used in the MT.
Thanks,
  GerardM


On 26 April 2013 10:38, Mathieu Stumpf wrote:

> On 2013-04-25 20:56, Theo10011 wrote:
>
>  As far as Linguistic typology goes, it's far too unique and too varied to
>> have a language independent form develop as easily. Perhaps it also
>> depends
>> on the perspective. For example, the majority of people commenting here
>> (Americans, Europeans) might have exposure to a limited set of a
>> linguistic
>> branch. Machine-translations as someone pointed out, are still not
>> preferred in some languages, even with years of research and potentially
>> unlimited resources at Google's disposal, they still come out sounding
>> clunky in some ways. And perhaps they will never get to the level of
>> absolute, where they are truly language independent.
>>
>
> To my mind, there's no such thing as "absolute" meaning. It's all about
> interpretation in a given context by a given interpreter. I mean, I do
> think that MT could probably be as good as a professional translator. But
> even professional translators can't make "perfect translations". I already
> gave the example of poetry, but you may also take the example of humour, which
> asks for some cultural background; otherwise you have to explain why it's
> funny, and you know that if you have to explain a joke, it's not a joke.
>
>
>  If you read some of
>> the discussions in linguistic relativity (Sapir-Whorf hypothesis), there
>> is
>> research to suggest that a language a person is born with dictates their
>> thought processes and their view of the world - there might not be
>> absolutes when it comes to linguistic cognition. There is something
>> inherently unique in the cognitive patterns of different languages.
>>
>
> That's just how the learning process works: you can't "understand" something
> you haven't experienced. Reading an algorithm won't give you the insight
> you'll get when you work through it mentally (with the help of pencil and paper),
> and a textual description of "making love" won't give you the feeling it
> provides.
>
>
>
>  Which brings me to the point, why not English? Your idea seems plausible
>> enough even if you remove the abstract idea of complete language
>> universality, without venturing into the science-fiction labyrinth of
>> man-machine collaboration.
>>
>
> English has many so-called "non-neutral" problems. As far as I know, if
> the goal is to use a syntactically unambiguous human language, Lojban is the
> best current candidate. English as an international language is a very
> harmful situation. Believe it or not, I sometimes have to translate into
> English sentences which were written in French, because the writer was
> thinking in an English idiom that he poorly translated into
> French, his native language, in which he doesn't know the idiomatic
> expression. Even worse, I have read people using concepts under an
> English expression because they never matched it with the French expression
> that they know. And the other way around, I'm not sure that having millions of
> people speaking broken English is a wonderful situation for this language.
>
> Search "why not english as international language" if you need more
> documentation.
>
>
> --
> Association Culture-Libre
> http://www.culture-libre.org/
>
>


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Jane Darnell
We already have the translation options on the left side of the screen
in any Wikipedia article.
This choice is generally a smattering of languages, and a long term
goal for many small-language Wikipedias is to be able to translate an
article from related languages (say from Dutch into Frisian, where the
Frisian Wikipedia has no article at all on the title subject) and the
even longer-term goal is to translate into some other
really-really-really foreign language.

Wouldn't it be easier however, to start with a project that uses
translatewiki and the related-language pairs? Usually there is a big
difference in numbers of articles (like between the Dutch Wikipedia
and the Frisian Wikipedia). Presumably the demand is larger on the
destination wikipedia (because there are fewer articles in those
languages), and the potential number of human translators is larger
(because most editors active in the smaller Wikipedia are versed in
both languages).

The Dutch Wikimedia chapter took part in a European multilingual
synchronization tool project called CoSyne:
http://cosyne.eu/index.php/Main_Page

It was not a success, because it was hard to figure out how this would
be beneficial to Wikipedians actually joining the project. Some
funding that was granted to the chapter to work on the project will be
returned, because it was never spent.

In order to tackle this problem on a large scale, it needs to be
broken down into words, sentences, paragraphs and perhaps other
structures (category trees?). I think CoSyne was trying to do this. I
think it would be easier to keep the effort in one-way-traffic, so try
to offer machine translation from Dutch to Frisian and not the other
way around, and then as you go, define concepts that work both ways,
so that eventually it would be possible to translate from Frisian
into Dutch.

2013/4/26, Mathieu Stumpf :
>> On 2013-04-25 20:56, Theo10011 wrote:
>> As far as Linguistic typology goes, it's far too unique and too
>> varied to
>> have a language independent form develop as easily. Perhaps it also
>> depends
>> on the perspective. For example, the majority of people commenting
>> here
>> (Americans, Europeans) might have exposure to a limited set of a
>> linguistic
>> branch. Machine-translations as someone pointed out, are still not
>> preferred in some languages, even with years of research and
>> potentially
>> unlimited resources at Google's disposal, they still come out
>> sounding
>> clunky in some ways. And perhaps they will never get to the level of
>> absolute, where they are truly language independent.
>
> To my mind, there's no such thing as "absolute" meaning. It's all about
> interpretation in a given context by a given interpreter. I mean, I do
> think that MT could probably be as good as a professional translator.
> But even professional translators can't make "perfect translations". I
> already gave the example of poetry, but you may also take the example of
> humour, which asks for some cultural background; otherwise you have to
> explain why it's funny, and you know that if you have to explain a joke,
> it's not a joke.
>
>> If you read some of
>> the discussions in linguistic relativity (Sapir-Whorf hypothesis),
>> there is
>> research to suggest that a language a person is born with dictates
>> their
>> thought processes and their view of the world - there might not be
>> absolutes when it comes to linguistic cognition. There is something
>> inherently unique in the cognitive patterns of different languages.
>
> That's just how the learning process works: you can't "understand" something
> you haven't experienced. Reading an algorithm won't give you the insight
> you'll get when you work through it mentally (with the help of pencil and
> paper), and a textual description of "making love" won't give you the
> feeling it provides.
>
>
>> Which brings me to the point, why not English? Your idea seems
>> plausible
>> enough even if you remove the abstract idea of complete language
>> universality, without venturing into the science-fiction labyrinth of
>> man-machine collaboration.
>
> English has many so-called "non-neutral" problems. As far as I know,
> if the goal is to use a syntactically unambiguous human language, Lojban
> is the best current candidate. English as an international language is a
> very harmful situation. Believe it or not, I sometimes have to
> translate into English sentences which were written in French, because the
> writer was thinking in an English idiom that he poorly
> translated into French, his native language, in which he doesn't know the
> idiomatic expression. Even worse, I have read people using
> concepts under an English expression because they never matched it with the
> French expression that they know. And the other way around, I'm not sure that
> having millions of people speaking broken English is a wonderful
> situation for this language.
>
> Search "why not english as international language" if you need more
> documentation.
>
> 

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Mathieu Stumpf

On 2013-04-25 20:56, Theo10011 wrote:
As far as Linguistic typology goes, it's far too unique and too varied to
have a language independent form develop as easily. Perhaps it also depends
on the perspective. For example, the majority of people commenting here
(Americans, Europeans) might have exposure to a limited set of a linguistic
branch. Machine-translations as someone pointed out, are still not
preferred in some languages, even with years of research and potentially
unlimited resources at Google's disposal, they still come out sounding
clunky in some ways. And perhaps they will never get to the level of
absolute, where they are truly language independent.


To my mind, there's no such thing as "absolute" meaning. It's all about
interpretation in a given context by a given interpreter. I mean, I do
think that MT could probably be as good as a professional translator.
But even professional translators can't make "perfect translations". I
already gave the example of poetry, but you may also take the example of
humour, which asks for some cultural background; otherwise you have to
explain why it's funny, and you know that if you have to explain a joke,
it's not a joke.



If you read some of
the discussions in linguistic relativity (Sapir-Whorf hypothesis), there is
research to suggest that a language a person is born with dictates their
thought processes and their view of the world - there might not be
absolutes when it comes to linguistic cognition. There is something
inherently unique in the cognitive patterns of different languages.


That's just how the learning process works: you can't "understand" something
you haven't experienced. Reading an algorithm won't give you the insight
you'll get when you work through it mentally (with the help of pencil and
paper), and a textual description of "making love" won't give you the
feeling it provides.



Which brings me to the point, why not English? Your idea seems plausible
enough even if you remove the abstract idea of complete language
universality, without venturing into the science-fiction labyrinth of
man-machine collaboration.


English has many so-called "non-neutral" problems. As far as I know,
if the goal is to use a syntactically unambiguous human language, Lojban
is the best current candidate. English as an international language is a
very harmful situation. Believe it or not, I sometimes have to
translate into English sentences which were written in French, because the
writer was thinking in an English idiom that he poorly
translated into French, his native language, in which he doesn't know the
idiomatic expression. Even worse, I have read people using
concepts under an English expression because they never matched it with the
French expression that they know. And the other way around, I'm not sure that
having millions of people speaking broken English is a wonderful
situation for this language.


Search "why not english as international language" if you need more 
documentation.


--
Association Culture-Libre
http://www.culture-libre.org/



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Leinonen Teemu
On 24.4.2013, at 9.29, Erik Moeller  wrote:
> Could open source MT be such a strategic investment?

Great idea. If we think in terms of resources, human languages are definitely a 
resource that should be kept in the commons. 

- Teemu 

--
Teemu Leinonen
http://www2.uiah.fi/~tleinone/
+358 50 351 6796
Media Lab
http://mlab.uiah.fi
Aalto University 
School of Arts, Design and Architecture
--




Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Shlomi Fish
Hi all,

On Wed, 24 Apr 2013 08:39:55 +0200
Ting Chen  wrote:

> Oh yes, this would really be great. Just think about the money the 
> Foundation gives out meanwhile for translation, plus the many many 
> volunteers' work invested into translation. A free and open translation 
> software is long overdue indeed. Great idea Erik.
> 

Unfortunately, I don't think we can expect, with the current state of the
art, that machine translation would do as good a job as a human translator,
so don't get your hopes up. For example, if we translate
http://shlomif.livejournal.com/63847.html to English with Google Translate we
get:
http://translate.google.com/translate?sl=iw&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&eotf=1&u=http%3A%2F%2Fshlomif.livejournal.com%2F63847.html&act=url

<
Yotam and "hifh own and the Geek "

I have been offered several times to participate B"hifh and the Geek "and I
refused. Those who have forgotten, this is what is said in the Bible parable of
Jotham :

And they told Jotham, he went and stood on a mountain top - Gerizim, and
lifted up his voice and called; And said to them - they heard me Shechem, and
God will hear you:

The trees went forth anointed king over them.
And they said olive Malka us!
Olive said unto them: I stopped the - fertilizers, which - I will honor God
and man - And go to the - the trees!
And the trees said to the fig: Go - the Kings of us!
The fig tree said unto them: I stopped the - sweetness, and - good yield -
And go to the - the trees!
And the trees said to the vine: Go - the Kings of us!
Vine said unto them: I stopped the - Tirosh, auspicious God and man -
And go to the - the trees!
And tell all - the trees to the - bramble: You're the king - on us!
And bramble said to the - trees: If in truth ye anoint me king over you -
come and take refuge in my shade; If - no - let fire come out - the bramble,
and devour the - cedars of Lebanon! 


It sounds incredibly awkward; the main text was taken from
http://www.heraldmag.org/literature/doc_12.htm .

So it hardly does a good job, and we cannot expect it to replace human
translations.

Regards,

Shlomi Fish

-- 
-
Shlomi Fish   http://www.shlomifish.org/
http://www.shlomifish.org/humour/Summerschool-at-the-NSA/

I don’t believe in fairies. Oops! A fairy died.
I don’t believe in fairies. Oops! Another fairy died.

Please reply to list if it's a mailing list post - http://shlom.in/reply .



Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread George Herbert
This subthread seems headed out into practical / applied epistemology, if
there is such a thing.

I am not sure if we can get from here to there; that said, a new structure
with language independent facts / information points that then got
machine-explained or described in a local language would be an interesting
structure to build an encyclopedia around.  Wikidata is a good idea but not
enough here.  I'm not sure the state of knowledge theory and practice is
good enough to do this, but I am suddenly more interested in IBM's Watson
project and some related knowledge / natural language interaction AI work...

This is very interesting, but probably less midterm-practical than machine
translation and the existing WP / other project data.


On Thu, Apr 25, 2013 at 8:46 AM, Denny Vrandečić <
denny.vrande...@wikimedia.de> wrote:

> 2013/4/25 Mathieu Stumpf 
>
> > What limits would you expect from your solution? Because you
> > can't expect to just "translate" everything. Form may be a part of the
> > meaning. It's clear that you can't translate a poem, for example. Sure,
> > Wikipedia is not primarily concerned with poetry, but it does treat the
> > subject.
> >
> >
> I don't know where the limits would be. Probably further than we think
> right now, but yes, they still would be there and severe. The nice thing is
> that we would be collecting data about the limits constantly, and could
> thus "feed" the system to further improve and grow. Not automatically (I
> guess, but bots would obviously also be allowed to work on the rules as
> well), but through human intelligence, analyzing the input and trying to
> refine and extend the rules.
>
> But, considering the already existing bot-created articles, which number in
> the hundreds of thousands in languages like Swedish, Dutch, or Polish, there
> seems to be some consensus that this can be considered as a useful starting
> block. It's just that with the current system, even with Wikidata, we
> cannot really grow into this direction further.
>
> Cheers,
> Denny
>
> --
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
> der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
> Körperschaften I Berlin, Steuernummer 27/681/51985.
>



-- 
-george william herbert
george.herb...@gmail.com


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Tim Starling
On 24/04/13 16:29, Erik Moeller wrote:
> Are there open source MT efforts that are close enough to merit
> scrutiny? In order to be able to provide a high-quality result, you
> would need not only a motivated, well-intentioned group of people, but
> some of the smartest people in the field working on it.  I doubt we
> could more than kickstart an effort, but perhaps financial backing at
> significant scale could at least help a non-profit, open source effort
> to develop enough critical mass to go somewhere.

We could basically clone the frontend component of Google Translate,
and use Moses as a backend. The work would be mostly JavaScript, which
we can do. When VisualEditor wraps up, we'll have several JavaScript
developers looking for a project.

Google Translate gathers its own parallel corpus, and does it in a way
that's accessible to non-technical bilingual speakers, so I think it's
a nice model. The quality of its translations has improved enormously
over the years, and I suppose most of that change is due to improved
training data.

If we develop it as a public-facing open source product, then other
Moses users could start using it. We could host it on GitHub, so that
if it turns out to be popular, we could let it gradually evolve away
from WMF control.

Once the frontend tool is done, the next job would be to develop a
corpus sharing site, hosting any available freely-licensed output of
the frontend tool.
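
For the backend side, here is a rough sketch of what talking to Moses could
look like (in Python rather than the JavaScript frontend described above),
assuming a mosesserver instance exposing its XML-RPC interface; the host,
port, and sample sentence are invented for the example.

    # Rough sketch only: query a locally running Moses server over XML-RPC.
    # Assumes the server was started with something like:
    #   mosesserver -f moses.ini --server-port 8080
    import xmlrpc.client

    def translate(text, url="http://localhost:8080/RPC2"):
        proxy = xmlrpc.client.ServerProxy(url)
        # The Moses server exposes a "translate" method that takes a struct
        # with the source text and returns a struct with the translated text.
        result = proxy.translate({"text": text})
        return result["text"]

    if __name__ == "__main__":
        print(translate("Wikipedia is a free encyclopedia ."))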

-- Tim Starling




Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Theo10011
On Thu, Apr 25, 2013 at 7:56 PM, Denny Vrandečić <
denny.vrande...@wikimedia.de> wrote:

> Not just bootstrapping the content. By having the primary content be saved
> in a language independent form, and always translating it on the fly, it
> would not merely bootstrap content in different languages, but it would
> mean that editors from different languages would be working on the same
> content. The texts in the different languages are not translations of each
> other; they are all created from the same source. There would be no
> primacy of, say, English.
>

This is a thought, but I've never heard of a language-independent form. I
also question its importance to your core idea versus, say, a primary language
of choice. An argument can be made that language independence on a computer
medium can't exist: down to a programming language, the instructions and
even the binary bits, there is a language running on top of higher inputs (even
transitioning between computer languages isn't at an absolute level). To
that extent, I wonder if data can truly be language independent.

As far as Linguistic typology goes, it's far too unique and too varied to
have a language independent form develop as easily. Perhaps it also depends
on the perspective. For example, the majority of people commenting here
(Americans, Europeans) might have exposure to a limited set of a linguistic
branch. Machine-translations as someone pointed out, are still not
preferred in some languages, even with years of research and potentially
unlimited resources at Google's disposal, they still come out sounding
clunky in some ways. And perhaps they will never get to the level of
absolute, where they are truly language independent. If you read some of
the discussions in linguistic relativity (Sapir-Whorf hypothesis), there is
research to suggest that a language a person is born with dictates their
thought processes and their view of the world - there might not be
absolutes when it comes to linguistic cognition. There is something
inherently unique in the cognitive patterns of different languages.

Which brings me to the point, why not English? Your idea seems plausible
enough even if you remove the abstract idea of complete language
universality, without venturing into the science-fiction labyrinth of
man-machine collaboration.

Regards
Theo


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Denny Vrandečić
2013/4/25 Mathieu Stumpf 

> What limits would you expect from your solution? Because you
> can't expect to just "translate" everything. Form may be a part of the
> meaning. It's clear that you can't translate a poem, for example. Sure,
> Wikipedia is not primarily concerned with poetry, but it does treat the
> subject.
>
>
I don't know where the limits would be. Probably further than we think
right now, but yes, they still would be there and severe. The nice thing is
that we would be collecting data about the limits constantly, and could
thus "feed" the system to further improve and grow. Not automatically (I
guess, but bots would obviously also be allowed to work on the rules as
well), but through human intelligence, analyzing the input and trying to
refine and extend the rules.

But, considering the already existing bot-created articles, which number in
the hundreds of thousands in languages like Swedish, Dutch, or Polish, there
seems to be some consensus that this can be considered as a useful starting
block. It's just that with the current system, even with Wikidata, we
cannot really grow into this direction further.

Cheers,
Denny

-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Mathieu Stumpf

On 2013-04-25 16:26, Denny Vrandečić wrote:

Erik,

2013/4/25 Erik Moeller 


> The system I am really aiming at is a different one, and there has
> been plenty of related work in this direction: imagine a wiki where you
> enter or edit content, sentence by sentence, but the natural language
> representation is just a surface syntax for an internal structure. Your
> editing interface is a constrained, but natural language. Now, in order to
> really make this fly, both the rules for the parsers (interpreting the
> input) and the serializer (creating the output) would need to be editable
> by the community - in addition to the content itself. There are a number of
> major challenges involved, but I have by now a fair idea of how to tackle
> most of them (and I don't have the time to detail them right now).

So what would you want to enable with this? Faster bootstrapping of
content? How would it work, and how would this be superior to an
approach like the one taken in the Translate extension (basically,
providing good interfaces for 1:1 translation, tracking differences
between documents, and offering MT and translation memory based
suggestions)? Are there examples of this approach being taken
somewhere else?




Not just bootstrapping the content. By having the primary content be saved
in a language-independent form, and always translating it on the fly, it
would not merely bootstrap content in different languages, but it would
mean that editors from different languages would be working on the same
content. The texts in the different languages are not translations of each
other, but they are all created from the same source. There would be no
primacy of, say, English.


What would be the limits you would expect from your solution? Because 
you can't expect to just "translate" everything. Form may be a part of 
the meaning. It's clear that you can't translate a poem, for example. Sure, 
Wikipedia is not primarily concerned with poetry, but it does treat the 
subject.



--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Denny Vrandečić
2013/4/25 Brion Vibber 

> You are blowing my mind, dude. :)
>

Glad to hear it :)


I suspect this approach won't serve for everything, but it sounds
> *awesome*. If we can tie natural-language statements directly to data nodes
> (rather than merely annotating vague references like we do today), then
> we'd be much better able to keep language versions in sync. How to make
> them sane to edit... sounds harder. :)
>

Absolutely correct, it would not serve for everything. And it doesn't have
to. For an encyclopedia we should be able to get a useful amount of
"frames" in a decent timeframe. For song lyrics, it might take a bit longer.

It would and should start with a restricted set of possible frames, but the
trick would be to make them user extensible. Because that is what we are
good at -- users who fill and extend the frameworks we provide. I don't
know of much work where the frames and rules themselves are user editable
and extensible, but heck, people said we were crazy when we made the
properties user editable and extensible in Semantic MediaWiki and later
Wikidata, and it seems to be working out.
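
To make this a bit more concrete, here is a minimal sketch in Python of what
a language-independent "frame" plus per-language serializer rules might look
like; the frame type, properties, and templates are invented for illustration
only, not an existing Wikidata or MediaWiki interface:

# A minimal, purely illustrative sketch of a language-independent "frame"
# rendered through per-language serializer rules that the community could
# edit like any other wiki content. Names and templates are invented.

frame = {
    "type": "birth_statement",
    "person": "Ada Lovelace",
    "year": 1815,
    "place": "London",
}

# One serializer rule per language; extending a language means editing
# (or adding) its template, not touching the stored frame.
serializers = {
    "en": "{person} was born in {year} in {place}.",
    "de": "{person} wurde {year} in {place} geboren.",
    "fr": "{person} est née en {year} à {place}.",
}

def render(frame, lang):
    """Turn the stored frame into a sentence of the requested language."""
    return serializers[lang].format(**frame)

for lang in ("en", "de", "fr"):
    print(render(frame, lang))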

A sane editing interface - both for the rules and the content, and their
interaction - would be something that would need to be explored first, just
to check whether this is indeed possible or just wishful thinking. Starting
without this kind of exploration beforehand would be a bit adventurous, or
optimistic.

Cheers,
Denny


-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Brion Vibber
On Thu, Apr 25, 2013 at 7:26 AM, Denny Vrandečić <
denny.vrande...@wikimedia.de> wrote:

> Not just bootstrapping the content. By having the primary content be saved
> in a language independent form, and always translating it on the fly, it
> would not merely bootstrap content in different languages, but it would
> mean that editors from different languages would be working on the same
> content. The texts in the different languages are not translations of each
> other, but they are all created from the same source. There would be no
> primacy of, say, English.
>

You are blowing my mind, dude. :)

I suspect this approach won't serve for everything, but it sounds
*awesome*. If we can tie natural-language statements directly to data nodes
(rather than merely annotating vague references like we do today), then
we'd be much better able to keep language versions in sync. How to make
them sane to edit... sounds harder. :)

It would be foolish to create any such plan without reusing tools and
> concepts from the Translate extension, translation memories, etc. There is
> a lot of UI and conceptual goodness in these tools. The idea would be to
> make them user extensible with rules.
>
>
Heck yeah!

If you want, examples of that are the bots working on some Wikipedias
> currently, creating text from structured input. They are partially reusing
> the same structured input, and need "merely" a translation in the way the
> bots create the text to save in the given Wikipedia. I have seen some
> research in the area, but the approaches all have one drawback or another;
> still, they can and should be used as inspiration and to inform the project (like
> Allegro Controlled English, or a Chat program developed at the Open
> University in Milton Keynes to allow conducting business in different
> languages, etc.)
>

Yes... make them real-time updatable instead of one-time bots producing
language which can't be maintained.

-- brion
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Denny Vrandečić
Erik,

2013/4/25 Erik Moeller 

> > The system I am really aiming at is a different one, and there has
> > been plenty of related work in this direction: imagine a wiki where you
> > enter or edit content, sentence by sentence, but the natural language
> > representation is just a surface syntax for an internal structure. Your
> > editing interface is a constrained, but natural language. Now, in order
> to
> > really make this fly, both the rules for the parsers (interpreting the
> > input) and the serializer (creating the output) would need to be editable
> > by the community - in addition to the content itself. There are a number
> of
> > major challenges involved, but I have by now a fair idea of how to tackle
> > most of them (and I don't have the time to detail them right now).
>
> So what would you want to enable with this? Faster bootstrapping of
> content? How would it work, and how would this be superior to an
> approach like the one taken in the Translate extension (basically,
> providing good interfaces for 1:1 translation, tracking differences
> between documents, and offering MT and translation memory based
> suggestions)? Are there examples of this approach being taken
> somewhere else?



Not just bootstrapping the content. By having the primary content be saved
in a language independent form, and always translating it on the fly, it
would not merely bootstrap content in different languages, but it would
mean that editors from different languages would be working on the same
content. The texts in the different languages are not translations of each
other, but they are all created from the same source. There would be no
primacy of, say, English.

It would be foolish to create any such plan without reusing tools and
concepts from the Translate extension, translation memories, etc. There is
a lot of UI and conceptual goodness in these tools. The idea would be to
make them user extensible with rules.

If you want, examples of that are the bots working on some Wikipedias
currently, creating text from structured input. They are partially reusing
the same structured input, and need "merely" a translation in the way the
bots create the text to save in the given Wikipedia. I have seen some
research in the area, but the approaches all have one drawback or another;
still, they can and should be used as inspiration and to inform the project (like
Allegro Controlled English, or a Chat program developed at the Open
University in Milton Keynes to allow conducting business in different
languages, etc.)

I hope this helps a bit.

Cheers,
Denny

 --
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Mathieu Stumpf

On 2013-04-25 04:49, George Herbert wrote:
We can't usefully help with internet access (and that's proceeding at good
pace even in the third world), but language will remain a barrier when
people get access.  In a few situations politics / firewalling is as well
(China, primarily), which is another strategic challenge.  That, however,
is political and geopolitical, and not an easy nut for WMF to crack.  Of
the three issues - net, firewalling, and language, one of them is something
we can work on.  We should think about how to work on that.  MT seems like
an obvious answer, but not the only possible one.


Do you have specific ideas in mind? Apart from having an "international 
language" and pedagogic material accessible to everyone, able to teach 
it from zero prior knowledge, I fail to see many options. Personally, I'm 
currently learning Esperanto, as I would be happy to participate in such a 
process. I'm learning Esperanto because it seems to be the most successful 
language for such a project at the moment. It's already used on official 
Chinese sites, and there's a current petition you can sign to make it an 
official European language[1].


[1] 
https://secure.avaaz.org/en/petition/Esperanto_langue_officielle_de_lUE/



--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Erik Moeller
Denny,

very good and compelling reasoning as always. I think the argument
that we can potentially do a lot for the MT space (including open
source efforts) in part by getting our own house in order on the
dictionary side of things makes a lot of sense. I don't think it
necessarily excludes investing in open source MT efforts, but Mark
makes a good point that there are already existing institutions
pouring money into promising initiatives. Let me try to understand
some of the more complex ideas outlined in your note a bit better.

> The system I am really aiming at is a different one, and there has
> been plenty of related work in this direction: imagine a wiki where you
> enter or edit content, sentence by sentence, but the natural language
> representation is just a surface syntax for an internal structure. Your
> editing interface is a constrained, but natural language. Now, in order to
> really make this fly, both the rules for the parsers (interpreting the
> input) and the serializer (creating the output) would need to be editable
> by the community - in addition to the content itself. There are a number of
> major challenges involved, but I have by now a fair idea of how to tackle
> most of them (and I don't have the time to detail them right now).

So what would you want to enable with this? Faster bootstrapping of
content? How would it work, and how would this be superior to an
approach like the one taken in the Translate extension (basically,
providing good interfaces for 1:1 translation, tracking differences
between documents, and offering MT and translation memory based
suggestions)? Are there examples of this approach being taken
somewhere else?

Thanks,
Erik

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Nikola Smolenski

On 24/04/13 12:35, Denny Vrandečić wrote:

In summary, I see four calls for action right now (and for all of them this
means to first actually think more and write down a project plan and gather
input on that), that could and should be tackled in parallel if possible:
I ) develop  a structured Wiktionary
II ) develop a feature that blends into Wikipedia's search if an article
about a topic does not exist yet, but we  have data on Wikidata about that
topic
III ) develop a multilingual search, tagging, and structuring environment
for Commons
IV ) develop structured Wiki content using natural language as a surface
syntax, with extensible parsers and serializers

None of these goals would require tens of millions or decades of research
and development. I think we could have an actionable plan developed within
a month or two for all four goals, and my gut feeling is we could reach
them all by 2015 or 16, depending when we actually start with implementing
them.


I fully support this, though! This is fully within Wikimedia's current 
infrastructure, and generally was planned to be done anyway.


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Nikola Smolenski

On 24/04/13 12:35, Denny Vrandečić wrote:

Current machine translation research aims at using massive machine learning
supported systems. They usually require big parallel corpora. We do not
have big parallel corpora (Wikipedia articles are not translations of each
other, in general), especially not for many languages, and there is no


Could you define "big"? If 10% of Wikipedia articles are translations of 
each other, we have 2 million translation pairs. Assuming ten sentences 
per average article, this is 20 million sentence pairs. An average 
Wikipedia with 100,000 articles would have 10,000 translations and 
100,000 sentence pairs; a large Wikipedia with 1,000,000 articles would 
have 100,000 translations and 1,000,000 sentence pairs - is this not 
enough to kickstart a massive machine learning supported system? 
(Consider also that the articles are somewhat similar in structure and 
less rich than general text - future tense is rarely used for example.)
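
As a rough back-of-the-envelope check, a sketch in Python; the 10% overlap
and ten-sentences-per-article figures are the assumptions stated above, not
measured values:

# Back-of-the-envelope estimate of parallel sentence pairs available if a
# fraction of articles were translations of each other. The overlap ratio
# and sentences-per-article figure are assumptions, not measured values.

def sentence_pairs(articles, overlap=0.10, sentences_per_article=10):
    """Estimate aligned article and sentence pairs for a wiki of a given size."""
    translated_articles = int(articles * overlap)
    return translated_articles, translated_articles * sentences_per_article

for size in (100_000, 1_000_000, 20_000_000):
    pairs, sentences = sentence_pairs(size)
    print(f"{size:>10} articles -> {pairs:>9} article pairs, "
          f"{sentences:>10} sentence pairs")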


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Ryu Cheol
Thank you, Denny; I learned a lot about where you are going with Wikidata.

I am a Korean Wikipedia contributor. I definitely agree with Erik that we have 
to tackle the problem of information disparity between languages. But I feel there 
are better choices than investing in open source machine translation itself. 
Wikipedia content can be reused for commercial purposes, and we know that helps 
spread Wikipedia; I think this is the same case. If proprietary machine 
translation could help get rid of the language barrier, that would be great as 
well. I hope we can support any machine translation development team, not only 
open source ones. But I believe that in the end open source machine translation 
will prevail.

Wikidata-based approaches are great! But I hope Wikipedia could do more, 
including providing well-aligned parallel corpora. I looked into Google's 
translation workbench, which tried to provide a customized translation tool for 
Wikipedia, and tried to translate a few English articles into Korean myself. The 
tool has a translation memory and a customizable dictionary, but it lacked many 
features needed for practical translation, and the interface was clumsy. 

I believe translatewiki.net could do better than Google. I hope translatewiki 
could provide a translation workbench not just for messages in software but for 
Wikipedia articles. Through the workbench, we could collect more valuable data 
in addition to a parallel corpus: we can track how a human translator works. If 
we have more data on editing activity, we can improve the translation workflow 
and get new clues for automatic translation. 
The translator will start from a stub and improve the draft; peer reviewers 
will look over the draft and make it better. I mean that logs of collaborative 
translation on a parallel corpus could provide more things to learn from. 

I think the Wikipedia community could start an initiative to provide raw 
materials for machine translation learning. Those would be a common asset for 
machine translation systems. 
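
To illustrate, a minimal sketch of the kind of record such a workbench could
log per translated segment; the field names are hypothetical, not an existing
translatewiki.net schema:

# Hypothetical log record for one translated segment in a wiki translation
# workbench. Field names are illustrative only, not an existing schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SegmentLog:
    source_lang: str
    target_lang: str
    source_text: str
    machine_draft: str            # suggestion shown to the translator
    revisions: List[str] = field(default_factory=list)  # successive human edits

    @property
    def final_text(self) -> str:
        return self.revisions[-1] if self.revisions else self.machine_draft

log = SegmentLog(
    source_lang="en",
    target_lang="ko",
    source_text="The cheetah is the fastest land animal.",
    machine_draft="치타는 가장 빠른 육상 동물이다.",
    revisions=["치타는 가장 빠른 육상 동물이다."],  # accepted as-is, or edited later
)

# The (source_text, final_text) pairs form a parallel corpus; the revision
# history shows where machine output needed human correction.
print(log.source_text, "->", log.final_text)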

Best regards

RYU Cheol
Chair of Wikimedia Korea Preparation Committee


On 2013. 4. 24., at 7:35 PM, Denny Vrandečić wrote:

> Erik, all,
> 
> sorry for the long mail.
> 
> Incidentally, I have been thinking in this direction myself for a while,
> and I have come to a number of conclusions:
> 1) the Wikimedia movement can not, in its current state, tackle the problem
> of machine translation of arbitrary text from and to all of our supported
> languages
> 2) the Wikimedia movement is probably the single most important source of
> training data already. Research that I have done with colleagues based on
> Wikimedia corpora as training data easily beat other corpora, and others
> are using Wikimedia corpora routinely already. There is not much we can
> improve here, actually
> 3) Wiktionary could be an even more amazing resource if we would finally
> tackle the issue of structuring its content more appropriately. I think
> Wikidata opened a few venues to structure planning in this direction and
> provide some software, but this would have the potential to provide more
> support for any external project than many other things we could tackle
> 
> Looking at the first statement, there are two ways we could constrain it to
> make it possibly feasible:
> a) constrain the number of supported languages. Whereas this would be
> technically the simpler solution, I think there is agreement that this is
> not in our interest at all
> b) constrain the kind of input text we want to support
> 
> If we constrain b) a lot, we could just go and develop "pages to display
> for pages that do not exist yet based on Wikidata" in the smaller
> languages. That's a far cry from machine translating the articles, but it
> would be a low hanging fruit. And it might help with a desire which is
> evidently strongly expressed by the mass creation of articles through bots
> in a growing number of languages. Even more constraints would still allow
> us to use Wikidata items for tagging and structuring Commons in a
> language-independent way (this was suggested by Erik earlier).
> 
> Current machine translation research aims at using massive machine learning
> supported systems. They usually require big parallel corpora. We do not
> have big parallel corpora (Wikipedia articles are not translations of each
> other, in general), especially not for many languages, and there is no
> reason to believe this is going to change. I would question if we want to
> build an infrastructure for gathering those corpora from the Web
> continuously. I do not think we can compete in this arena, or that is the
> best use of our resources to support projects in this area. We should use
> our unique features to our advantage.
> 
> How can we use the unique features of the Wikimedia movement to our
> advantage? What are our unique features? Well, obviously, the awesome
> community we are. Our technology, as amazing a

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread George Herbert
Leslie Carr wrote (personally, not officially):

I think that while supporting open source machine translation is an
> awesome goal, it is out of scope of our budget and the engineering
> budget could be better spent elsewhere, such as with completing
> existing tools that are in development, but not
> deployed/optimized/etc.  I think that putting a bunch of money into
> possibilities isn't the right thing to do when we have a lot of
> projects that need to be finished and deployed yesterday.  Maybe once
> there's a closer actual project we could support them with text
> streams, decommissioned machines, and maybe money, but only after it's
> a pretty sure "investment"


I don't think that it's a good idea to shift resources to it immediately,
but I think that every now and then it's very healthy to step back and ask
"What is standing between our users and the information they seek?  What is
standing between our editors and the information they want to update?".
 Generically, the customers and customer goals problem, applied to WMF's
two customer sets (readers, and editors).

Minor UI changes help readers.  Most of the other changes are
editor-focused, retention or ease of editing or various other things
related to that.  A few are strategic-data-organization related which are
more of a multiplier effect.

The readers and potential readers ARE however clearly disadvantaged by
translation issues.

I see this discussion and consideration as strategic; not planning (year,
six month) timescales or tactical (month, week) timescales, but a
multi-year "What are our main goals for information access?" timescale.

We can't usefully help with internet access (and that's proceeding at good
pace even in the third world), but language will remain a barrier when
people get access.  In a few situations politics / firewalling is as well
(China, primarily), which is another strategic challenge.  That, however,
is political and geopolitical, and not an easy nut for WMF to crack.  Of
the three issues - net, firewalling, and language, one of them is something
we can work on.  We should think about how to work on that.  MT seems like
an obvious answer, but not the only possible one.




On Wed, Apr 24, 2013 at 12:29 PM, Leslie Carr  wrote:

> (FYI this is me speaking with my personal hat on, none of these
> opinions are official in any way or the opinions of the foundation as
> an organization)
>
> 
>
> >
> > While Wikimedia is still only a medium-sized organization, it is not
> > poor. With more than 1M donors supporting our mission and a cash
> > position of $40M, we do now have a greater ability to make strategic
> > investments that further our mission, as communicated to our donors.
> > That's a serious level of trust and not to be taken lightly, either by
> > irresponsibly spending, or by ignoring our ability to do good.
> >
> > Could open source MT be such a strategic investment? I don't know, but
> > I'd like to at least raise the question. I think the alternative will
> > be, for the foreseeable future, to accept that this piece of
> > technology will be proprietary, and to rely on goodwill for any
> > integration that concerns Wikimedia. Not the worst outcome, but also
> > not the best one.
>
> I think that while supporting open source machine translation is an
> awesome goal, it is out of scope of our budget and the engineering
> budget could be better spent elsewhere, such as with completing
> existing tools that are in development, but not
> deployed/optimized/etc.  I think that putting a bunch of money into
> possibilities isn't the right thing to do when we have a lot of
> projects that need to be finished and deployed yesterday.  Maybe once
> there's a closer actual project we could support them with text
> streams, decommissioned machines, and maybe money, but only after it's
> a pretty sure "investment"
>
> 
>
> Leslie
>
> >
> > Are there open source MT efforts that are close enough to merit
> > scrutiny? In order to be able to provide high quality result, you
> > would need not only a motivated, well-intentioned group of people, but
> > some of the smartest people in the field working on it.  I doubt we
> > could more than kickstart an effort, but perhaps financial backing at
> > significant scale could at least help a non-profit, open source effort
> > to develop enough critical mass to go somewhere.
> >
> > All best,
> > Erik
> >
> > [1]
> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
> > [2] https://developers.google.com/translate/v2/pricing
> > --
> > Erik Möller
> > VP of Engineering and Product Development, Wikimedia Foundation
> >
> > Wikipedia and our other projects reach more than 500 million people every
> > month. The world population is estimated to be >7 billion. Still a long
> > way to go. Support us. Join us. Share: https://wikimediafoundation.org/
> >
> > ___
> > Wikimedia-l mailing list
> > Wikimedia-l@lists.

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Leslie Carr
(FYI this is me speaking with my personal hat on, none of these
opinions are official in any way or the opinions of the foundation as
an organization)



>
> While Wikimedia is still only a medium-sized organization, it is not
> poor. With more than 1M donors supporting our mission and a cash
> position of $40M, we do now have a greater ability to make strategic
> investments that further our mission, as communicated to our donors.
> That's a serious level of trust and not to be taken lightly, either by
> irresponsibly spending, or by ignoring our ability to do good.
>
> Could open source MT be such a strategic investment? I don't know, but
> I'd like to at least raise the question. I think the alternative will
> be, for the foreseeable future, to accept that this piece of
> technology will be proprietary, and to rely on goodwill for any
> integration that concerns Wikimedia. Not the worst outcome, but also
> not the best one.

I think that while supporting open source machine translation is an
awesome goal, it is out of scope of our budget and the engineering
budget could be better spent elsewhere, such as with completing
existing tools that are in development, but not
deployed/optimized/etc.  I think that putting a bunch of money into
possibilities isn't the right thing to do when we have a lot of
projects that need to be finished and deployed yesterday.  Maybe once
there's a closer actual project we could support them with text
streams, decommissioned machines, and maybe money, but only after it's
a pretty sure "investment"



Leslie

>
> Are there open source MT efforts that are close enough to merit
> scrutiny? In order to be able to provide high quality result, you
> would need not only a motivated, well-intentioned group of people, but
> some of the smartest people in the field working on it.  I doubt we
> could more than kickstart an effort, but perhaps financial backing at
> significant scale could at least help a non-profit, open source effort
> to develop enough critical mass to go somewhere.
>
> All best,
> Erik
>
> [1] 
> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
> [2] https://developers.google.com/translate/v2/pricing
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> Wikipedia and our other projects reach more than 500 million people every
> month. The world population is estimated to be >7 billion. Still a long
> way to go. Support us. Join us. Share: https://wikimediafoundation.org/
>
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



--
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Samuel Klein
I really like Erik's original suggestion, and these ideas, Denny.

Since there are many different possible goals, it's worth having a
page just to list all of the possible different goals and compare them
- both how they fit with one another and how they fit with existing
active projects elsewhere on the web.

SJ

On Wed, Apr 24, 2013 at 6:35 AM, Denny Vrandečić
 wrote:
> Erik, all,
>
> sorry for the long mail.
>
> Incidentally, I have been thinking in this direction myself for a while,
> and I have come to a number of conclusions:
> 1) the Wikimedia movement can not, in its current state, tackle the problem
> of machine translation of arbitrary text from and to all of our supported
> languages
> 2) the Wikimedia movement is probably the single most important source of
> training data already. Research that I have done with colleagues based on
> Wikimedia corpora as training data easily beat other corpora, and others
> are using Wikimedia corpora routinely already. There is not much we can
> improve here, actually
> 3) Wiktionary could be an even more amazing resource if we would finally
> tackle the issue of structuring its content more appropriately. I think
> Wikidata opened a few venues to structure planning in this direction and
> provide some software, but this would have the potential to provide more
> support for any external project than many other things we could tackle
>
> Looking at the first statement, there are two ways we could constrain it to
> make it possibly feasible:
> a) constrain the number of supported languages. Whereas this would be
> technically the simpler solution, I think there is agreement that this is
> not in our interest at all
> b) constrain the kind of input text we want to support
>
> If we constrain b) a lot, we could just go and develop "pages to display
> for pages that do not exist yet based on Wikidata" in the smaller
> languages. That's a far cry from machine translating the articles, but it
> would be a low hanging fruit. And it might help with a desire which is
> evidently strongly expressed by the mass creation of articles through bots
> in a growing number of languages. Even more constraints would still allow
> us to use Wikidata items for tagging and structuring Commons in a
> language-independent way (this was suggested by Erik earlier).
>
> Current machine translation research aims at using massive machine learning
> supported systems. They usually require big parallel corpora. We do not
> have big parallel corpora (Wikipedia articles are not translations of each
> other, in general), especially not for many languages, and there is no
> reason to believe this is going to change. I would question if we want to
> build an infrastructure for gathering those corpora from the Web
> continuously. I do not think we can compete in this arena, or that is the
> best use of our resources to support projects in this area. We should use
> our unique features to our advantage.
>
> How can we use the unique features of the Wikimedia movement to our
> advantage? What are our unique features? Well, obviously, the awesome
> community we are. Our technology, as amazing as it is, running our Websites
> on the given budget, is nevertheless not what makes us what we are. Most
> processes on the Wikimedia projects are developed in the community space,
> and not implemented in bits. To summon Lessing, if code is law, Wikimedia
> projects are really good in creating a space that allows for a community to
> live in this space and have the freedom to create their own ecosystem.
>
> One idea I have been mulling over for years is basically how can we use
> this advantage for the task of creating content available in many
> languages. Wikidata is an obvious attempt at that, but it really goes only
> so far. The system I am really aiming at is a different one, and there has
> been plenty of related work in this direction: imagine a wiki where you
> enter or edit content, sentence by sentence, but the natural language
> representation is just a surface syntax for an internal structure. Your
> editing interface is a constrained, but natural language. Now, in order to
> really make this fly, both the rules for the parsers (interpreting the
> input) and the serializer (creating the output) would need to be editable
> by the community - in addition to the content itself. There are a number of
> major challenges involved, but I have by now a fair idea of how to tackle
> most of them (and I don't have the time to detail them right now). Wikidata
> had some design decision inside it that are already geared towards enabling
> the solution for some of the problems for this kind of wiki. Whatever a
> structured Wiktionary would look like, it should also be aligned with the
> requirements of the project outlined here. Basically, we take constraint b,
> but make it possible to push the constraint further and further through the
> community - that's how we could scale on this task.
>
> This would be far

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mathieu Stumpf

On 2013-04-24 12:35, Denny Vrandečić wrote:
3) Wiktionary could be an even more amazing resource if we would finally
tackle the issue of structuring its content more appropriately. I think
Wikidata opened a few venues to structure planning in this direction and
provide some software, but this would have the potential to provide more
support for any external project than many other things we could tackle


If you have any information or ideas related to Wiktionary structuring, 
please share them on https://meta.wikimedia.org/wiki/Wiktionary_future



One idea I have been mulling over for years is basically how can we use
this advantage for the task of creating content available in many
languages. Wikidata is an obvious attempt at that, but it really goes only
so far. The system I am really aiming at is a different one, and there has
been plenty of related work in this direction: imagine a wiki where you
enter or edit content, sentence by sentence, but the natural language
representation is just a surface syntax for an internal structure.


I don't understand what you mean. To begin with, I doubt that the sentence 
is the right scale at which to translate natural language discourse. Sure, 
sometimes you may translate one word with one word in another language. 
Sometimes you may translate a sentence with one sentence. Sometimes you 
need to grab the whole paragraph, or even more, and sometimes you need a 
whole cultural background to get the meaning of a single word in the 
current context. To my mind, natural languages deal with more than 
context-free languages can capture. Could a static "internal structure" 
deal with such dynamics?



Your
editing interface is a constrained, but natural language.


This is really where I don't see how you hope to manage that.


Now, in order to
really make this fly, both the rules for the parsers (interpreting the
input) and the serializer (creating the output) would need to be editable
by the community - in addition to the content itself. There are a number of
major challenges involved, but I have by now a fair idea of how to tackle
most of them (and I don't have the time to detail them right now).


Well, I'd be curious to have more information, like references I should 
read. Otherwise I'm afraid that what you say sounds like Fermat's 
Last Theorem[1] and the famous margin which was too small to contain 
Fermat's alleged proof of his "last theorem".



[1] https://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Andrea Zanni
On Wed, Apr 24, 2013 at 2:04 PM, Mathieu Stumpf <
psychosl...@culture-libre.org> wrote:

> I would like to add that (I'm no specialist of this subject) translating
> natural language probably needs at least a large set of existing
> translations, at least to get rid of "obvious well known" idioms like
> "kitchen sink" translated as "usine à gaz" when you are speaking of software,
> for example. In this regard, we probably have such a base with Wikisource.
> What do you think?


Personally, I think this is an awesome idea :-)
Wikisource corpora could be a huge asset in developing this.
We already host different public domain translations, and in the future, we
hope, more and more Wikisources will allow user generated translations.

At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.
As part of our project (
http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate for studying the reinsertion of proofread text
into DjVu files [1], but so far hasn't found an interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.

Aubrey

[1] We thought about this both for OCR enhancement purposes and for updating
files on Commons and the Internet Archive (which is off topic here).
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Fred Bauder
> On 24/04/13 08:29, Erik Moeller wrote:
>> Could open source MT be such a strategic investment? I don't know, but
>> I'd like to at least raise the question. I think the alternative will
>> be, for the foreseeable future, to accept that this piece of
>> technology will be proprietary, and to rely on goodwill for any
>> integration that concerns Wikimedia. Not the worst outcome, but also
>> not the best one.
>>
>> Are there open source MT efforts that are close enough to merit
>> scrutiny? In order to be able to provide high quality result, you
>> would need not only a motivated, well-intentioned group of people, but
>> some of the smartest people in the field working on it.  I doubt we
>> could more than kickstart an effort, but perhaps financial backing at
>> significant scale could at least help a non-profit, open source effort
>> to develop enough critical mass to go somewhere.
>
> A huge and worthwhile effort on its own, and anyway a necessary step for
> creating free MT software, would be to build a free (as in freedom)
> parallel translation corpus. This corpus could then be used as the
> starting point by people and groups who are producing free MT software,
> either under WMF or on their own.
>
> This could be done by creating a new project where volunteers could
> compare Wikipedia articles and other free translated texts and mark
> sentences that are translations of other sentences. By the way, I
> believe Google Translate's corpus was created in this way.
>
> Perhaps this could be best achieved by teaming with www.zooniverse.org
> or www.pgdp.net who have experience in this kind of projects. This would
> require specialized non-wiki software, and I don't think that the
> Foundation has enough experience in developing it.
>
> (By the way, similar things that could be similarly useful include free
> OCR training data or free fully annotated text.)

The Bible is quite good for this.

Fred



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mathieu Stumpf

On 2013-04-24 08:29, Erik Moeller wrote:

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.


I would like to add that (I'm no specialist of this subject) 
translating natural language probably needs at least a large set of 
existing translations, at least to get rid of "obvious well known" 
idioms like "kitchen sink" translated as "usine à gaz" when you are 
speaking of software, for example. In this regard, we probably have 
such a base with Wikisource. What do you think?





All best,
Erik

[1]
http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be >7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Nikola Smolenski

On 24/04/13 08:29, Erik Moeller wrote:

Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.


A huge and worthwhile effort on its own, and anyway a necessary step for 
creating free MT software, would be to build a free (as in freedom) 
parallel translation corpus. This corpus could then be used as the 
starting point by people and groups who are producing free MT software, 
either under WMF or on their own.


This could be done by creating a new project where volunteers could 
compare Wikipedia articles and other free translated texts and mark 
sentences that are translations of other sentences. By the way, I 
believe Google Translate's corpus was created in this way.
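
To illustrate what volunteers (or a bot preparing candidates for them) would
be confirming, a rough length-based alignment sketch in the spirit of
Gale-Church; the example texts and threshold are invented, and real alignment
would of course need much more than sentence length:

# Minimal sketch: propose candidate sentence pairs between two article
# versions by comparing relative sentence lengths (Gale-Church style).
# Thresholds and example texts are invented; volunteers would confirm or
# reject each proposed pair, yielding a free parallel corpus.

def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def propose_pairs(source_text, target_text, max_ratio=1.6):
    """Pair sentences in order when their lengths are roughly comparable."""
    pairs = []
    src, tgt = split_sentences(source_text), split_sentences(target_text)
    for s, t in zip(src, tgt):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        if ratio <= max_ratio:          # similar length: likely a translation
            pairs.append((s, t))
    return pairs

en = "The cheetah is the fastest land animal. It lives in Africa."
fr = "Le guépard est l'animal terrestre le plus rapide. Il vit en Afrique."
for s, t in propose_pairs(en, fr):
    print(f"{s!r}  <->  {t!r}")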


Perhaps this could be best achieved by teaming with www.zooniverse.org 
or www.pgdp.net, who have experience in this kind of project. This would 
require specialized non-wiki software, and I don't think that the 
Foundation has enough experience in developing it.


(By the way, similar things that could be similarly useful include free 
OCR training data or free fully annotated text.)


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Pavlo Shevelo
> only someone or something
that understands language can translate perfectly

Precisely

> crude translations into little used languages are nearly
worthless due to syntax issues. Useful work requires at least one person
fluent in the language

It's very true!
Current Google MT tools are reasonably good for readers, as they really
provide a chance to grasp the meaning of a text, but they fall far short as
a writer's instrument: the translation results are nowhere near good enough
to be published.

So it seems reasonable to promote MT as an instrument for the visitors
(readers) of our projects, but not as a substitute for the Wikimedians who
are the contributors.


On Wed, Apr 24, 2013 at 2:03 PM, Fred Bauder  wrote:

> This is closely tied to software which is being developed, some of it
> secretly, to enable machines to understand and use language. As of now
> this will be government and corporate owned and controlled. I say closely
> tied because that is how translation works; only someone or something
> that understands language can translate perfectly.
>
> That said, crude translations into little used languages are nearly
> worthless due to syntax issues. Useful work requires at least one person
> fluent in the language.
>
> Fred
>
>
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Andrea Zanni
> I've stated before why I disagree with this characterization, and I
> reject this framing. Functionality like the Visual Editor, the mobile
> site improvements, Lua, and other core engineering initiatives aren't
> limited in their impact to Wikipedia. The recent efforts on mobile
> uploading are actually focused on Commons. Deploying new software
> every two weeks and continually making key usability improvements is
> not what neglect looks like.
>

Thank you, Erik, for your response.
I don't agree with all of your points, but it's refreshing to see that
there has been what seems to be a lot of thought behind this.
Often we (the 15 active users of sister projects) just feel that nobody cares
about the sister projects, and attention, thought, and answers are sometimes
enough.

Anyway, I would just add that one of the major problems, I think,
is that when we think of the "human knowledge" in ("*Imagine a world in
which every single person on the planet* is given free access to the sum of
all human knowledge"),
we probably just think of "human knowledge in the form of neutral
encyclopedic articles", which, in fact, is not true.

I feel that we could do a lot to boost the idea of a "family of projects", of
an integrated, global, comprehensive approach to knowledge.
Right now, the fact is that Wikipedia both attracts and cannibalizes users
to/from the sister projects, which are pretty much invisible if you don't
know they exist.

Could we promote our sister projects better, making them more visible?
For this purpose, user Micru and I just created an RfC for interproject
links:
https://meta.wikimedia.org/wiki/Requests_for_comment/Interproject_links_interface
(I invite you all to propose other solutions), but the underlying question
is whether we, as the Wikimedia community, are aware of the "theoretical"
shift this implies.

Aubrey
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Andrew Gray
On 24 April 2013 11:35, Denny Vrandečić  wrote:

> If we constrain b) a lot, we could just go and develop "pages to display
> for pages that do not exist yet based on Wikidata" in the smaller
> languages. That's a far cry from machine translating the articles, but it
> would be a low hanging fruit. And it might help with a desire which is
> evidently strongly expressed by the mass creation of articles through bots
> in a growing number of languages.

There has historically been a lot of tension around mass-creation of
articles because of the maintenance problem - we can create two
hundred thousand stubs in Tibetan or Tamil, but who will maintain
them? Wikidata gives us the potential of squaring that circle, and in
fact you bring it up here...

> II ) develop a feature that blends into Wikipedia's search if an article
> about a topic does not exist yet, but we  have data on Wikidata about that
> topic

I think this would be amazing. A software hook that says "we know X
article does not exist yet, but it is matched to Y topic on Wikidata"
and pulls out core information, along with a set of localised
descriptions... we gain all the benefit of having stub articles
(scope, coverage) without the problems of a small community having to
curate a million pages. It's not the same as hand-written content, but
it's immeasurably better than no content, or even an attempt at
machine-translating free text.

XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
Vietnam]. It [grows to: 20 cm]. (pictures)

Wikidata Phase 4, perhaps :-)
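
To sketch what such a hook might do (all item data, property names, and the
phrase pattern below are invented placeholders, not real Wikidata statements
or an existing MediaWiki API):

# Hypothetical sketch of the "article missing, but Wikidata knows the topic"
# hook described above. Item data, property names, and the phrase pattern
# are invented placeholders; a real implementation would read statements
# and localised labels from Wikidata.

ITEMS = {                      # stand-in for Wikidata items and their claims
    "Pangasius sanitwongsei": {
        "instance_of": "species of fish",
        "family": "Pangasiidae",
        "found_in": ["Laos", "Vietnam"],
        "length_cm": 20,
    },
}

EXISTING_ARTICLES = set()      # this wiki has no article on the topic yet

PHRASE_PATTERNS = {            # localised patterns, editable per language
    "en": ("{title} is a {instance_of} in the family {family}. "
           "It is found in {found_in}. It grows to {length_cm} cm."),
}

def search_fallback(title, lang):
    """Return a generated stub if the article is missing but data exists."""
    if title in EXISTING_ARTICLES or title not in ITEMS:
        return None
    item = dict(ITEMS[title])
    item["found_in"] = ", ".join(item["found_in"])
    return PHRASE_PATTERNS[lang].format(title=title, **item)

print(search_fallback("Pangasius sanitwongsei", "en"))
# -> Pangasius sanitwongsei is a species of fish in the family Pangasiidae.
#    It is found in Laos, Vietnam. It grows to 20 cm.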

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Fred Bauder
This is closely tied to software which is being developed, some of it
secretly, to enable machines to understand and use language. As of now
this will be government and corporate owned and controlled. I say closely
tied because that is how translation works; only someone or something
that understands language can translate perfectly.

That said, crude translations into little used languages are nearly
worthless due to syntax issues. Useful work requires at least one person
fluent in the language.

Fred

>
> Could open source MT be such a strategic investment? I don't know, but
> I'd like to at least raise the question. I think the alternative will
> be, for the foreseeable future, to accept that this piece of
> technology will be proprietary, and to rely on goodwill for any
> integration that concerns Wikimedia. Not the worst outcome, but also
> not the best one.
>
> Are there open source MT efforts that are close enough to merit
> scrutiny? In order to be able to provide high quality result, you
> would need not only a motivated, well-intentioned group of people, but
> some of the smartest people in the field working on it.  I doubt we
> could more than kickstart an effort, but perhaps financial backing at
> significant scale could at least help a non-profit, open source effort
> to develop enough critical mass to go somewhere.
>
> All best,
> Erik
>
> [1]
> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
> [2] https://developers.google.com/translate/v2/pricing
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> Wikipedia and our other projects reach more than 500 million people every
> month. The world population is estimated to be >7 billion. Still a long
> way to go. Support us. Join us. Share: https://wikimediafoundation.org/
>
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mark

A brief addendum,

On 4/24/13 12:25 PM, Mark wrote:
From 2006 through 2012 [the ERC] allocated about $10m to kickstart 
open-source MT, though focused primarily on European languages, via 
the EuroMatrix (2006-09) and EuroMatrixPlus (2009-12) research projects.


Missed some projects. Seems the European Research Council is *really* 
pushing for this, with more like $20-25m overall. A few FP7 projects 
that may be useful to us:


* Let's MT!, which is supposed to organize resources to help organizations &
companies build their own MT systems on open data and software, reducing
reliance on closed-source cloud providers.


* MosesCore, focused mainly on improving Moses itself.


* The Multilingual Europe Technology Alliance, a giant consortium that seems
to have a commitment to liberal licensing.



-Mark


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Denny Vrandečić
Erik, all,

sorry for the long mail.

Incidentally, I have been thinking in this direction myself for a while,
and I have come to a number of conclusions:
1) the Wikimedia movement can not, in its current state, tackle the problem
of machine translation of arbitrary text from and to all of our supported
languages
2) the Wikimedia movement is probably the single most important source of
training data already. Research that I have done with colleagues based on
Wikimedia corpora as training data easily beat other corpora, and others
are using Wikimedia corpora routinely already. There is not much we can
improve here, actually
3) Wiktionary could be an even more amazing resource if we would finally
tackle the issue of structuring its content more appropriately. I think
Wikidata opened a few venues to structure planning in this direction and
provide some software, but this would have the potential to provide more
support for any external project than many other things we could tackle

Looking at the first statement, there are two ways we could constrain it to
make it possibly feasible:
a) constrain the number of supported languages. Whereas this would be
technically the simpler solution, I think there is agreement that this is
not in our interest at all
b) constrain the kind of input text we want to support

If we constrain b) a lot, we could just go and develop "pages to display
for pages that do not exist yet based on Wikidata" in the smaller
languages. That's a far cry from machine translating the articles, but it
would be a low hanging fruit. And it might help with a desire which is
evidently strongly expressed by the mass creation of articles through bots
in a growing number of languages. Even more constraints would still allow
us to use Wikidata items for tagging and structuring Commons in a
language-independent way (this was suggested by Erik earlier).

Current machine translation research aims at using massive machine learning
supported systems. They usually require big parallel corpora. We do not
have big parallel corpora (Wikipedia articles are not translations of each
other, in general), especially not for many languages, and there is no
reason to believe this is going to change. I would question if we want to
build an infrastructure for gathering those corpora from the Web
continuously. I do not think we can compete in this arena, or that is the
best use of our resources to support projects in this area. We should use
our unique features to our advantage.

How can we use the unique features of the Wikimedia movement to our
advantage? What are our unique features? Well, obviously, the awesome
community we are. Our technology, as amazing as it is, running our Websites
on the given budget, is nevertheless not what makes us what we are. Most
processes on the Wikimedia projects are developed in the community space,
and not implemented in bits. To summon Lessing, if code is law, Wikimedia
projects are really good in creating a space that allows for a community to
live in this space and have the freedom to create their own ecosystem.

One idea I have been mulling over for years is basically how we can use
this advantage for the task of creating content available in many
languages. Wikidata is an obvious attempt at that, but it really goes only
so far. The system I am really aiming at is a different one, and there has
been plenty of related work in this direction: imagine a wiki where you
enter or edit content, sentence by sentence, but the natural language
representation is just a surface syntax for an internal structure. Your
editing interface is a constrained but natural language. Now, in order to
really make this fly, both the rules for the parser (interpreting the
input) and the serializer (creating the output) would need to be editable
by the community - in addition to the content itself. There are a number of
major challenges involved, but I have by now a fair idea of how to tackle
most of them (and I don't have the time to detail them right now). Wikidata
has some design decisions inside it that are already geared towards enabling
solutions to some of the problems of this kind of wiki. Whatever a
structured Wiktionary would look like, it should also be aligned with the
requirements of the project outlined here. Basically, we take constraint b),
but make it possible for the community to push the constraint further and
further - that's how we could scale on this task.
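
To make the parser/serializer idea a bit more tangible, here is a toy sketch
in Python. The parse rule, the internal structure, the concept identifiers
and the per-language templates are all invented for illustration; in the
actual wiki they would be community-editable rules rather than hard-coded
data.

    import re

    # Language-neutral concept identifiers (invented placeholders here; in
    # practice they could be Wikidata items) with a per-language lexicon.
    LEXICON = {
        "en": {"Q_GATE": "the Brandenburg Gate", "Q_ARCH": "triumphal arch", "Q_BERLIN": "Berlin"},
        "de": {"Q_GATE": "das Brandenburger Tor", "Q_ARCH": "Triumphbogen", "Q_BERLIN": "Berlin"},
    }

    # Community-editable serializer rules: one surface template per language.
    TEMPLATES = {
        "en": "{subject} is a {type} in {location}.",
        "de": "{subject} ist ein {type} in {location}.",
    }

    # Community-editable parser rule for one constrained English sentence form.
    PARSE_RULE = re.compile(r"(?P<subject>.+) is a (?P<type>.+) in (?P<location>.+)\.", re.I)

    def parse(sentence):
        """Interpret a constrained-language sentence as a language-neutral structure."""
        match = PARSE_RULE.match(sentence)
        if not match:
            raise ValueError("sentence is outside the constraint defined so far")
        reverse = {label.lower(): qid for qid, label in LEXICON["en"].items()}
        return {role: reverse[text.lower()] for role, text in match.groupdict().items()}

    def serialize(structure, lang):
        """Render the internal structure back into one of the supported languages."""
        words = {role: LEXICON[lang][qid] for role, qid in structure.items()}
        sentence = TEMPLATES[lang].format(**words)
        return sentence[0].upper() + sentence[1:]

    fact = parse("The Brandenburg Gate is a triumphal arch in Berlin.")
    for lang in TEMPLATES:
        print(lang + ":", serialize(fact, lang))

Pushing the constraint further would then mean the community adding more
parse rules, templates and lexicon entries, rather than changing code.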

This would be far away from solving the problem of automatic translation of
text, and even further away from understanding text. But given where we are
and the resources we have available, I think it would be a more feasible
path towards achieving the mission of the Wikimedia movement than tackling
the problem of general machine learning.

In summary, I see four calls for action right now (and for all of them this
means first actually thinking more, writing down a project plan and gathering
input on it), which could and should be

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mark

On 4/24/13 8:29 AM, Erik Moeller wrote:

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.



I do think this is strategically relevant to Wikimedia. But there is 
already significant financial backing attempting to kickstart 
open-source MT, with some results. The goal is strategically relevant to 
another, much larger organization: the European Union. From 2006 through 
2012 they allocated about $10m to kickstart open-source MT, though 
focused primarily on European languages, via the EuroMatrix (2006-09) 
and EuroMatrixPlus (2009-12) research projects. One of the concrete 
results [1] of those projects was Moses, which I believe is currently 
the most actively developed open-source MT system. 
http://www.statmt.org/moses/


In light of that, I would suggest trying to see if we can adapt or join 
those efforts, rather than starting a new project or organization. One 
strategy could be to: 1) fund internal Wikimedia work to see if Moses 
can already be used for our purposes; and 2) fund improvements in cases 
where it isn't good enough yet (whether this is best done through grants 
to academic researchers, payments to contractors, hiring internal staff, 
or posting open bounties for implementing features, I haven't thought 
much about).
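
As a concrete illustration of what step 1) could involve, a minimal sketch
follows. It assumes a standard Moses installation whose decoder is invoked as
"moses -f moses.ini" and reads one tokenized source sentence per line on
stdin; the model path and the sample sentences are placeholders.

    # Sketch only: pipe a handful of source sentences through a locally
    # trained Moses model and print the decoder output for manual inspection.
    import subprocess

    def moses_translate(sentences, model_config="/path/to/moses.ini"):
        """Run tokenized source sentences through the Moses decoder, one per line."""
        result = subprocess.run(
            ["moses", "-f", model_config],
            input="\n".join(sentences) + "\n",
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.splitlines()

    samples = ["this is a test sentence .", "wikipedia is a free encyclopedia ."]
    for source, target in zip(samples, moses_translate(samples)):
        print(source, "->", target)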


Best,
Mark

[1] They have a nice list of other software and data coming out of the 
project as well: http://www.euromatrixplus.net/resources/


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Federico Leva (Nemo)

Erik Moeller, 24/04/2013 10:06:

[...] Moreover, the lens of project/domain name is a very arbitrary one to
define vertically focused efforts.


Good and interesting reasoning here; indeed something to keep in mind, 
but it adds problems.



There are specialized efforts
within Wikipedia that have more scale today than some of our sister
projects do, such as individual WikiProjects. There are efforts like
the partnerships with cultural institutions which have led to hundreds
of thousands of images being made available under a free license. Yet
I don't see you complaining about lack of support for GLAM tooling, or
WikiProject support (both of which are needed).


You're perhaps right about MZ, but GLAM tooling is surely something 
often asked for; however, it arguably falls under Commons development.
I've no idea what WikiProject support you have in mind, and WikiProjects 
are too often dangerous factions that should be disbanded rather than 
encouraged, but we may agree in principle.



Why should English
Wikinews with 15 active editors demand more collective attention than
any other specialized efforts?

Historically, we've drawn that project/domain name dividing line
because starting a new wiki was the best way to put a flag in the
ground and say "We will solve problem X". And we didn't know which
efforts would immediately succeed and which ones wouldn't. But in the
year 2013, you could just as well argue that instead of slapping the
Wikispecies logo on the frontpage of Wikipedia, we should make more
prominent mention of "How to contribute video on Wikipedia" or "Work
with your local museum" or "Become a campus ambassador" or any other
specialized effort which has shown promise but could use that extra
visibility.


Again, "how to contribute video" is just Commons promotion, work with 
museums is usually either Commons or Wikipedia (sometimes Wikisource), 
campus ambassadors are a program to improve some articles on some 
Wikipedias.
What I mean to say is that those are means rather than goals; you're not 
disagreeing with MZ that we shouldn't expand our goals further.



The idea that just because user X proposed project Y
sometime back in the early years of Wikimedia, effort Y must forever
be part of a first order prioritization lens, is not rationally
defensible.

So, even when our goal isn't simply to make general site improvements
that benefit everyone but to support specialized new forms of content
or collaboration, I wouldn't use project/domain name division as a
tool for assessing impact, but rather frame it in terms of "What
problem is being solved here? Who is going to be reached? How many
people will be impacted?" And sometimes that does translate well to the
lens of a single domain-name-level project, and sometimes it doesn't.


There's a general trend currently within the Wikimedia Foundation to
"narrow focus," which includes shelling out third-party MediaWiki release
support to an outside contractor or group, because there are apparently
not enough resources within the Wikimedia Foundation's 160-plus staff to
support the Wikimedia software platform for anyone other than Wikimedia.


It's not a question of whether we have enough resources to support it,
but of how best to put a financial boundary around third-party
engagement, while also actually enabling third parties to play an
important role in the process as well (including potentially chipping
in financial support).


In light of this, it seems even more unreasonable and against good sense
to pursue a new machine translation endeavor, virtuous as it may be.


To be clear: I was not proposing that WMF should undertake such an
effort directly. But I do think that if there are ways to support an
effort that has a reasonable probability of success, with a reasonable
structure of accountability around such an engagement, it's worth
assessing. And again, that position is entirely consistent with my
view that WMF should primarily invest in technologies with broad
horizontal impact (which open source MT could have) rather than
narrower, vertical impact.


In other words, we wouldn't be adding another goal alongside those of 
creating an encyclopedia, a media repository, a dictionary, a dictionary 
of quotations etc., but only a tool, to the extent needed by one or 
more of them?
Currently the only projects using machine translation or translation 
memory are our backstage wikis, the MediaWiki interface translation and 
some highly controversial article creation drives on a handful of small 
wikis (did those continue in the last couple of years?). Many ways exist to 
expand the scope of such a tool and the corpus we could provide to it, 
but the rationale of your proposal is currently a bit lacking and needs 
some work, that's all.


Nemo

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mathias Schindler
On Wed, Apr 24, 2013 at 8:29 AM, Erik Moeller  wrote:


> Could open source MT be such a strategic investment? I don't know, but
> I'd like to at least raise the question. I think the alternative will
> be, for the foreseeable future, to accept that this piece of
> technology will be proprietary, and to rely on goodwill for any
> integration that concerns Wikimedia. Not the worst outcome, but also
> not the best one.
>
> Are there open source MT efforts that are close enough to merit
> scrutiny?

http://www.statmt.org/moses/ is alive and kicking. Someone with a
background in computational linguistics should have a close look at it.

I would like to mention, however, that there are a couple of cases in
which commercial companies could be convinced to open source some of
their software, for example Mozilla. Google has open sourced Tesseract
for OCR. Google might see the value of their translation efforts not
just in the software itself but also in its actual integration into some
of their products (Gmail, Goggles, Glass), so that open sourcing it
would not hurt their financial interests. It appears to me that the
cost of simply asking a company like Google or Microsoft whether they
are willing to negotiate is small compared to the potential gain for everyone.

In any case, I would love to see WMF engage in the topic of machine translation.

Mathias

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Erik Moeller
On Wed, Apr 24, 2013 at 12:06 AM, MZMcBride  wrote:

> Though the Wikimedia community seems eager to add new projects (Wikidata,
> Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet
> another project when the current projects are largely neglected (Wikinews,
> Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).

I've stated before why I disagree with this characterization, and I
reject this framing. Functionality like the Visual Editor, the mobile
site improvements, Lua, and other core engineering initiatives aren't
limited in their impact to Wikipedia. The recent efforts on mobile
uploading are actually focused on Commons. Deploying new software
every two weeks and continually making key usability improvements is
not what neglect looks like.

What WMF rarely does is directly focus effort on functionality that
primarily serves narrower use cases, which I think is appropriate at
this point in the history of our endeavor. My view is that such narrower,
more vertically focused efforts should be enabled and supported by
creating structures like Labs where volunteers can meaningfully
prototype specialized functionality and work towards deployment on the
cluster.

Moreover, the lens of project/domain name is a very arbitrary one to
define vertically focused efforts. There are specialized efforts
within Wikipedia that have more scale today than some of our sister
projects do, such as individual WikiProjects. There are efforts like
the partnerships with cultural institutions which have led to hundreds
of thousands of images being made available under a free license. Yet
I don't see you complaining about lack of support for GLAM tooling, or
WikiProject support (both of which are needed). Why should English
Wikinews with 15 active editors demand more collective attention than
any other specialized efforts?

Historically, we've drawn that project/domain name dividing line
because starting a new wiki was the best way to put a flag in the
ground and say "We will solve problem X". And we didn't know which
efforts would immediately succeed and which ones wouldn't. But in the
year 2013, you could just as well argue that instead of slapping the
Wikispecies logo on the frontpage of Wikipedia, we should make more
prominent mention of "How to contribute video on Wikipedia" or "Work
with your local museum" or "Become a campus ambassador" or any other
specialized effort which has shown promise but could use that extra
visibility. The idea that just because user X proposed project Y
sometime back in the early years of Wikimedia, effort Y must forever
be part of a first order prioritization lens, is not rationally
defensible.

So, even when our goal isn't simply to make general site improvements
that benefit everyone but to support specialized new forms of content
or collaboration, I wouldn't use project/domain name division as a
tool for assessing impact, but rather frame it in terms of "What
problem is being solved here? Who is going to be reached? How many
people will be impacted?" And sometimes that does translate well to the
lens of a single domain-name-level project, and sometimes it doesn't.

> There's a general trend currently within the Wikimedia Foundation to
> "narrow focus," which includes shelling out third-party MediaWiki release
> support to an outside contractor or group, because there are apparently
> not enough resources within the Wikimedia Foundation's 160-plus staff to
> support the Wikimedia software platform for anyone other than Wikimedia.

It's not a question of whether we have enough resources to support it,
but of how best to put a financial boundary around third-party
engagement, while also actually enabling third parties to play an
important role in the process as well (including potentially chipping
in financial support).

> In light of this, it seems even more unreasonable and against good sense
> to pursue a new machine translation endeavor, virtuous as it may be.

To be clear: I was not proposing that WMF should undertake such an
effort directly. But I do think that if there are ways to support an
effort that has a reasonable probability of success, with a reasonable
structure of accountability around such an engagement, it's worth
assessing. And again, that position is entirely consistent with my
view that WMF should primarily invest in technologies with broad
horizontal impact (which open source MT could have) rather than
narrower, vertical impact.

Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be >7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread sankarshan
On Wed, Apr 24, 2013 at 11:59 AM, Erik Moeller  wrote:
> Could open source MT be such a strategic investment? I don't know, but
> I'd like to at least raise the question. I think the alternative will
> be, for the foreseeable future, to accept that this piece of
> technology will be proprietary, and to rely on goodwill for any
> integration that concerns Wikimedia. Not the worst outcome, but also
> not the best one.

There is a compelling need to assess the availability of training corpora
of significant breadth and depth for these languages. Most open-source
implementations of MT end up hitting this hurdle because content of
scale is not easily available. It would be appropriate to decide
whether WMF/Wikipedia is well placed to turn on a firehose-like API
that would enable MT implementations to use statistical and other
methods on the existing content itself.
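
As a rough illustration of what such a firehose could draw on today, here is
a sketch using the existing MediaWiki web API: prop=langlinks finds the
interlanguage-linked counterpart of an article, and prop=extracts
(TextExtracts) returns plain-text intros. Pairing those intros as training
material is only an illustrative assumption; they form a comparable corpus,
not a parallel one.

    import requests

    def intro(lang, title):
        """Fetch the plain-text introduction of an article."""
        resp = requests.get(
            "https://%s.wikipedia.org/w/api.php" % lang,
            params={"action": "query", "titles": title, "prop": "extracts",
                    "exintro": 1, "explaintext": 1, "format": "json"},
        )
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    def counterpart_title(title, target_lang):
        """Find the interlanguage-linked title of an English article."""
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={"action": "query", "titles": title, "prop": "langlinks",
                    "lllang": target_lang, "format": "json"},
        )
        pages = resp.json()["query"]["pages"]
        links = next(iter(pages.values())).get("langlinks", [])
        return links[0]["*"] if links else None

    title = "Machine translation"
    es_title = counterpart_title(title, "es")
    if es_title:
        print(intro("en", title)[:200])
        print(intro("es", es_title)[:200])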


--
sankarshan mukhopadhyay


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Federico Leva (Nemo)

A few links:
* 2010 discussion: 
https://strategy.wikimedia.org/wiki/Proposal:Free_Translation_Memory as 
one of the 
https://strategy.wikimedia.org/wiki/List_of_things_that_need_to_be_free 
(follow links, including)
* http://www.apertium.org : was used by translatewiki.net but isn't any 
longer https://translatewiki.net/wiki/Technology
* Translate also has a translation memory (of course the current use case is 
more limited); a query sketch follows after the list below
** Example exposed to the world 

** Docs 

** All Wikimedia projects share one 

** We could join forces if more FLOSS projects used Translate 
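
The "example exposed to the world" above would be a query along the lines of
this sketch. It assumes the Translate extension's ttmserver query action as
exposed by translatewiki.net; the parameter and field names here are recalled
from its documentation rather than verified, so treat them as assumptions.

    import requests

    resp = requests.get(
        "https://translatewiki.net/w/api.php",
        params={
            "action": "ttmserver",
            "sourcelanguage": "en",
            "targetlanguage": "fi",
            "text": "Save changes",
            "format": "json",
        },
    )
    for suggestion in resp.json().get("ttmserver", []):
        # Each suggestion should carry a similarity score and the stored translation.
        print(suggestion.get("quality"), suggestion.get("target"))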



Nemo

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread MZMcBride
Erik Moeller wrote:
>Could open source MT be such a strategic investment? I don't know, but
>I'd like to at least raise the question. I think the alternative will
>be, for the foreseeable future, to accept that this piece of
>technology will be proprietary, and to rely on goodwill for any
>integration that concerns Wikimedia. Not the worst outcome, but also
>not the best one.
>
>Are there open source MT efforts that are close enough to merit
>scrutiny? In order to be able to provide high quality result, you
>would need not only a motivated, well-intentioned group of people, but
>some of the smartest people in the field working on it.  I doubt we
>could more than kickstart an effort, but perhaps financial backing at
>significant scale could at least help a non-profit, open source effort
>to develop enough critical mass to go somewhere.

[...]

>Wikipedia and our other projects reach more than 500 million people every
>month. The world population is estimated to be >7 billion. Still a long
>way to go. Support us. Join us. Share: https://wikimediafoundation.org/

Putting aside the worrying focus on questionable metrics, the first part
of your new e-mail footer "Wikipedia and our other projects" seems to
hint at the underlying issue here: Wikimedia already operates a number of
projects (about a dozen), but truly supports only one (Wikipedia). Though
the Wikimedia community seems eager to add new projects (Wikidata,
Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet
another project when the current projects are largely neglected (Wikinews,
Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).

There's a general trend currently within the Wikimedia Foundation to
"narrow focus," which includes shelling out third-party MediaWiki release
support to an outside contractor or group, because there are apparently
not enough resources within the Wikimedia Foundation's 160-plus staff to
support the Wikimedia software platform for anyone other than Wikimedia.

In light of this, it seems even more unreasonable and against good sense
to pursue a new machine translation endeavor, virtuous as it may be. If
an outside organization wants Wikimedia's help and support and their
values align with ours, it's certainly something to explore. Otherwise,
surely we have enough projects in need of support already.

MZMcBride



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread George Herbert
I agree.  This is a timely observation about a major problem which directly 
affects the Foundation's core goals.

I am unsure how far an effort can go today given the state of the art and 
science, but I think that this is entirely appropriate to think about and 
investigate and perhaps either fund or bring attention to, perhaps both.


George William Herbert

On Apr 23, 2013, at 11:39 PM, Ting Chen  wrote:

> Oh yes, this would really be great. Just think about the money the Foundation 
> currently gives out for translation, plus the many, many volunteers' work 
> invested in translation. Free and open translation software is long 
> overdue indeed. Great idea, Erik.
> 
> Greetings
> Ting
> 
> On 4/24/2013 8:29 AM, Erik Moeller wrote:
>> Wikimedia's mission is to make the sum of all knowledge available to
>> every person on the planet. We do this by enabling communities in all
>> languages to organize and collect knowledge in our projects, removing
>> any barriers that we're able to remove.
>> 
>> In spite of this, there are and will always be large disparities in
>> the amount of locally created and curated knowledge available per
>> language, as is evident by simple statistical comparison (and most
>> beautifully visualized in Erik Zachte's bubble chart [1]).
>> 
>> Google, Microsoft and others have made great strides in developing
>> free-as-in-beer translation tools that can be used to translate from
>> and to many different languages. Increasingly, it is possible to at
>> least make basic sense of content in many different languages using
>> these tools. Machine translation can also serve as a starting point
>> for human translations.
>> 
>> Although free-as-in-beer for basic usage, integration can be
>> expensive. Google Translate charges $20 per 1M characters of text for
>> API usage. [2] These tools get better from users using them, but I've
>> seen little evidence of sharing of open datasets that would help the
>> field get better over time.
>> 
>> Undoubtedly, building the technology and the infrastructure for these
>> translation services is a very expensive undertaking, and it's
>> understandable that there are multiple commercial reasons that drive
>> the major players' ambitions in this space. But if we look at it from
>> the perspective of "How will billions of people learn in the coming
>> decades", it seems clear that better translation tools should at least
>> play some part in reducing knowledge disparities in different
>> languages, and that ideally, such tools should be "free-as-in-speech"
>> (since they're fundamentally related to speech itself).
>> 
>> If we imagine a world where top notch open source MT is available,
>> that would be a world where increasingly, language barriers to
>> accessing human knowledge could be reduced. True, translation is no
>> substitute for original content creation in a language -- but it could
>> at least powerfully support and enable such content creation, and
>> thereby help hundreds of millions of people. Beyond Wikimedia, high
>> quality open source MT would likely be integrated in many contexts
>> where it would do good for humanity and allow people to cross into
>> cultural and linguistic spaces they would otherwise not have access
>> to.
>> 
>> While Wikimedia is still only a medium-sized organization, it is not
>> poor. With more than 1M donors supporting our mission and a cash
>> position of $40M, we do now have a greater ability to make strategic
>> investments that further our mission, as communicated to our donors.
>> That's a serious level of trust and not to be taken lightly, either by
>> irresponsibly spending, or by ignoring our ability to do good.
>> 
>> Could open source MT be such a strategic investment? I don't know, but
>> I'd like to at least raise the question. I think the alternative will
>> be, for the foreseeable future, to accept that this piece of
>> technology will be proprietary, and to rely on goodwill for any
>> integration that concerns Wikimedia. Not the worst outcome, but also
>> not the best one.
>> 
>> Are there open source MT efforts that are close enough to merit
>> scrutiny? In order to be able to provide high quality result, you
>> would need not only a motivated, well-intentioned group of people, but
>> some of the smartest people in the field working on it.  I doubt we
>> could more than kickstart an effort, but perhaps financial backing at
>> significant scale could at least help a non-profit, open source effort
>> to develop enough critical mass to go somewhere.
>> 
>> All best,
>> Erik
>> 
>> [1] 
>> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
>> [2] https://developers.google.com/translate/v2/pricing
>> --
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation
>> 
>> Wikipedia and our other projects reach more than 500 million people every
>> month. The world population is estimated to be >7 billion. Still a long
>> way to go. Support us. Join us. Share: https://wikimediafoundation.org/

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-23 Thread Ting Chen
Oh yes, this would really be great. Just think about the money the 
Foundation currently gives out for translation, plus the many, many 
volunteers' work invested in translation. Free and open translation 
software is long overdue indeed. Great idea, Erik.


Greetings
Ting

On 4/24/2013 8:29 AM, Erik Moeller wrote:

Wikimedia's mission is to make the sum of all knowledge available to
every person on the planet. We do this by enabling communities in all
languages to organize and collect knowledge in our projects, removing
any barriers that we're able to remove.

In spite of this, there are and will always be large disparities in
the amount of locally created and curated knowledge available per
language, as is evident by simple statistical comparison (and most
beautifully visualized in Erik Zachte's bubble chart [1]).

Google, Microsoft and others have made great strides in developing
free-as-in-beer translation tools that can be used to translate from
and to many different languages. Increasingly, it is possible to at
least make basic sense of content in many different languages using
these tools. Machine translation can also serve as a starting point
for human translations.

Although free-as-in-beer for basic usage, integration can be
expensive. Google Translate charges $20 per 1M characters of text for
API usage. [2] These tools get better from users using them, but I've
seen little evidence of sharing of open datasets that would help the
field get better over time.

Undoubtedly, building the technology and the infrastructure for these
translation services is a very expensive undertaking, and it's
understandable that there are multiple commercial reasons that drive
the major players' ambitions in this space. But if we look at it from
the perspective of "How will billions of people learn in the coming
decades", it seems clear that better translation tools should at least
play some part in reducing knowledge disparities in different
languages, and that ideally, such tools should be "free-as-in-speech"
(since they're fundamentally related to speech itself).

If we imagine a world where top notch open source MT is available,
that would be a world where increasingly, language barriers to
accessing human knowledge could be reduced. True, translation is no
substitute for original content creation in a language -- but it could
at least powerfully support and enable such content creation, and
thereby help hundreds of millions of people. Beyond Wikimedia, high
quality open source MT would likely be integrated in many contexts
where it would do good for humanity and allow people to cross into
cultural and linguistic spaces they would otherwise not have access
to.

While Wikimedia is still only a medium-sized organization, it is not
poor. With more than 1M donors supporting our mission and a cash
position of $40M, we do now have a greater ability to make strategic
investments that further our mission, as communicated to our donors.
That's a serious level of trust and not to be taken lightly, either by
irresponsibly spending, or by ignoring our ability to do good.

Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.

All best,
Erik

[1] 
http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be >7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


[Wikimedia-l] The case for supporting open source machine translation

2013-04-23 Thread Erik Moeller
Wikimedia's mission is to make the sum of all knowledge available to
every person on the planet. We do this by enabling communities in all
languages to organize and collect knowledge in our projects, removing
any barriers that we're able to remove.

In spite of this, there are and will always be large disparities in
the amount of locally created and curated knowledge available per
language, as is evident by simple statistical comparison (and most
beautifully visualized in Erik Zachte's bubble chart [1]).

Google, Microsoft and others have made great strides in developing
free-as-in-beer translation tools that can be used to translate from
and to many different languages. Increasingly, it is possible to at
least make basic sense of content in many different languages using
these tools. Machine translation can also serve as a starting point
for human translations.

Although free-as-in-beer for basic usage, integration can be
expensive. Google Translate charges $20 per 1M characters of text for
API usage. [2] These tools get better from users using them, but I've
seen little evidence of sharing of open datasets that would help the
field get better over time.
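
To put that rate into perspective, a back-of-the-envelope calculation; the
article length and language count below are illustrative assumptions, not
measured figures.

    # Illustrative only: what bulk machine translation at the quoted API rate
    # could cost. Article size and language count are assumed round numbers.
    chars_per_article = 5_000        # assumed average article length in characters
    target_languages = 280           # roughly the number of Wikipedia language editions
    usd_per_million_chars = 20.0     # the Google Translate API rate quoted above

    usd_per_article = chars_per_article * target_languages / 1_000_000 * usd_per_million_chars
    print("Translating one such article into every language: $%.2f" % usd_per_article)  # $28.00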

Undoubtedly, building the technology and the infrastructure for these
translation services is a very expensive undertaking, and it's
understandable that there are multiple commercial reasons that drive
the major players' ambitions in this space. But if we look at it from
the perspective of "How will billions of people learn in the coming
decades", it seems clear that better translation tools should at least
play some part in reducing knowledge disparities in different
languages, and that ideally, such tools should be "free-as-in-speech"
(since they're fundamentally related to speech itself).

If we imagine a world where top notch open source MT is available,
that would be a world where increasingly, language barriers to
accessing human knowledge could be reduced. True, translation is no
substitute for original content creation in a language -- but it could
at least powerfully support and enable such content creation, and
thereby help hundreds of millions of people. Beyond Wikimedia, high
quality open source MT would likely be integrated in many contexts
where it would do good for humanity and allow people to cross into
cultural and linguistic spaces they would otherwise not have access
to.

While Wikimedia is still only a medium-sized organization, it is not
poor. With more than 1M donors supporting our mission and a cash
position of $40M, we do now have a greater ability to make strategic
investments that further our mission, as communicated to our donors.
That's a serious level of trust and not to be taken lightly, either by
irresponsibly spending, or by ignoring our ability to do good.

Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.

All best,
Erik

[1] 
http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be >7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l