Re: [Wikimedia-l] The case for supporting open source machine translation

2013-05-22 Thread Federico Leva (Nemo)

Erik Moeller, 24/04/2013 08:29:

[...]
Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.


Some info on the state of the art:
http://laxstrom.name/blag/2013/05/22/on-course-to-machine-translation/


Nemo

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-30 Thread Chris Tophe
2013/4/29 Mathieu Stumpf psychosl...@culture-libre.org

 Le 2013-04-26 20:27, Milos Rancic a écrit :

 OmegaWiki is a masterpiece from the perspective of a [computational]
 linguist. Erik made the structure so well that it's the best starting
 point for creating a contemporary multilingual dictionary. I didn't see
 anything better in concept. (And, yes, when I was thinking about
 creating such software on my own, I always hit the dead end of
 "but OmegaWiki is already that".)


 Where can I find documentation about this structure, please?



Here (planned structure):
http://meta.wikimedia.org/wiki/OmegaWiki_data_design

and also there (current structure):
http://www.omegawiki.org/Help:OmegaWiki_database_layout

And a gentle reminder that comments are requested ;-)
http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Nikola Smolenski

On 26/04/13 19:38, Bjoern Hoehrmann wrote:

* Andrea Zanni wrote:

At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.


Try also Distributed Proofreaders. It is my impression that Wikisource's 
proofreading standards are not always up to par.



As part of our project (
http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate for studying the reinsertion of proofread text
into DjVus [1], but so far hasn't found an interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.



[1] We thought about this both for OCR enhancement purposes and for
updating files on Commons and the Internet Archive (which is off topic here).


I built various tools that could be fairly easily adapted for this; my
notes are available via
http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
One of the tools, for instance, is a diff tool; see the
image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.


This is a very interesting approach :)

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Mathieu Stumpf

On 2013-04-26 17:00, Gerard Meijssen wrote:

Hoi,
When we invest in MT it is to convey knowledge, information and primarily
Wikipedia articles. They do not have the same problems poetry has. With
explanatory articles on a subject there is a web of associated concepts.
These concepts are likely to occur in any language if the subject exists in
that other language.

Consequently MT can work for Wikipedia and provide quite a solid
interpretation of what the article is about. This is helped when the
associated concepts are recognised as such and when the translations for
these concepts are used in the MT.
Thanks,
  GerardM


I think that poetry is just an easy-to-grasp example of the more general
problem of the entanglement of lexicon and meaning, which will appear at
some point. Different cultures will have different conceptualizations of
what one may perceive. So this is not just a matter of concept sets, but
rather of concept network dynamics: how concepts interact within a world
representation. And interaction means combinatorial problems, which require
enormous resources.


That said, I agree that having MT help adapt articles from one
language/culture to another would be useful.





On 26 April 2013 10:38, Mathieu Stumpf
psychosl...@culture-libre.org wrote:



 On 2013-04-25 20:56, Theo10011 wrote:

 As far as linguistic typology goes, it's far too unique and too varied to
have a language-independent form develop as easily. Perhaps it also depends
on the perspective. For example, the majority of people commenting here
(Americans, Europeans) might have exposure to a limited set of a linguistic
branch. Machine translations, as someone pointed out, are still not
preferred in some languages; even with years of research and potentially
unlimited resources at Google's disposal, they still come out sounding
clunky in some ways. And perhaps they will never get to the level of
absolute, where they are truly language independent.



To my mind, there's no such thing as absolute meaning. It's all about
interpretation in a given context by a given interpreter. I mean, I do
think that MT could probably be as good as a professional translator.
But even professional translators can't make perfect translations. I
already gave the example of poetry, but you may also take the example of
humour, which requires some cultural background; otherwise you have to
explain why it's funny, and you know that if you have to explain a joke,
it's not a joke.


 If you read some of
the discussions of linguistic relativity (the Sapir-Whorf hypothesis), there
is research to suggest that the language a person is born with dictates
their thought processes and their view of the world - there might not be
absolutes when it comes to linguistic cognition. There is something
inherently unique in the cognitive patterns of different languages.



That's just how the learning process works: you can't understand something
you haven't experienced. Reading an algorithm won't give you the insight
you get when you work through it mentally (with the help of pencil and
paper), and a textual description of making love won't give you the feeling
it provides.



 Which brings me to the point: why not English? Your idea seems plausible
enough even if you remove the abstract idea of complete language
universality, without venturing into the science-fiction labyrinth of
man-machine collaboration.



English has many so-called non-neutrality problems. As far as I know, if
the goal is to use a syntactically unambiguous human language, Lojban is
the best current candidate. English as an international language is a very
harmful situation. Believe it or not, I sometimes have to translate into
English sentences that were written in French, because the writer was
thinking in an English idiom that he translated poorly into French, his
native language, in which he doesn't know the idiomatic expression. Even
worse, I have read people using concepts under an English expression
because they never matched them with the French expressions they already
knew. And the other way around, I'm not sure that having millions of people
speaking broken English is a wonderful situation for this language either.

Search for "why not English as an international language" if you need more
documentation.


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Mathieu Stumpf

On 2013-04-26 19:57, Samuel Klein wrote:
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann derhoe...@gmx.net 
wrote:

* Erik Moeller wrote:

Are there open source MT efforts that are close enough to merit
scrutiny?


Wiktionary. If you want to help free software efforts in the area of
machine translation, then what they seem to need most is high quality
data about words, word forms, and so on, in a readily machine-usable
form, and freely licensed.


Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- it's discouraging to anyone else who might otherwise try to
build a brilliant free structured dictionary that we are *so close* to
getting it right.



If you have suggestions about the Wiktionaries' future, please consider
sharing them at https://meta.wikimedia.org/wiki/Wiktionary_future


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-29 Thread Mathieu Stumpf

On 2013-04-26 20:27, Milos Rancic wrote:
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein meta...@gmail.com 
wrote:

Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- it's discouraging to anyone else who might otherwise try to
build a brilliant free structured dictionary that we are *so close* to
getting it right.


OmegaWiki is a masterpiece from the perspective of a [computational]
linguist. Erik made the structure so well that it's the best starting
point for creating a contemporary multilingual dictionary. I didn't see
anything better in concept. (And, yes, when I was thinking about
creating such software on my own, I always hit the dead end of
"but OmegaWiki is already that".)


Where can I find documentation about this structure, please?


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-27 Thread Ryu Cheol
Thanks to Jane for introducing CoSyne. But I feel that not all wikis want to
be synchronized with certain other wikis. Rather than having identical
articles, I hope they would have their own articles. I hope I could have two
more tabs, to the right of 'Article' and 'Talk' on the English Wikipedia,
for the Korean language. The two tabs would be 'Article in Korean' and 'Talk
in Korean'. The translations would carry the same information as the
originals, and any editing on an article or a talk page in the translation
would go back to the originals. In this case they would need to be
synchronized precisely.

I mean that these would be done within the scope of the English Wikipedia,
not the Korean Wikipedia. But the Korean Wikipedia, linked on the left side
of a page, would eventually benefit from the translations in the English
Wikipedia, whenever a Korean Wikipedia editor finds that a good part of an
English Wikipedia article could be inserted into the Korean Wikipedia.

You can see the merits of an exact Korean translation of the English
Wikipedia, or of a scheme of exact translations of the big Wikipedias. It
would help you reach more potential contributors. It would lower the
language barrier for those who want to contribute to a Wikipedia whose
language they do not speak very well. Also, it could provide better-aligned
corpora, and it could track how human translators or reviewers improve the
translations.

Cheol

On 2013. 4. 26., at 9:04 PM, Jane Darnell jane...@gmail.com wrote:

 We already have the translation options on the left side of the screen
 in any Wikipedia article.
 This choice is generally a smattering of languages, and a long term
 goal for many small-language Wikipedias is to be able to translate an
 article from related languages (say from Dutch into Frisian, where the
 Frisian Wikipedia has no article at all on the title subject) and the
 even longer-term goal is to translate into some other
 really-really-really foreign language.
 
 Wouldn't it be easier however, to start with a project that uses
 translatewiki and the related-language pairs? Usually there is a big
 difference in numbers of articles (like between the Dutch Wikipedia
 and the Frisian Wikipedia). Presumably the demand is larger on the
 destination wikipedia (because there are fewer articles in those
 languages), and the potential number of human translators is larger
 (because most editors active in the smaller Wikipedia are versed in
 both languages).
 
 The Dutch Wikimedia chapter took part in a European multilingual
 synchronization tool project called CoSyne:
 http://cosyne.eu/index.php/Main_Page
 
 It was not a success, because it was hard to figure out how this would
 be beneficial to Wikipedians actually joining the project. Some
 funding that was granted to the chapter to work on the project will be
 returned, because it was never spent.
 
 In order to tackle this problem on a large scale, it needs to be
 broken down into words, sentences, paragraphs and perhaps other
 structures (category trees?). I think CoSyne was trying to do this. I
 think it would be easier to keep the effort in one-way-traffic, so try
 to offer machine translation from Dutch to Frisian and not the other
 way around, and then as you go, define concepts that work both ways,
 so that eventually it would be possible to translate from Frisian
 into Dutch.
 
 2013/4/26, Mathieu Stumpf psychosl...@culture-libre.org:
 On 2013-04-25 20:56, Theo10011 wrote:
 As far as linguistic typology goes, it's far too unique and too varied to
 have a language-independent form develop as easily. Perhaps it also depends
 on the perspective. For example, the majority of people commenting here
 (Americans, Europeans) might have exposure to a limited set of a linguistic
 branch. Machine translations, as someone pointed out, are still not
 preferred in some languages; even with years of research and potentially
 unlimited resources at Google's disposal, they still come out sounding
 clunky in some ways. And perhaps they will never get to the level of
 absolute, where they are truly language independent.
 
 To my mind, there's no such thing as absolute meaning. It's all about
 interpretation in a given context by a given interpreter. I mean, I do
 think that MT could probably be as good as a professional translator.
 But even professional translators can't make perfect translations. I
 already gave the example of poetry, but you may also take the example of
 humour, which requires some cultural background; otherwise you have to
 explain why it's funny, and you know that if you have to explain a joke,
 it's not a joke.
 
 If you read some of
 the discussions of linguistic relativity (the Sapir-Whorf hypothesis),
 there is research to suggest that the language a person is born with
 dictates their thought processes and their view of the world - there might
 not be absolutes when it comes to linguistic cognition. There is something
 inherently unique in the cognitive patterns of different languages.
 
 That's just how learning 

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Shlomi Fish
Hi all,

On Wed, 24 Apr 2013 08:39:55 +0200
Ting Chen wing.phil...@gmx.de wrote:

 Oh yes, this would really be great. Just think about the money the
 Foundation currently gives out for translation, plus the many, many
 volunteers' hours invested in translation. Free and open translation
 software is long overdue indeed. Great idea, Erik.
 

Unfortunately, I don't think we can expect, with the current state of the
art, that machine translation will do as good a job as a human translator,
so don't get your hopes up. For example, if we translate
http://shlomif.livejournal.com/63847.html to English with Google Translate we
get:
http://translate.google.com/translate?sl=iw&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&eotf=1&u=http%3A%2F%2Fshlomif.livejournal.com%2F63847.html&act=url


Yotam and hifh own and the Geek 

I have been offered several times to participate Bhifh and the Geek and I
refused. Those who have forgotten, this is what is said in the Bible parable of
Jotham :

And they told Jotham, he went and stood on a mountain top - Gerizim, and
lifted up his voice and called; And said to them - they heard me Shechem, and
God will hear you:

The trees went forth anointed king over them.
And they said olive Malka us!
Olive said unto them: I stopped the - fertilizers, which - I will honor God
and man - And go to the - the trees!
And the trees said to the fig: Go - the Kings of us!
The fig tree said unto them: I stopped the - sweetness, and - good yield -
And go to the - the trees!
And the trees said to the vine: Go - the Kings of us!
Vine said unto them: I stopped the - Tirosh, auspicious God and man -
And go to the - the trees!
And tell all - the trees to the - bramble: You're the king - on us!
And bramble said to the - trees: If in truth ye anoint me king over you -
come and take refuge in my shade; If - no - let fire come out - the bramble,
and devour the - cedars of Lebanon! 


It sounds incredibly awkward, and the main text was taken from
http://www.heraldmag.org/literature/doc_12.htm .

So it hardly does a good job, and we cannot expect it to replace human
translation.

Regards,

Shlomi Fish

-- 
-
Shlomi Fish   http://www.shlomifish.org/
http://www.shlomifish.org/humour/Summerschool-at-the-NSA/

I don’t believe in fairies. Oops! A fairy died.
I don’t believe in fairies. Oops! Another fairy died.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Mathieu Stumpf

On 2013-04-25 20:56, Theo10011 wrote:
As far as linguistic typology goes, it's far too unique and too varied to
have a language-independent form develop as easily. Perhaps it also depends
on the perspective. For example, the majority of people commenting here
(Americans, Europeans) might have exposure to a limited set of a linguistic
branch. Machine translations, as someone pointed out, are still not
preferred in some languages; even with years of research and potentially
unlimited resources at Google's disposal, they still come out sounding
clunky in some ways. And perhaps they will never get to the level of
absolute, where they are truly language independent.


To my mind, there's no such thing as absolute meaning. It's all about
interpretation in a given context by a given interpreter. I mean, I do
think that MT could probably be as good as a professional translator.
But even professional translators can't make perfect translations. I
already gave the example of poetry, but you may also take the example of
humour, which requires some cultural background; otherwise you have to
explain why it's funny, and you know that if you have to explain a joke,
it's not a joke.



If you read some of
the discussions of linguistic relativity (the Sapir-Whorf hypothesis), there
is research to suggest that the language a person is born with dictates
their thought processes and their view of the world - there might not be
absolutes when it comes to linguistic cognition. There is something
inherently unique in the cognitive patterns of different languages.


That's just how the learning process works: you can't understand something
you haven't experienced. Reading an algorithm won't give you the insight
you get when you work through it mentally (with the help of pencil and
paper), and a textual description of making love won't give you the feeling
it provides.



Which brings me to the point: why not English? Your idea seems plausible
enough even if you remove the abstract idea of complete language
universality, without venturing into the science-fiction labyrinth of
man-machine collaboration.


English has many so-called non-neutrality problems. As far as I know, if
the goal is to use a syntactically unambiguous human language, Lojban is
the best current candidate. English as an international language is a very
harmful situation. Believe it or not, I sometimes have to translate into
English sentences that were written in French, because the writer was
thinking in an English idiom that he translated poorly into French, his
native language, in which he doesn't know the idiomatic expression. Even
worse, I have read people using concepts under an English expression
because they never matched them with the French expressions they already
knew. And the other way around, I'm not sure that having millions of people
speaking broken English is a wonderful situation for this language either.

Search for "why not English as an international language" if you need more
documentation.


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Jane Darnell
We already have the translation options on the left side of the screen
in any Wikipedia article.
This choice is generally a smattering of languages, and a long term
goal for many small-language Wikipedias is to be able to translate an
article from related languages (say from Dutch into Frisian, where the
Frisian Wikipedia has no article at all on the title subject) and the
even longer-term goal is to translate into some other
really-really-really foreign language.

Wouldn't it be easier however, to start with a project that uses
translatewiki and the related-language pairs? Usually there is a big
difference in numbers of articles (like between the Dutch Wikipedia
and the Frisian Wikipedia). Presumably the demand is larger on the
destination wikipedia (because there are fewer articles in those
languages), and the potential number of human translators is larger
(because most editors active in the smaller Wikipedia are versed in
both languages).

The Dutch Wikimedia chapter took part in a European multilingual
synchronization tool project called CoSyne:
http://cosyne.eu/index.php/Main_Page

It was not a success, because it was hard to figure out how this would
be beneficial to Wikipedians actually joining the project. Some
funding that was granted to the chapter to work on the project will be
returned, because it was never spent.

In order to tackle this problem on a large scale, it needs to be
broken down into words, sentences, paragraphs and perhaps other
structures (category trees?). I think CoSyne was trying to do this. I
think it would be easier to keep the effort in one-way-traffic, so try
to offer machine translation from Dutch to Frisian and not the other
way around, and then as you go, define concepts that work both ways,
so that eventually it would be possible to translate from Frisian
into Dutch.

2013/4/26, Mathieu Stumpf psychosl...@culture-libre.org:
 On 2013-04-25 20:56, Theo10011 wrote:
 As far as linguistic typology goes, it's far too unique and too varied to
 have a language-independent form develop as easily. Perhaps it also depends
 on the perspective. For example, the majority of people commenting here
 (Americans, Europeans) might have exposure to a limited set of a linguistic
 branch. Machine translations, as someone pointed out, are still not
 preferred in some languages; even with years of research and potentially
 unlimited resources at Google's disposal, they still come out sounding
 clunky in some ways. And perhaps they will never get to the level of
 absolute, where they are truly language independent.

 To my mind, there's no such thing as absolute meaning. It's all about
 interpretation in a given context by a given interpreter. I mean, I do
 think that MT could probably be as good as a professional translator.
 But even professional translators can't make perfect translations. I
 already gave the example of poetry, but you may also take the example of
 humour, which requires some cultural background; otherwise you have to
 explain why it's funny, and you know that if you have to explain a joke,
 it's not a joke.

 If you read some of
 the discussions of linguistic relativity (the Sapir-Whorf hypothesis),
 there is research to suggest that the language a person is born with
 dictates their thought processes and their view of the world - there might
 not be absolutes when it comes to linguistic cognition. There is something
 inherently unique in the cognitive patterns of different languages.

 That's just how the learning process works: you can't understand something
 you haven't experienced. Reading an algorithm won't give you the insight
 you get when you work through it mentally (with the help of pencil and
 paper), and a textual description of making love won't give you the
 feeling it provides.


 Which brings me to the point: why not English? Your idea seems plausible
 enough even if you remove the abstract idea of complete language
 universality, without venturing into the science-fiction labyrinth of
 man-machine collaboration.

 English has many so-called non-neutrality problems. As far as I know, if
 the goal is to use a syntactically unambiguous human language, Lojban is
 the best current candidate. English as an international language is a very
 harmful situation. Believe it or not, I sometimes have to translate into
 English sentences that were written in French, because the writer was
 thinking in an English idiom that he translated poorly into French, his
 native language, in which he doesn't know the idiomatic expression. Even
 worse, I have read people using concepts under an English expression
 because they never matched them with the French expressions they already
 knew. And the other way around, I'm not sure that having millions of
 people speaking broken English is a wonderful situation for this language
 either.

 Search for "why not English as an international language" if you need more
 documentation.

 --
 Association Culture-Libre
 http://www.culture-libre.org/

 

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Bjoern Hoehrmann
* Andrea Zanni wrote:
At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.
As part of our project (
http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate for studying the reinsertion of proofread text
into DjVus [1], but so far hasn't found an interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.

[1] We thought about this both for OCR enhancement purposes and for
updating files on Commons and the Internet Archive (which is off topic here).

I built various tools that could be fairly easily adapted for this; my
notes are available via
http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
One of the tools, for instance, is a diff tool; see the
image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
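
[As a rough illustration of the kind of OCR-versus-proofread alignment
involved, here is a minimal Python sketch using difflib; the sample strings
are invented, and the actual tools mentioned above may work quite
differently.]

    import difflib

    # Align raw OCR output against the human-proofread text to harvest
    # error/correction pairs that could feed back into OCR training.
    ocr_text = "Tlie quick brown fox jurnps over the lazy dog."
    proofread_text = "The quick brown fox jumps over the lazy dog."

    matcher = difflib.SequenceMatcher(None, ocr_text, proofread_text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Each non-equal opcode is an OCR error plus its correction.
            print(f"{tag}: {ocr_text[i1:i2]!r} -> {proofread_text[j1:j2]!r}")
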
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Samuel Klein
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann derhoe...@gmx.net wrote:
 * Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit
scrutiny?

 Wiktionary. If you want to help free software efforts in the area of
 machine translation, then what they seem to need most is high quality
 data about words, word forms, and so on, in a readily machine-usable
 form, and freely licensed.

Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- it's discouraging to anyone else who might otherwise try to
build a brilliant free structured dictionary that we are *so close* to
getting it right.

   [ Andrea's ideas about using Wikisource to improve OCR tools ]

 I built various tools that could be fairly easily adapted for this; my
 notes are available via
 http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
 One of the tools, for instance, is a diff tool; see the
 image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.

I hope the related GSOC project gets support.  Getting mentoring from
Tesseract team members seems like a handy way to keep the projects
connected.


Tim Starling writes:
 We could basically clone the frontend component of Google Translate,
 and use Moses as a backend. The work would be mostly JavaScript...
 the next job would be to develop a corpus sharing site, hosting any
 available freely-licensed output of the frontend tool.

This would be most useful.  There are often short, quick translation
projects that I would like to do through this sort of TM-capturing
interface, for which the translatewiki prep process is rather time
consuming.
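
[As a sketch of what the backend half could look like: Moses ships with an
XML-RPC server (mosesserver), so a thin frontend mostly needs calls like the
one below. This is a minimal Python sketch assuming a locally running
mosesserver with a trained model; the host, port, and field names should be
checked against the Moses documentation.]

    import xmlrpc.client

    # Connect to a local mosesserver instance (assumed default RPC path).
    server = xmlrpc.client.ServerProxy("http://localhost:8080/RPC2")

    def translate(text: str) -> str:
        # mosesserver's "translate" method takes and returns a struct
        # whose "text" member carries the sentence.
        result = server.translate({"text": text})
        return result["text"]

    print(translate("dit is een huis"))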

We can set up a corpus sharing site now, with translatewiki - there is
already a lot of material there that could be part of it.  Different
corpora (say, encyclopedic articles v. dictionary pages v. quotes)
would need to be tagged for context.  And we could start letting
people upload their own freely licensed corpora to include as well.
We would probably want a vetting process before giving users the
import tool; or a quarantine until we had better ways to let editors
revert / bulk-modify entire imports.

SJ

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Milos Rancic
On Thu, Apr 25, 2013 at 4:26 PM, Denny Vrandečić
denny.vrande...@wikimedia.de wrote:
 Not just bootstrapping the content. By having the primary content be saved
 in a language independent form, and always translating it on the fly, it
 would not merely bootstrap content in different languages, but it would
 mean that editors from different languages would be working on the same
 content. The texts in the different languages are not translations of each
 other, but they are all created from the same source. There would be no
 primacy of, say, English.

What we can do is make the Simple English Wikipedia more useful, write
rewrite rules from the Simple English language to a Controlled English
language, and allow filling the content of the smaller Wikipedias from the
Simple English Wikipedia. That's the only way to get anything more useful
than Google Translate output.
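
[To make the idea concrete, a minimal Python sketch of such rewrite rules;
the rules themselves are invented examples, not an actual
Simple-to-Controlled English rule set.]

    import re

    # Invented example rules mapping freer phrasing onto a stricter,
    # more machine-tractable controlled vocabulary.
    RULES = [
        (re.compile(r"\bcan't\b"), "cannot"),    # expand contractions
        (re.compile(r"\bdon't\b"), "do not"),
        (re.compile(r"\ba lot of\b"), "many"),   # prefer countable forms
        (re.compile(r"\bin order to\b"), "to"),  # shorter connectives
    ]

    def to_controlled_english(sentence: str) -> str:
        for pattern, replacement in RULES:
            sentence = pattern.sub(replacement, sentence)
        return sentence

    print(to_controlled_english("You can't use a lot of idioms in order to stay clear."))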

There are serious problems in relation to the translation-of-translation
process, and that kind of complexity is not within the range of
contemporary science. (Basically, even good machine translation is not
within the range of contemporary science. Statistical approaches are
useful for getting a basic understanding, but very bad for writing an
encyclopedia or anything else which requires correct output in the
target language.)

On the much simpler scale of conversion engines, we can see that even 1%
errors (or manual interventions) are a serious issue for text
integrity, while translations of translations create many more
errors, whether there are human interventions or not. And
that's not acceptable to the average editor of the project in the target
language.

That said, we'd need serious linguistic work for every language added to
the system.

On the other side, I support Erik's intention to make a free software
tool for machine translation. But note that it's just the second step
(Wikidata was the first one) on a long road.

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-26 Thread Milos Rancic
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein meta...@gmail.com wrote:
 Yes.  Finding a way to capture and integrate the work OmegaWiki has
 done into a new Wikidata-powered Wiktionary would be a useful start.
 And we've already sort of claimed the space (though we are neglecting
 it) -- it's discouraging to anyone else who might otherwise try to
 build a brilliant free structured dictionary that we are *so close* to
 getting it right.

OmegaWiki is a masterpiece from the perspective of a [computational]
linguist. Erik made the structure so well that it's the best starting
point for creating a contemporary multilingual dictionary. I didn't see
anything better in concept. (And, yes, when I was thinking about
creating such software on my own, I always hit the dead end of
"but OmegaWiki is already that".)

On the other side, the OmegaWiki software is from the previous decade and
requires major fixes. And, obviously, the WMF should do that.

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Nikola Smolenski

On 24/04/13 12:35, Denny Vrandečić wrote:

Current machine translation research aims at using massive machine learning
supported systems. They usually require big parallel corpora. We do not
have big parallel corpora (Wikipedia articles are not translations of each
other, in general), especially not for many languages, and there is no


Could you define big? If 10% of Wikipedia articles are translations of 
each other, we have 2 million translation pairs. Assuming ten sentences 
per average article, this is 20 million sentence pairs. An average 
Wikipedia with 100,000 articles would have 10,000 translations and 
100,000 sentence pairs; a large Wikipedia with 1,000,000 articles would 
have 100,000 translations and 1,000,000 sentence pairs - is this not 
enough to kickstart a massive machine learning supported system? 
(Consider also that the articles are somewhat similar in structure and 
less rich than general text - future tense is rarely used for example.)
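
[A back-of-the-envelope check of these numbers in Python; the 10%
translation share and ten sentences per article are the same assumptions
used above, not measured values.]

    # Rough corpus-size estimate under the assumptions stated above.
    def corpus_estimate(articles, translated_share=0.10, sentences_per_article=10):
        pairs = int(articles * translated_share)
        return pairs, pairs * sentences_per_article

    # The 100k and 1M Wikipedia examples, plus ~20M articles overall.
    for size in (100_000, 1_000_000, 20_000_000):
        pairs, sentence_pairs = corpus_estimate(size)
        print(f"{size:>10} articles: {pairs:>9} article pairs, "
              f"{sentence_pairs:>10} sentence pairs")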


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Nikola Smolenski

On 24/04/13 12:35, Denny Vrandečić wrote:

In summary, I see four calls for action right now (and for all of them this
means to first actually think more and write down a project plan and gather
input on that), that could and should be tackled in parallel if possible:
I ) develop  a structured Wiktionary
II ) develop a feature that blends into Wikipedia's search if an article
about a topic does not exist yet, but we  have data on Wikidata about that
topic
III ) develop a multilingual search, tagging, and structuring environment
for Commons
IV ) develop structured Wiki content using natural language as a surface
syntax, with extensible parsers and serializers

None of these goals would require tens of millions or decades of research
and development. I think we could have an actionable plan developed within
a month or two for all four goals, and my gut feeling is we could reach
them all by 2015 or 16, depending when we actually start with implementing
them.


I fully support this, though! This is fully within Wikimedia's current 
infrastructure, and generally was planned to be done anyway.


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Erik Moeller
Denny,

very good and compelling reasoning as always. I think the argument
that we can potentially do a lot for the MT space (including open
source efforts) in part by getting our own house in order on the
dictionary side of things makes a lot of sense. I don't think it
necessarily excludes investing in open source MT efforts, but Mark
makes a good point that there are already existing institutions
pouring money into promising initiatives. Let me try to understand
some of the more complex ideas outlined in your note a bit better.

 The system I am really aiming at is a different one, and there has
 been plenty of related work in this direction: imagine a wiki where you
 enter or edit content, sentence by sentence, but the natural language
 representation is just a surface syntax for an internal structure. Your
 editing interface is a constrained, but natural language. Now, in order to
 really make this fly, both the rules for the parsers (interpreting the
 input) and the serializer (creating the output) would need to be editable
 by the community - in addition to the content itself. There are a number of
 major challenges involved, but I have by now a fair idea of how to tackle
 most of them (and I don't have the time to detail them right now).

So what would you want to enable with this? Faster bootstrapping of
content? How would it work, and how would this be superior to an
approach like the one taken in the Translate extension (basically,
providing good interfaces for 1:1 translation, tracking differences
between documents, and offering MT and translation memory based
suggestions)? Are there examples of this approach being taken
somewhere else?

Thanks,
Erik

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Mathieu Stumpf

On 2013-04-25 04:49, George Herbert wrote:
We can't usefully help with internet access (and that's proceeding at a good
pace even in the third world), but language will remain a barrier when
people get access.  In a few situations politics / firewalling is as well
(China, primarily), which is another strategic challenge.  That, however,
is political and geopolitical, and not an easy nut for WMF to crack.  Of
the three issues - net, firewalling, and language - one of them is something
we can work on.  We should think about how to work on that.  MT seems like
an obvious answer, but not the only possible one.


Do you have specific ideas in mind? Apart from having an international
language, with pedagogic material accessible to everyone and able to teach
it starting from zero knowledge, I fail to see many options. Personally,
I'm currently learning Esperanto, as I would be happy to participate in
such a process. I'm learning Esperanto because it seems to be the most
successful language so far for such a project. It's already used on
official Chinese sites, and there's a current petition you can sign to make
it an official European language [1].


[1] 
https://secure.avaaz.org/en/petition/Esperanto_langue_officielle_de_lUE/



--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Brion Vibber
On Thu, Apr 25, 2013 at 7:26 AM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:

 Not just bootstrapping the content. By having the primary content be saved
 in a language independent form, and always translating it on the fly, it
 would not merely bootstrap content in different languages, but it would
 mean that editors from different languages would be working on the same
 content. The texts in the different languages are not translations of each
 other, but they are all created from the same source. There would be no
 primacy of, say, English.


You are blowing my mind, dude. :)

I suspect this approach won't serve for everything, but it sounds
*awesome*. If we can tie natural-language statements directly to data nodes
(rather than merely annotating vague references like we do today), then
we'd be much better able to keep language versions in sync. How to make
them sane to edit... sounds harder. :)

It would be foolish to create any such plan without reusing tools and
 concepts from the Translate extension, translation memories, etc. There is
 a lot of UI and conceptual goodness in these tools. The idea would be to
 make them user extensible with rules.


Heck yeah!

If you want examples of that, there are the bots working on some Wikipedias
 currently, creating text from structured input. They are partially reusing
 the same structured input, and merely need a translation of the way the
 bots create the text to save in the given Wikipedia. I have seen some
 research in the area, but it all has one or another drawback; still, it can
 and should be used as an inspiration and to inform the project (like
 Allegro Controlled English, or a chat program developed at the Open
 University in Milton Keynes to allow conducting business in different
 languages, etc.)


Yes -- make them real-time updatable, instead of one-time bots producing
language which can't be maintained.

-- brion
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Denny Vrandečić
2013/4/25 Brion Vibber bvib...@wikimedia.org

 You are blowing my mind, dude. :)


Glad to hear that :)


I suspect this approach won't serve for everything, but it sounds
 *awesome*. If we can tie natural-language statements directly to data nodes
 (rather than merely annotating vague references like we do today), then
 we'd be much better able to keep language versions in sync. How to make
 them sane to edit... sounds harder. :)


Absolutely correct, it would not serve for everything. And it doesn't have
to. For an encyclopedia we should be able to get a useful number of
frames in a decent timeframe. For song lyrics, it might take a bit longer.

It would and should start with a restricted set of possible frames, but the
trick would be to make them user extensible. Because that is what we are
good at -- users who fill and extend the frameworks we provide. I don't
know of much work where the frames and rules themselves are user editable
and extensible, but heck, people said we were crazy when we made the
properties user editable and extensible in Semantic MediaWiki and later
Wikidata, and it seems to be working out.
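
[A toy Python illustration of the direction, not the actual design: one
language-independent frame plus per-language serializer templates. In a
real system both the rules and the labels would be community-editable, and
the identifiers below are invented.]

    # One language-independent statement ("frame") with invented IDs.
    frame = {"type": "birth", "person": "P1", "place": "P2", "year": 1756}

    labels = {
        "en": {"P1": "Wolfgang Amadeus Mozart", "P2": "Salzburg"},
        "de": {"P1": "Wolfgang Amadeus Mozart", "P2": "Salzburg"},
    }

    # Bare templates stand in for the community-edited grammar rules.
    serializers = {
        "en": "{person} was born in {place} in {year}.",
        "de": "{person} wurde {year} in {place} geboren.",
    }

    def render(frame, lang):
        return serializers[lang].format(
            person=labels[lang][frame["person"]],
            place=labels[lang][frame["place"]],
            year=frame["year"],
        )

    print(render(frame, "en"))  # English surface text
    print(render(frame, "de"))  # German surface text from the same frame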

A sane editing interface - both for the rules and the content, and their
interaction - would be something that would need to be explored first, just
to check whether this is indeed possible or just wishful thinking. Starting
without this kind of exploration beforehand would be a bit adventurous, or
optimistic.

Cheers,
Denny


-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread Theo10011
On Thu, Apr 25, 2013 at 7:56 PM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:

 Not just bootstrapping the content. By having the primary content be saved
 in a language independent form, and always translating it on the fly, it
 would not merely bootstrap content in different languages, but it would
 mean that editors from different languages would be working on the same
 content. The texts in the different languages are not translations of each
 other, but they are all created from the same source. There would be no
 primacy of, say, English.


This is a thought, but I've never heard of a language-independent form. I
also question its importance to your core idea vs., say, a primary language
of choice. An argument can be made that language independence on a computer
medium can't exist: down to a programming language, the instructions and
even binary bits, there is a language running on top of higher inputs (even
transitioning between computer languages isn't at an absolute level) - to
that extent, I wonder if data can truly be language independent.

As far as linguistic typology goes, it's far too unique and too varied to
have a language-independent form develop as easily. Perhaps it also depends
on the perspective. For example, the majority of people commenting here
(Americans, Europeans) might have exposure to a limited set of a linguistic
branch. Machine translations, as someone pointed out, are still not
preferred in some languages; even with years of research and potentially
unlimited resources at Google's disposal, they still come out sounding
clunky in some ways. And perhaps they will never get to the level of
absolute, where they are truly language independent. If you read some of
the discussions of linguistic relativity (the Sapir-Whorf hypothesis), there
is research to suggest that the language a person is born with dictates
their thought processes and their view of the world - there might not be
absolutes when it comes to linguistic cognition. There is something
inherently unique in the cognitive patterns of different languages.

Which brings me to the point: why not English? Your idea seems plausible
enough even if you remove the abstract idea of complete language
universality, without venturing into the science-fiction labyrinth of
man-machine collaboration.

Regards
Theo
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-25 Thread George Herbert
This subthread seems headed out into practical / applied epistemology, if
there is such a thing.

I am not sure if we can get from here to there; that said, a new structure
with language independent facts / information points that then got
machine-explained or described in a local language would be an interesting
structure to build an encyclopedia around.  Wikidata is a good idea but not
enough here.  I'm not sure the state of knowledge theory and practice is
good enough to do this, but I am suddenly more interested in IBM's Watson
project and some related knowledge / natural language interaction AI work...

This is very interesting, but probably less midterm-practical than machine
translation and the existing WP / other project data.


On Thu, Apr 25, 2013 at 8:46 AM, Denny Vrandečić 
denny.vrande...@wikimedia.de wrote:

 2013/4/25 Mathieu Stumpf psychosl...@culture-libre.org

  What would be the limits you would expect from your solution? Because you
  can't expect to just translate everything. Form may be a part of the
  meaning. It's clear that you can't translate a poem, for example. Sure,
  Wikipedia is not primarily concerned with poetry, but it does treat the
  subject.
 
 
 I don't know where the limits would be. Probably further than we think
 right now, but yes, they would still be there, and severe. The nice thing is
 that we would be collecting data about the limits constantly, and could
 thus feed the system to further improve and grow. Not automatically (I
 guess, but bots would obviously also be allowed to work on the rules as
 well), but through human intelligence, analyzing the input and trying to
 refine and extend the rules.

 But, considering the already existing bot-created articles, which number in
 the hundreds of thousands in languages like Swedish, Dutch, or Polish, there
 seems to be some consensus that this can be considered a useful starting
 block. It's just that with the current system, even with Wikidata, we
 cannot really grow further in this direction.

 Cheers,
 Denny

 --
 Project director Wikidata
 Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
 Tel. +49-30-219 158 26-0 | http://wikimedia.de

 Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
 Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
 der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
 Körperschaften I Berlin, Steuernummer 27/681/51985.
 ___
 Wikimedia-l mailing list
 Wikimedia-l@lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l




-- 
-george william herbert
george.herb...@gmail.com
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Ting Chen
Oh yes, this would really be great. Just think about the money the
Foundation currently gives out for translation, plus the many, many
volunteers' hours invested in translation. Free and open translation
software is long overdue indeed. Great idea, Erik.


Greetings
Ting

On 4/24/2013 8:29 AM, Erik Moeller wrote:

Wikimedia's mission is to make the sum of all knowledge available to
every person on the planet. We do this by enabling communities in all
languages to organize and collect knowledge in our projects, removing
any barriers that we're able to remove.

In spite of this, there are and will always be large disparities in
the amount of locally created and curated knowledge available per
language, as is evident by simple statistical comparison (and most
beautifully visualized in Erik Zachte's bubble chart [1]).

Google, Microsoft and others have made great strides in developing
free-as-in-beer translation tools that can be used to translate from
and to many different languages. Increasingly, it is possible to at
least make basic sense of content in many different languages using
these tools. Machine translation can also serve as a starting point
for human translations.

Although free-as-in-beer for basic usage, integration can be
expensive. Google Translate charges $20 per 1M characters of text for
API usage. [2] These tools get better from users using them, but I've
seen little evidence of sharing of open datasets that would help the
field get better over time.

Undoubtedly, building the technology and the infrastructure for these
translation services is a very expensive undertaking, and it's
understandable that there are multiple commercial reasons that drive
the major players' ambitions in this space. But if we look at it from
the perspective of "how will billions of people learn in the coming
decades", it seems clear that better translation tools should at least
play some part in reducing knowledge disparities in different
languages, and that ideally, such tools should be free-as-in-speech
(since they're fundamentally related to speech itself).

If we imagine a world where top notch open source MT is available,
that would be a world where increasingly, language barriers to
accessing human knowledge could be reduced. True, translation is no
substitute for original content creation in a language -- but it could
at least powerfully support and enable such content creation, and
thereby help hundreds of millions of people. Beyond Wikimedia, high
quality open source MT would likely be integrated in many contexts
where it would do good for humanity and allow people to cross into
cultural and linguistic spaces they would otherwise not have access
to.

While Wikimedia is still only a medium-sized organization, it is not
poor. With more than 1M donors supporting our mission and a cash
position of $40M, we do now have a greater ability to make strategic
investments that further our mission, as communicated to our donors.
That's a serious level of trust and not to be taken lightly, either by
irresponsibly spending, or by ignoring our ability to do good.

Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.

All best,
Erik

[1] 
http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be 7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread MZMcBride
Erik Moeller wrote:
Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.

[...]

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be 7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/

Putting aside the worrying focus on questionable metrics, the first part
of your new e-mail footer Wikipedia and our other projects seems to
hint at the underlying issue here: Wikimedia already operates a number of
projects (about a dozen), but truly supports only one (Wikipedia). Though
the Wikimedia community seems eager to add new projects (Wikidata,
Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet
another project when the current projects are largely neglected (Wikinews,
Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).

There's a general trend currently within the Wikimedia Foundation to
narrow focus, which includes shelling out third-party MediaWiki release
support to an outside contractor or group, because there are apparently
not enough resources within the Wikimedia Foundation's 160-plus staff to
support the Wikimedia software platform for anyone other than Wikimedia.

In light of this, it seems even more unreasonable and against good sense
to pursue a new machine translation endeavor, virtuous as it may be. If
an outside organization wants Wikimedia's help and support and their
values align with ours, it's certainly something to explore. Otherwise,
surely we have enough projects in need of support already.

MZMcBride



___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Federico Leva (Nemo)

A few links:
* 2010 discussion: 
https://strategy.wikimedia.org/wiki/Proposal:Free_Translation_Memory as 
one of the 
https://strategy.wikimedia.org/wiki/List_of_things_that_need_to_be_free 
(follow links, including)
* http://www.apertium.org : was used by translatewiki.net but isn't any 
longer https://translatewiki.net/wiki/Technology
* Translate also has a translation memory (of course the current use case is 
more limited)
** Example exposed to the world (see the sketch below) 
http://translatewiki.net/w/api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=january&format=jsonfm
** Docs 
https://www.mediawiki.org/wiki/Help:Extension:Translate/Translation_memories#TTMServer_API
** All Wikimedia projects share one 
http://laxstrom.name/blag/2012/09/07/translation-memory-all-wikimedia-wikis/
** We could join forces if more FLOSS projects used Translate 
https://translatewiki.net/wiki/Translate_Roll
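
For the curious, a minimal sketch of querying that TTMServer endpoint from
Python (untested; the parameters mirror the example URL above, with
format=json for machine consumption, and the response field names follow
the linked docs, so verify them against the live API):

    import requests

    # Ask translatewiki.net's translation memory for Finnish suggestions
    # matching the English string "january".
    resp = requests.get(
        "https://translatewiki.net/w/api.php",
        params={
            "action": "ttmserver",
            "sourcelanguage": "en",
            "targetlanguage": "fi",
            "text": "january",
            "format": "json",
        },
    )
    for suggestion in resp.json().get("ttmserver", []):
        print(suggestion["quality"], suggestion["target"])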


Nemo

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Erik Moeller
On Wed, Apr 24, 2013 at 12:06 AM, MZMcBride z...@mzmcbride.com wrote:

 Though the Wikimedia community seems eager to add new projects (Wikidata,
 Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet
 another project when the current projects are largely neglected (Wikinews,
 Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).

I've stated before why I disagree with this characterization, and I
reject this framing. Functionality like the Visual Editor, the mobile
site improvements, Lua, and other core engineering initiatives aren't
limited in their impact to Wikipedia. The recent efforts on mobile
uploading are actually focused on Commons. Deploying new software
every two weeks and continually making key usability improvements is
not what neglect looks like.

What WMF rarely does is directly focus effort on functionality that
primarily serves narrower use cases, which I think is appropriate at
this point in the history of our endeavor. My view is that such narrower,
more vertically focused efforts should be enabled and supported by
creating structures like Labs where volunteers can meaningfully
prototype specialized functionality and work towards deployment on the
cluster.

Moreover, the lens of project/domain name is a very arbitrary one by which
to define vertically focused efforts. There are specialized efforts
within Wikipedia that have more scale today than some of our sister
projects do, such as individual WikiProjects. There are efforts like
the partnerships with cultural institutions which have led to hundreds
of thousands of images being made available under a free license. Yet
I don't see you complaining about lack of support for GLAM tooling, or
WikiProject support (both of which are needed). Why should English
Wikinews, with 15 active editors, demand more collective attention than
any other specialized effort?

Historically, we've drawn that project/domain name dividing line
because starting a new wiki was the best way to put a flag in the
ground and say "We will solve problem X". And we didn't know which
efforts would immediately succeed and which ones wouldn't. But in the
year 2013, you could just as well argue that instead of slapping the
Wikispecies logo on the frontpage of Wikipedia, we should make more
prominent mention of "How to contribute video on Wikipedia" or "Work
with your local museum" or "Become a campus ambassador" or any other
specialized effort which has shown promise but could use that extra
visibility. The idea that just because user X proposed project Y
sometime back in the early years of Wikimedia, effort Y must forever
be part of a first-order prioritization lens, is not rationally
defensible.

So, even when our goal isn't simply to make general site improvements
that benefit everyone but to support specialized new forms of content
or collaboration, I wouldn't use project/domain name division as a
tool for assessing impact, but rather frame it in terms of "What
problem is being solved here? Who is going to be reached? How many
people will be impacted?" And sometimes that does translate well to
the lens of a single domain-name-level project, and sometimes it doesn't.

 There's a general trend currently within the Wikimedia Foundation to
 narrow focus, which includes shelling out third-party MediaWiki release
 support to an outside contractor or group, because there are apparently
 not enough resources within the Wikimedia Foundation's 160-plus staff to
 support the Wikimedia software platform for anyone other than Wikimedia.

It's not a question of whether we have enough resources to support it,
but of how best to put a financial boundary around third-party
engagement, while also actually enabling third parties to play an
important role in the process as well (including potentially chipping
in financial support).

 In light of this, it seems even more unreasonable and against good sense
 to pursue a new machine translation endeavor, virtuous as it may be.

To be clear: I was not proposing that WMF should undertake such an
effort directly. But I do think that if there are ways to support an
effort that has a reasonable probability of success, with a reasonable
structure of accountability around such an engagement, it's worth
assessing. And again, that position is entirely consistent with my
view that WMF should primarily invest in technologies with broad
horizontal impact (which open source MT could have) rather than
narrower, vertical impact.

Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be 7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mathias Schindler
On Wed, Apr 24, 2013 at 8:29 AM, Erik Moeller e...@wikimedia.org wrote:


 Could open source MT be such a strategic investment? I don't know, but
 I'd like to at least raise the question. I think the alternative will
 be, for the foreseeable future, to accept that this piece of
 technology will be proprietary, and to rely on goodwill for any
 integration that concerns Wikimedia. Not the worst outcome, but also
 not the best one.

 Are there open source MT efforts that are close enough to merit
 scrutiny?

http://www.statmt.org/moses/ is live and kicking. Someone with a
background in computational linguistics should have a close look at it.

I would like to mention, however, that there are a couple of cases in
which commercial companies could be convinced to open source some of
their software, Mozilla for example. Google has open sourced Tesseract
for OCR. Google might see the value of its translation efforts not
just in the software but also in the actual integration into some of
its products (Gmail, Goggles, Glass), so that open sourcing it would
not hurt its financial interests. It appears to me that the cost of
simply asking a company like Google or Microsoft whether they are
willing to negotiate is small compared to the potential gain for
everyone.

In any case, I would love to see WMF engage in the topic of machine translation.

Mathias

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Federico Leva (Nemo)

Erik Moeller, 24/04/2013 10:06:

[...] Moreover, the lens of project/domain name is a very arbitrary one to
define vertically focused efforts.


Good and interesting reasoning here. Indeed something to keep in mind, 
but it adds problems.



There are specialized efforts
within Wikipedia that have more scale today than some of our sister
projects do, such as individual WikiProjects. There are efforts like
the partnerships with cultural institutions which have led to hundreds
of thousands of images being made available under a free license. Yet
I don't see you complaining about lack of support for GLAM tooling, or
WikiProject support (both of which are needed).


You're perhaps right about MZ, but surely GLAM tooling is something 
often asked for; however, it arguably falls under Commons development.
I've no idea what WikiProject support you have in mind, and surely 
WikiProjects are too often dangerous factions that ought to be disbanded 
rather than encouraged, but we may agree in principle.



Why should English
Wikinews with 15 active editors demand more collective attention than
any other specialized efforts?

Historically, we've drawn that project/domain name dividing line
because starting a new wiki was the best way to put a flag in the
ground and say "We will solve problem X". And we didn't know which
efforts would immediately succeed and which ones wouldn't. But in the
year 2013, you could just as well argue that instead of slapping the
Wikispecies logo on the frontpage of Wikipedia, we should make more
prominent mention of "How to contribute video on Wikipedia" or "Work
with your local museum" or "Become a campus ambassador" or any other
specialized effort which has shown promise but could use that extra
visibility.


Again, "how to contribute video" is just Commons promotion, work with 
museums is usually either Commons or Wikipedia (sometimes Wikisource), 
and campus ambassadors are a program to improve some articles on some 
Wikipedias.
What I mean to say is that those are means rather than goals; you're not 
disagreeing with MZ that we shouldn't expand our goals further.



The idea that just because user X proposed project Y
sometime back in the early years of Wikimedia, effort Y must forever
be part of a first order prioritization lens, is not rationally
defensible.

So, even when our goal isn't simply to make general site improvements
that benefit everyone but to support specialized new forms of content
or collaboration, I wouldn't use project/domain name division as a
tool for assessing impact, but rather frame it in terms of "What
problem is being solved here? Who is going to be reached? How many
people will be impacted?" And sometimes that does translate well to
lens of a single domain name level project, and sometimes it doesn't.


There's a general trend currently within the Wikimedia Foundation to
narrow focus, which includes shelling out third-party MediaWiki release
support to an outside contractor or group, because there are apparently
not enough resources within the Wikimedia Foundation's 160-plus staff to
support the Wikimedia software platform for anyone other than Wikimedia.


It's not a question whether we have enough resources to support it,
but how to best put a financial boundary around third party
engagement, while also actually enabling third parties to play an
important role in the process as well (including potentially chipping
in financial support).


In light of this, it seems even more unreasonable and against good sense
to pursue a new machine translation endeavor, virtuous as it may be.


To be clear: I was not proposing that WMF should undertake such an
effort directly. But I do think that if there are ways to support an
effort that has a reasonable probability of success, with a reasonable
structure of accountability around such an engagement, it's worth
assessing. And again, that position is entirely consistent with my
view that WMF should primarily invest in technologies with broad
horizontal impact (which open source MT could have) rather than
narrower, vertical impact.


In other words, we wouldn't be adding another goal alongside those of 
creating an encyclopedia, a media repository, a dictionary, a dictionary 
of quotations etc., but only a tool, to the extent needed by one or 
more of them?
Currently the only projects using machine translation or translation 
memory are our backstage wikis, the MediaWiki interface translation and 
some highly controversial article creation drives on a handful of small 
wikis (did they continue in the last couple of years?). Many ways exist to 
expand the scope of such a tool and the corpus we could provide to it, 
but the rationale of your proposal is currently a bit lacking and needs 
some work, that's all.


Nemo

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mark

On 4/24/13 8:29 AM, Erik Moeller wrote:

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.



I do think this is strategically relevant to Wikimedia. But there is 
already significant financial backing attempting to kickstart 
open-source MT, with some results. The goal is strategically relevant to 
another, much larger organization: the European Union. From 2006 through 
2012 they allocated about $10m to kickstart open-source MT, though 
focused primarily on European languages, via the EuroMatrix (2006-09) 
and EuroMatrixPlus (2009-12) research projects. One of the concrete 
results [1] of those projects was Moses, which I believe is currently 
the most actively developed open-source MT system. 
http://www.statmt.org/moses/


In light of that, I would suggest trying to see if we can adapt or join 
those efforts, rather than starting a new project or organization. One 
strategy could be to: 1) fund internal Wikimedia work to see if Moses 
can already be used for our purposes; and 2) fund improvements in cases 
where it isn't good enough yet (whether this is best done through grants 
to academic researchers, payments to contractors, hiring internal staff, 
or posting open bounties for implementing features, I haven't thought 
much about).
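
To make step 1 concrete: once a Moses model has been trained, using it is
little more than piping text through the decoder. A minimal sketch,
assuming a local "moses" binary and a moses.ini produced by Moses's
standard training pipeline (path and setup are assumptions):

    import subprocess

    def moses_translate(sentences, ini="model/moses.ini"):
        # Pipe tokenized source sentences (one per line) through a trained
        # Moses decoder and collect one translation per line back.
        proc = subprocess.run(
            ["moses", "-f", ini],
            input="\n".join(sentences),
            capture_output=True, text=True, check=True,
        )
        return proc.stdout.splitlines()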


Best,
Mark

[1] They have a nice list of other software and data coming out of the 
project as well: http://www.euromatrixplus.net/resources/


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Denny Vrandečić
Erik, all,

sorry for the long mail.

Incidentally, I have been thinking in this direction myself for a while,
and I have come to a number of conclusions:
1) the Wikimedia movement cannot, in its current state, tackle the problem
of machine translation of arbitrary text from and to all of our supported
languages
2) the Wikimedia movement is probably the single most important source of
training data already. In research that I have done with colleagues,
Wikimedia corpora used as training data easily beat other corpora, and others
are using Wikimedia corpora routinely already. There is not much we can
improve here, actually
3) Wiktionary could be an even more amazing resource if we would finally
tackle the issue of structuring its content more appropriately. I think
Wikidata opened a few venues to structure planning in this direction and
provide some software, but this would have the potential to provide more
support for any external project than many other things we could tackle

Looking at the first statement, there are two ways we could constrain it to
make it possibly feasible:
a) constrain the number of supported languages. Whereas this would be
technically the simpler solution, I think there is agreement that this is
not in our interest at all
b) constrain the kind of input text we want to support

If we constrain b) a lot, we could just go and develop pages to display,
based on Wikidata, for pages that do not exist yet in the smaller
languages. That's a far cry from machine translating the articles, but it
would be low-hanging fruit. And it might help with a desire which is
evidently strongly expressed by the mass creation of articles through bots
in a growing number of languages. Even more constraints would still allow
us to use Wikidata items for tagging and structuring Commons in a
language-independent way (this was suggested by Erik earlier).

Current machine translation research aims at massive machine-learning
systems. They usually require big parallel corpora. We do not
have big parallel corpora (Wikipedia articles are not translations of each
other, in general), especially not for many languages, and there is no
reason to believe this is going to change. I would question whether we want
to build an infrastructure for gathering those corpora from the Web
continuously. I do not think we can compete in this arena, or that it is the
best use of our resources to support projects in this area. We should use
our unique features to our advantage.

How can we use the unique features of the Wikimedia movement to our
advantage? What are our unique features? Well, obviously, the awesome
community we are. Our technology, as amazing as it is, running our websites
on the given budget, is nevertheless not what makes us what we are. Most
processes on the Wikimedia projects are developed in the community space,
and not implemented in bits. To invoke Lessig: if "code is law", Wikimedia
projects are really good at creating a space that allows a community to
live in it and have the freedom to create its own ecosystem.

One idea I have been mulling over for years is basically how we can use
this advantage for the task of creating content available in many
languages. Wikidata is an obvious attempt at that, but it really goes only
so far. The system I am really aiming at is a different one, and there has
been plenty of related work in this direction: imagine a wiki where you
enter or edit content, sentence by sentence, but the natural language
representation is just a surface syntax for an internal structure. Your
editing interface is a constrained, but natural, language. Now, in order to
really make this fly, both the rules for the parsers (interpreting the
input) and the serializers (creating the output) would need to be editable
by the community - in addition to the content itself. There are a number of
major challenges involved, but I have by now a fair idea of how to tackle
most of them (and I don't have the time to detail them right now). Wikidata
has some design decisions inside it that are already geared towards enabling
the solution of some of the problems for this kind of wiki. Whatever a
structured Wiktionary would look like, it should also be aligned with the
requirements of the project outlined here. Basically, we take constraint b),
but make it possible to push the constraint further and further through the
community - that's how we could scale on this task.
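
To illustrate the serializer half with a deliberately tiny sketch (all
names and rules here are hypothetical): the surface-text rules live as
editable data rather than code, so the community could add languages and
refine grammar without touching the engine:

    # Rules as data: on such a wiki these would be editable pages, not code.
    RULES = {
        "species-stub": {
            "en": "{name} is a species of fish in the family {family}.",
            "fr": "{name} est une espèce de poisson de la famille des {family}.",
        },
    }

    def serialize(frame, lang):
        # Render the internal structure in one language's surface syntax.
        # (Slot values would themselves need per-language lexicalization -
        # one reason a structured Wiktionary matters.)
        return RULES[frame["type"]][lang].format(**frame["slots"])

    frame = {"type": "species-stub",
             "slots": {"name": "Pangasius", "family": "Pangasiidae"}}
    print(serialize(frame, "en"))
    print(serialize(frame, "fr"))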

This would be far away from solving the problem of automatic translation of
text, and even further away from understanding text. But given where we are
and the resources we have available, I think it would be a more feasible
path towards achieving the mission of the Wikimedia movement than tackling
the problem of general machine learning.

In summary, I see four calls for action right now (and for all of them this
means to first actually think more and write down a project plan and gather
input on that), that could and should be 

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mark

A brief addendum,

On 4/24/13 12:25 PM, Mark wrote:
From 2006 through 2012 [the ERC] allocated about $10m to kickstart 
open-source MT, though focused primarily on European languages, via 
the EuroMatrix (2006-09) and EuroMatrixPlus (2009-12) research projects.


Missed some projects. Seems the European Research Council is *really* 
pushing for this, with more like $20-25m overall. A few FP7 projects 
that may be useful to us:


* Let's MT! https://www.letsmt.eu/, which is supposed to organize 
resources to help organizations & companies build their own MT systems 
on open data and software, reducing reliance on closed-source cloud 
providers.


* MosesCore http://www.statmt.org/mosescore/index.php?n=Main.HomePage, 
focused mainly on improving Moses itself.


* The Multilingual Europe Technology Alliance 
http://www.meta-net.eu/meta-research/overview, a giant consortium that 
seems to have a commitment to liberal licensing 
http://www.meta-net.eu/meta-share/licenses


-Mark


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Fred Bauder
This is closely tied to software which is being developed, some of it
secretly, to enable machines to understand and use language. As of now
this will be government- and corporate-owned and controlled. I say "closely
tied" because that is how translation works; only someone or something
that understands language can translate perfectly.

That said, crude translations into little-used languages are nearly
worthless due to syntax issues. Useful work requires at least one person
fluent in the language.

Fred


 Could open source MT be such a strategic investment? I don't know, but
 I'd like to at least raise the question. I think the alternative will
 be, for the foreseeable future, to accept that this piece of
 technology will be proprietary, and to rely on goodwill for any
 integration that concerns Wikimedia. Not the worst outcome, but also
 not the best one.

 Are there open source MT efforts that are close enough to merit
 scrutiny? In order to be able to provide high quality result, you
 would need not only a motivated, well-intentioned group of people, but
 some of the smartest people in the field working on it.  I doubt we
 could more than kickstart an effort, but perhaps financial backing at
 significant scale could at least help a non-profit, open source effort
 to develop enough critical mass to go somewhere.

 All best,
 Erik

 [1]
 http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
 [2] https://developers.google.com/translate/v2/pricing
 --
 Erik Möller
 VP of Engineering and Product Development, Wikimedia Foundation

 Wikipedia and our other projects reach more than 500 million people every
 month. The world population is estimated to be 7 billion. Still a long
 way to go. Support us. Join us. Share: https://wikimediafoundation.org/

 ___
 Wikimedia-l mailing list
 Wikimedia-l@lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l




___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Andrew Gray
On 24 April 2013 11:35, Denny Vrandečić denny.vrande...@wikimedia.de wrote:

 If we constrain b) a lot, we could just go and develop pages to display
 for pages that do not exist yet based on Wikidata in the smaller
 languages. That's a far cry from machine translating the articles, but it
 would be a low hanging fruit. And it might help with a desire which is
 evidently strongly expressed by the mass creation of articles through bots
 in a growing number of languages.

There has historically been a lot of tension around mass-creation of
articles because of the maintenance problem - we can create two
hundred thousand stubs in Tibetan or Tamil, but who will maintain
them? Wikidata gives us the potential of squaring that circle, and in
fact you bring it up here...

 II ) develop a feature that blends into Wikipedia's search if an article
 about a topic does not exist yet, but we  have data on Wikidata about that
 topic

I think this would be amazing. A software hook that says "we know
article X does not exist yet, but it is matched to topic Y on Wikidata"
and pulls out core information, along with a set of localised
descriptions... we gain all the benefit of having stub articles
(scope, coverage) without the problems of a small community having to
curate a million pages. It's not the same as hand-written content, but
it's immeasurably better than no content, or even an attempt at
machine-translating free text.

XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
Vietnam]. It [grows to: 20 cm]. (pictures)
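
Even a naive renderer along those lines gets surprisingly far. A
hypothetical sketch of such a hook - the Special:EntityData endpoint is
real Wikidata, but the function name and stub pattern are my own invention:

    import requests

    def fallback_stub(qid, lang="en"):
        # Fetch the matched item's data from Wikidata and render a minimal
        # localized stub for a page that doesn't exist yet on this wiki.
        entity = requests.get(
            f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
        ).json()["entities"][qid]
        label = entity["labels"].get(lang, {}).get("value", qid)
        desc = entity["descriptions"].get(lang, {}).get("value")
        return f"{label}: {desc}." if desc else label

    print(fallback_stub("Q42", "fi"))  # e.g. a Finnish one-liner for Q42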

Wikidata Phase 4, perhaps :-)

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Pavlo Shevelo
 only someone or something
that understands language can translate perfectly

Precisely

 crude translations into little used languages are nearly
worthless due to syntax issues. Useful work requires at least one person
fluent in the language

It's very true!
Current Google MT tools are reasonably good for readers, as they really
provide a chance to grasp the meaning of a text, but they are far from
good as a writer's instrument, meaning the translation results are far
from fit to be published.

So it seems reasonable to promote MT as an instrument for visitors
(readers) of our projects, but not as a substitute for the Wikimedians
who are the contributors.


On Wed, Apr 24, 2013 at 2:03 PM, Fred Bauder fredb...@fairpoint.net wrote:

 This is closely tied to software which is being developed, some of it
 secretly, to enable machines to understand and use language. As of now
 this will be government and corporate owned and controlled. I say closely
 tied because that is how translation works; only someone or something
 that understands language can translate perfectly.

 That said, crude translations into little used languages are nearly
 worthless due to syntax issues. Useful work requires at least one person
 fluent in the language.

 Fred


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Nikola Smolenski

On 24/04/13 08:29, Erik Moeller wrote:

Could open source MT be such a strategic investment? I don't know, but
I'd like to at least raise the question. I think the alternative will
be, for the foreseeable future, to accept that this piece of
technology will be proprietary, and to rely on goodwill for any
integration that concerns Wikimedia. Not the worst outcome, but also
not the best one.

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.


A huge and worthwhile effort on its own, and anyway a necessary step for 
creating free MT software, would be to build a free (as in freedom) 
parallel translation corpus. This corpus could then be used as the 
starting point by people and groups who are producing free MT software, 
either under WMF or on their own.


This could be done by creating a new project where volunteers could 
compare Wikipedia articles and other free translated texts and mark 
sentences that are translations of other sentences. By the way, I 
believe Google Translate's corpus was created in this way.
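
To make that concrete, a sketch of what one volunteer-marked record might
look like (the field names are illustrative, not an existing schema):

    import json

    # One aligned sentence pair, with enough provenance to satisfy
    # CC BY-SA attribution and to let MT builders filter by language pair.
    pair = {
        "source_lang": "en",
        "target_lang": "fr",
        "source": "The cat sat on the mat.",
        "target": "Le chat était assis sur le tapis.",
        "source_page": "https://en.wikipedia.org/wiki/Example",
        "target_page": "https://fr.wikipedia.org/wiki/Exemple",
        "marked_by": "SomeVolunteer",
    }
    print(json.dumps(pair, ensure_ascii=False))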


Perhaps this could best be achieved by teaming up with www.zooniverse.org 
or www.pgdp.net, who have experience with this kind of project. This would 
require specialized non-wiki software, and I don't think that the 
Foundation has enough experience in developing it.


(By the way, other things that could be similarly useful include free 
OCR training data or free fully annotated text.)


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mathieu Stumpf

On 2013-04-24 08:29, Erik Moeller wrote:

Are there open source MT efforts that are close enough to merit
scrutiny? In order to be able to provide high quality result, you
would need not only a motivated, well-intentioned group of people, but
some of the smartest people in the field working on it.  I doubt we
could more than kickstart an effort, but perhaps financial backing at
significant scale could at least help a non-profit, open source effort
to develop enough critical mass to go somewhere.


I would like to add (I'm no specialist in this subject) that 
translating natural language probably needs at least a large set of 
existing translations, if only to get rid of obvious well-known 
idioms, like "kitchen sink" translated as "usine à gaz" when you are 
speaking of software, for example. In this regard, we probably have 
such a base with Wikisource. What do you think?





All best,
Erik

[1]

http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Wikipedia and our other projects reach more than 500 million people every
month. The world population is estimated to be 7 billion. Still a long
way to go. Support us. Join us. Share: https://wikimediafoundation.org/


___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Andrea Zanni
On Wed, Apr 24, 2013 at 2:04 PM, Mathieu Stumpf 
psychosl...@culture-libre.org wrote:

 I would like to add (I'm no specialist in this subject) that translating
 natural language probably needs at least a large set of existing
 translations, if only to get rid of obvious well-known idioms, like
 "kitchen sink" translated as "usine à gaz" when you are speaking of
 software, for example. In this regard, we probably have such a base with
 Wikisource. What do you think?


Personally, I think this is an awesome idea :-)
Wikisource corpora could be a huge asset in developing this.
We already host different public-domain translations and, in the future, we
hope, more and more Wikisources will allow user-generated translations.

At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.
As part of our project (
http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate to study the reinsertion of proofread text
into DjVu files [1], but so far hasn't found any interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.

Aubrey

[1] We thought about this both for OCR enhancement purposes and for
updating files on Commons and the Internet Archive (which is off topic here).
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Mathieu Stumpf

On 2013-04-24 12:35, Denny Vrandečić wrote:
3) Wiktionary could be an even more amazing resource if we would finally
tackle the issue of structuring its content more appropriately. I think
Wikidata opened a few venues to structure planning in this direction and
provide some software, but this would have the potential to provide more
support for any external project than many other things we could tackle


If you have any information or ideas related to Wiktionary structuring, 
please share them on https://meta.wikimedia.org/wiki/Wiktionary_future



One idea I have been mulling over for years is basically how can we use
this advantage for the task of creating content available in many
languages. Wikidata is an obvious attempt at that, but it really goes only
so far. The system I am really aiming at is a different one, and there has
been plenty of related work in this direction: imagine a wiki where you
enter or edit content, sentence by sentence, but the natural language
representation is just a surface syntax for an internal structure.


I don't understand what you mean. To begin with, I doubt that the sentence 
is the right scale at which to translate natural-language discourse. Sure, 
sometimes you may translate one word with one word in another language. 
Sometimes you may translate a sentence with one sentence. Sometimes you 
need to grab the whole paragraph, or even more, and sometimes you need a 
whole cultural background to get the meaning of a single word in the 
current context. To my mind, natural languages deal with more than 
context-free language. Could a static "internal structure" deal with 
such dynamics?



Your editing interface is a constrained, but natural language.


This is really where I don't see how you hope to manage that.


Now, in order to
really make this fly, both the rules for the parsers (interpreting the
input) and the serializer (creating the output) would need to be editable
by the community - in addition to the content itself. There are a number of
major challenges involved, but I have by now a fair idea of how to tackle
most of them (and I don't have the time to detail them right now).


Well, I'll be curious to have more information, like references I should 
read. Otherwise I'm afraid that what you say sounds like Fermat's 
Last Theorem [1] and the famous margin that was too small to contain 
Fermat's alleged proof of it.



[1] https://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem


--
Association Culture-Libre
http://www.culture-libre.org/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Samuel Klein
I really like Erik's original suggestion, and these ideas, Denny.

Since there are many different possible goals, it's worth having a
page just to list them all and compare them
- both how they fit with one another and how they fit with existing
active projects elsewhere on the web.

SJ

On Wed, Apr 24, 2013 at 6:35 AM, Denny Vrandečić
denny.vrande...@wikimedia.de wrote:
 Erik, all,

 sorry for the long mail.

 [...]

Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread Leslie Carr
(FYI this is me speaking with my personal hat on, none of these
opinions are official in any way or the opinions of the foundation as
an organization)

<personal_hat>


 While Wikimedia is still only a medium-sized organization, it is not
 poor. With more than 1M donors supporting our mission and a cash
 position of $40M, we do now have a greater ability to make strategic
 investments that further our mission, as communicated to our donors.
 That's a serious level of trust and not to be taken lightly, either by
 irresponsibly spending, or by ignoring our ability to do good.

 Could open source MT be such a strategic investment? I don't know, but
 I'd like to at least raise the question. I think the alternative will
 be, for the foreseeable future, to accept that this piece of
 technology will be proprietary, and to rely on goodwill for any
 integration that concerns Wikimedia. Not the worst outcome, but also
 not the best one.

I think that while supporting open source machine translation is an
awesome goal, it is out of scope of our budget and the engineering
budget could be better spent elsewhere, such as with completing
existing tools that are in development, but not
deployed/optimized/etc.  I think that putting a bunch of money into
possibilities isn't the right thing to do when we have a lot of
projects that need to be finished and deployed yesterday.  Maybe once
there's a closer actual project we could support them with text
streams, decommissioned machines, and maybe money, but only after it's
a pretty sure investment

</personal_hat>

Leslie


 Are there open source MT efforts that are close enough to merit
 scrutiny? In order to be able to provide high quality result, you
 would need not only a motivated, well-intentioned group of people, but
 some of the smartest people in the field working on it.  I doubt we
 could more than kickstart an effort, but perhaps financial backing at
 significant scale could at least help a non-profit, open source effort
 to develop enough critical mass to go somewhere.

 All best,
 Erik

 [1] 
 http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
 [2] https://developers.google.com/translate/v2/pricing
 --
 Erik Möller
 VP of Engineering and Product Development, Wikimedia Foundation

 Wikipedia and our other projects reach more than 500 million people every
 month. The world population is estimated to be 7 billion. Still a long
 way to go. Support us. Join us. Share: https://wikimediafoundation.org/

 ___
 Wikimedia-l mailing list
 Wikimedia-l@lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



--
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/

___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l


Re: [Wikimedia-l] The case for supporting open source machine translation

2013-04-24 Thread George Herbert
Leslie Carr wrote (personally, not officially):

I think that while supporting open source machine translation is an
 awesome goal, it is out of scope of our budget and the engineering
 budget could be better spent elsewhere, such as with completing
 existing tools that are in development, but not
 deployed/optimized/etc.  I think that putting a bunch of money into
 possibilities isn't the right thing to do when we have a lot of
 projects that need to be finished and deployed yesterday.  Maybe once
 there's a closer actual project we could support them with text
 streams, decommissioned machines, and maybe money, but only after it's
 a pretty sure investment


I don't think that it's a good idea to shift resources to it immediately,
but I think that every now and then it's very healthy to step back and ask
"What is standing between our users and the information they seek?  What is
standing between our editors and the information they want to update?"
Generically, the "customers and customer goals" problem, applied to WMF's
two customer sets (readers and editors).

Minor UI changes help readers.  Most of the other changes are
editor-focused: retention, ease of editing, or various other things
related to that.  A few are strategic-data-organization related, which is
more of a multiplier effect.

The readers and potential readers ARE however clearly disadvantaged by
translation issues.

I see this discussion and consideration as strategic; not planning (year,
six-month) timescales or tactical (month, week) timescales, but a
multi-year "What are our main goals for information access?" timescale.

We can't usefully help with internet access (and that's proceeding at a good
pace even in the third world), but language will remain a barrier when
people get access.  In a few situations politics / firewalling is a barrier as
well (China, primarily), which is another strategic challenge.  That, however,
is political and geopolitical, and not an easy nut for WMF to crack.  Of
the three issues - net, firewalling, and language - the last is something
we can work on.  We should think about how to work on that.  MT seems like
an obvious answer, but not the only possible one.




On Wed, Apr 24, 2013 at 12:29 PM, Leslie Carr lc...@wikimedia.org wrote:

 (FYI this is me speaking with my personal hat on, none of these
 opinions are official in any way or the opinions of the foundation as
 an organization)

 [...]
 
  ___
  Wikimedia-l mailing list
  Wikimedia-l@lists.wikimedia.org
  Unsubscribe: