Re: [Sugar-devel] GSoC Translation Server Proposal

2013-04-25 Thread Erik Price

Hey Aneesh,

Thanks a lot for going through it. I'll try to answer your questions and
clarify a bit for anyone else who's interested.

Comments / criticisms from others on the mailing list are very welcome!

I do apologize for the length of my messages; I wanted to be as specific
as possible.

 It would be good if you could add the expected time you'll require to
 complete each of these phases below. Also, leave a buffer of 3 days
 between phases for review + feedback + other additions.


Okay, will do. I think I'll have to wait until I figure out exactly what
will be required in order to make an informed analysis of the time, but
I expect the server / translation backend components to occupy the first
half of the allocated time, finishing up some time around the midterm
assessments. The client API should be a much simpler process, and will
hopefully take no longer than 2 weeks.

Again, I'll refine this to be far more specific when some more decisions
are finalized.


 ...snip...

 The first of these plug-ins would be using Apertium, the FOSS project
 already used by Sugar through the #meeting-es irc channel on
 freenode. Next, Bing Translate will likely be added, due to it being
 one of the major web translators that provides a free API key.


 Just FYI: Bing only provides free service for up to 2 million
 characters per month. (
 https://datamarket.azure.com/dataset/1899a118-d202-492c-aa16-ba21c33c06cb)


Yeah, and the cost for the next 2 million ($40 USD) is pretty
prohibitive in my view. Still, 2 million characters per month is not
terrible if it's being used by a relatively small distribution of
XOs / students.

I do agree though that if there is a freer service, it should take
precedence.

 How about Babelfish? They don't have an API, but there is nothing which
 prevents you from creating one. And it seems like 20 lines of python code
 to me.


I'm assuming you're talking about babelfish.com, since there appear to be
a couple of services named babelfish.

I didn't notice anything in their terms of use that prohibited screen
scraping, so this may well be a service worth looking into. I'm
not at all familiar with the quality of the translations though.

I'm also a little wary of the fact that there appears to be so little
information on them as a service. It seems that for all intents and
purposes, the site barely exists.

There's also babelfish.de which looks somewhat promising, but prohibits
using automated scripts to grab data without the owner's permission.

 Google Translate is another high priority service due to its quality,
 but will not be added initially because its API has no free tier for
 usage.


 Do their terms and conditions state that we can't make more than n
 requests? I read some threads on SO where people mentioned that they used
 some PHP code to make POST requests to the Google Translate server. I just
 want to know whether this would be illegal or would violate some of their
 terms and conditions.


Google deactivated their free Translate API a few years back, so there's
no official way to get a free translation from the service any more. It
is possible (and very easy) to screen scrape Google Translate, but it's
explicitly prohibited by the terms of service. Going against that would
look bad for Sugar Labs, especially considering that this would be a
Google Summer of Code project.

I personally am very much against the idea of going around any terms of
service agreement, but if someone really wanted to, there's nothing
stopping someone from developing a third-party plug-in for the server
independently of this project.

 - How do the clients become aware of the server? Is it configured, or
 is there some kind of auto-detection?


 I'd say we set up a public domain and hardcode it in the code!


I'm not sure I understand your intent here. Do you mean having
only one server globally? In my mind, that would undermine the goal of
having a server in the first place.

Sure, having a default server run by Sugar Labs (or whoever) would
make things more convenient, but I don't think it should be the only
service. Part of the idea of this project is to allow users to
create their own servers customized to what they need to do.

Obviously having every activity that uses the API rediscover the
translation server is far from ideal, and would result in a lot of
duplication. I'm not sure here; perhaps discovery could somehow be tied
into the Jabber server the XO is connected to?

 - Is it reasonable to establish large servers with more resources to
 be used by XO users who may not have access to a server or the
 technical abilities to manage one? How would abuse be prevented?


 What abuse?

By abuse, I mean someone taking advantage of the server to obtain
unlimited translations. This is essentially only a problem when
the server operator uses non-free translation services that are
rate-limited or capped by number of characters.
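
As a rough illustration of what such a limit could look like (a sketch
only, nothing here is decided), a server could keep a per-client
character budget. The class and method names below are hypothetical, and
a real deployment would also need to persist the counters and reset them
each billing period:

```python
class QuotaExceeded(Exception):
    """Raised when a client has used up its character budget."""


class CharacterQuota:
    """Hypothetical per-client character budget (in-memory only)."""

    def __init__(self, limit_per_client):
        self.limit = limit_per_client
        self.used = {}  # client id -> characters already translated

    def charge(self, client_id, text):
        """Record a request and return the characters still available."""
        spent = self.used.get(client_id, 0) + len(text)
        if spent > self.limit:
            # Refuse the request without recording it.
            raise QuotaExceeded("%s is over its budget" % client_id)
        self.used[client_id] = spent
        return self.limit - spent
```

The server would call `charge()` before forwarding text to a paid
backend, and could fall back to a free backend when `QuotaExceeded` is
raised.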

You also obviously don't want malicious 

[Sugar-devel] GSoC Translation Server Proposal

2013-04-23 Thread Erik Price

Hi everyone,

I'm interested in working with Sugar Labs for this year's Google Summer of Code,
and I wanted to get some feedback on my project proposal before I
actually submit my application.

I apologize for the relatively late introduction; I've been discussing this idea
on IRC since last week, but hadn't thought to put it up on the mailing list
until today.

That said, comments / suggestions / feedback on the idea would be greatly
appreciated.


Pluggable Translation Server - GSoC Idea Proposal
=================================================

Sugar and OLPC are global projects, so internationalization is a central tenet
of both. The aim of this project is to establish a server program and client API
that activities can use to reliably obtain quality machine translations of
arbitrary strings.

Overview
--------

Since accurate machine translation is expensive in both computation and memory,
it is not reasonable to expect good results from running it directly on
an XO. A server that supplies translations to a larger network of XOs is
therefore the preferable solution.

Not all translators are equally good for every language pair, and some may
not be usable in a given situation (due to hardware, software, monetary, or
other limitations). It is therefore advantageous to give the translation server
the ability to access multiple services via a plug-in architecture.

For example, Google Translate will likely offer very high quality translations
for many language pairs, but the associated cost of $20 USD per million
translated characters through the API means that it is irresponsible to require
it. Likewise, a FOSS project such as Apertium may well provide good es-en
translations, but has no way of translating e.g. de-ru, which limits its global
usefulness.

To overcome these obstacles, pooling all possible translation sources into a
single server allows a convenient and consistent means of providing reliable
machine translation for any purpose.

Plan
----

This is a very general overview of my plan for finishing the project, and how it
will be split up. It will be split into appropriate weekly goals based on the
feedback I get regarding this initial division of work.

### Phase I

The first order of business for this project is to establish a
minimal-dependency Python HTTP server application with a plug-in architecture,
so that any interested developer can add machine translation backends later on
in the project.

Along with this, some initial backends will of course need to be created. I plan
to add one that would run on the same server, and one that would use a web
service, to ensure the robustness and generality of the server architecture.
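
As a rough sketch of the plug-in idea (not a finalized design), each backend
could expose the same two methods, and the server could try the registered
backends in order until one succeeds. Every class and method name below is
hypothetical, the code assumes Python 3, and the two backends are toy stand-ins
that make no real Apertium or Bing calls:

```python
class TranslationError(Exception):
    """Raised when no registered backend can handle the request."""


class Backend:
    """Hypothetical base class that every translator plug-in implements."""

    name = "base"

    def supports(self, source, target):
        raise NotImplementedError

    def translate(self, text, source, target):
        raise NotImplementedError


class DictionaryBackend(Backend):
    """Toy stand-in for an engine running on the same machine."""

    name = "toy-local"

    def __init__(self, pairs):
        # pairs maps (source, target) -> {word: translated word}
        self.pairs = pairs

    def supports(self, source, target):
        return (source, target) in self.pairs

    def translate(self, text, source, target):
        table = self.pairs[(source, target)]
        # Word-by-word lookup; unknown words pass through unchanged.
        return " ".join(table.get(word, word) for word in text.split())


class FailingBackend(Backend):
    """Toy stand-in for a web service that is currently unreachable."""

    name = "toy-web"

    def supports(self, source, target):
        return True

    def translate(self, text, source, target):
        raise IOError("service unavailable")


class Registry:
    """Keeps backends in priority order and falls back on failure."""

    def __init__(self):
        self.backends = []

    def register(self, backend):
        self.backends.append(backend)

    def translate(self, text, source, target):
        for backend in self.backends:
            if not backend.supports(source, target):
                continue
            try:
                return backend.name, backend.translate(text, source, target)
            except Exception:
                # A real server would log the failure before moving on.
                continue
        raise TranslationError("no backend for %s-%s" % (source, target))
```

Registering `FailingBackend()` first and a `DictionaryBackend` second shows the
fallback: the request is answered by the second backend once the first raises.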

The first of these plug-ins would be using Apertium, the FOSS project already
used by Sugar through the #meeting-es irc channel on freenode. Next, Bing
Translate will likely be added, due to it being one of the major web translators
that provides a free API key.

Google Translate is another high priority service due to its quality, but will
not be added initially because its API has no free tier for usage.

Plug-ins for other services will be considered for any time remaining at
the end of the project, but these are of course far lower priority than the
initial two backends, and will only be added during GSoC if possible. (If not,
I'll likely add them after GSoC has finished.)

Though not yet finalized, the server will most likely use RESTful HTTP and JSON
responses to make it easily accessible from any programming language that wants
to interact with it.
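
Since the protocol is not finalized, the following is only a sketch of what a
JSON-over-HTTP endpoint could look like using the Python 3 standard library.
The `/translate` path, the `text`, `from` and `to` query parameters, and the
shape of the JSON body are all assumptions made for illustration, and
`fake_translate` stands in for the real backend dispatch:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def fake_translate(text, source, target):
    """Placeholder for backend dispatch; it only tags the input text."""
    return "[%s-%s] %s" % (source, target, text)


def build_response(path):
    """Map a request path to (HTTP status, JSON-serializable dict)."""
    url = urlparse(path)
    if url.path != "/translate":
        return 404, {"error": "unknown endpoint"}
    params = parse_qs(url.query)  # blank values are dropped by default
    try:
        text = params["text"][0]
        source = params["from"][0]
        target = params["to"][0]
    except KeyError:
        return 400, {"error": "text, from and to are required"}
    return 200, {"from": source, "to": target,
                 "translation": fake_translate(text, source, target)}


class TranslateHandler(BaseHTTPRequestHandler):
    """Serializes the result of build_response() as a JSON reply."""

    def do_GET(self):
        status, payload = build_response(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the sketch server until interrupted."""
    HTTPServer(("localhost", port), TranslateHandler).serve_forever()
```

Keeping the request logic in `build_response()` rather than in the handler
makes it testable without opening a socket; calling `serve()` starts the
server on the given port.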

### Phase II

The next leg in the project will involve creating a Python client API to request
and receive translations from a given server. The API will of course be designed
before any coding starts on the server, and will be as generic and
straightforward to use as possible, so it can be used easily and efficiently
even outside of the Sugar environment.

From the point of view of the client API, the backend the server is using to
actually translate the text is unimportant. It will just send a call to the
server, specifying the language pair and source text, and receive a resultant
string, or appropriate error. The server will handle selecting the appropriate
translator and any fallbacks that may be needed.

The API user need only specify the source text and the language pair to
translate in order to interact with the server.
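
To make the description above concrete, here is a minimal sketch of such a
client in Python 3. It assumes a server that answers `GET /translate` with a
JSON object containing a `translation` key; that endpoint, the parameter
names, and the `TranslationClient` class itself are hypothetical, since the
real API has not been designed yet:

```python
import json
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import urlopen


class TranslationClientError(Exception):
    """Raised when the server cannot be reached or returns a bad reply."""


class TranslationClient:
    """Hypothetical client: the caller never sees which backend is used."""

    def __init__(self, base_url, opener=urlopen, timeout=10):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # replaceable so the client can be tested offline
        self.timeout = timeout

    def build_url(self, text, source, target):
        query = urlencode({"text": text, "from": source, "to": target})
        return "%s/translate?%s" % (self.base_url, query)

    def translate(self, text, source, target):
        """Return the translated string or raise TranslationClientError."""
        url = self.build_url(text, source, target)
        try:
            with self._open(url, timeout=self.timeout) as response:
                payload = json.loads(response.read().decode("utf-8"))
        except (URLError, ValueError) as exc:
            # URLError covers HTTP errors; ValueError covers malformed JSON.
            raise TranslationClientError(str(exc))
        if "translation" not in payload:
            raise TranslationClientError(payload.get("error", "bad reply"))
        return payload["translation"]
```

An activity would then only need something like
`TranslationClient("http://server.example").translate("hola", "es", "en")`,
where the server address is a placeholder.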

### Phase III

This stage of the project involves adding the client API to the Chat
activity. As Chat is a very simple activity, this should not take much time at
all. A new Translate activity will be developed in addition.

This activity will be very simple: functional, but essentially a
demo of how to use the client API and server. This will also allow me to give
some additional real world testing to the programs, so that any potential issues
can be caught while there's