Hi Aigars,

I think that this discussion is very important.

We need to ensure that Pootle is capable of handling the large amounts 
of information that Debian needs, which is probably not the problem, but 
it must also handle all the processes that Debian needs. The solution 
might lie either in file handling or in databases.

First, it is important to understand the complexity of the data that we 
are handling. We are not only talking about a set of source strings and 
their translations, associated in files. We are talking about managing 
process information to optimize the result of the translators' work. 
Each XLIFF file not only contains strings and information about them; it 
might also contain a glossary, translation memory information, comments 
from translators or reviewers, information about the results of tests 
run on each string, data for connection to SVN... and process 
information: a series of phases through which each file has already gone 
(translation-review-approval-update-translation-review-approval...), 
with each message associated to a given phase. We can also have 
translations of the same message into other languages, as reference. 
XLIFF files might also include counters that give information about the 
state of the file, without having to recalculate it.

All this information is easy to store in XML, but it would require quite 
a complex database.
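
To give an idea of what that complexity could look like, here is a very 
rough sketch of the kind of schema it might take (Python/sqlite3; all 
the table and column names are my own invention, not an existing or 
planned Pootle design):

import sqlite3

# Hypothetical schema sketch only -- the table and column names are my
# own invention, not an existing or planned Pootle design.
conn = sqlite3.connect("pootle-cache.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE,            -- location of the XLIFF file
    translated INTEGER,          -- cached counters, no recalculation needed
    fuzzy INTEGER,
    untranslated INTEGER
);
CREATE TABLE IF NOT EXISTS units (
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES files(id),
    source TEXT,
    target TEXT,
    state TEXT                   -- e.g. 'new', 'translated', 'reviewed'
);
CREATE TABLE IF NOT EXISTS phases (  -- translation/review/approval history
    id INTEGER PRIMARY KEY,
    file_id INTEGER REFERENCES files(id),
    name TEXT,                   -- 'translation', 'review', 'approval', ...
    date TEXT,
    contact TEXT
);
CREATE TABLE IF NOT EXISTS notes (   -- comments from translators or reviewers
    unit_id INTEGER REFERENCES units(id),
    origin TEXT,                 -- 'translator', 'reviewer', 'qa-check'
    text TEXT
);
CREATE TABLE IF NOT EXISTS alt_trans (  -- TM matches, other languages, etc.
    unit_id INTEGER REFERENCES units(id),
    lang TEXT,
    target TEXT
);
""")
conn.commit()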

My belief is that the process that will use most time in Pootle is the 
process of merging two files, which must happen when a file is committed 
to SVN, when it is uploaded to Pootle by a translator... or when a new 
POT/XLIFF file is uploaded to Pootle for updating all translations of a 
given package (much more efficient than doing all the languages one by 
one against CVS). If the data is in a database, then at least one file 
does not need to be parsed every time the process runs, and the process 
would probably be faster, but there are many other factors that could 
become more complicated because of the DB. Updates take place at 
non-critical times, but user requests for files must be responded to 
immediately. If all the files need to be created before being served to 
the user, this process might take longer than the user is prepared to 
wait (I don't know).
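
As an illustration of what that merge has to do (just a sketch of the 
general idea, not Pootle's actual merge code), the core of it is 
carrying existing translations into a new template, keyed on the source 
string:

def merge(template_units, translated_units):
    """Carry existing translations into a new template.

    template_units:   list of source strings from the new POT/XLIFF template
    translated_units: dict mapping source string -> existing translation
    Returns a list of (source, target, state) tuples.
    """
    merged = []
    for source in template_units:
        if source in translated_units:
            merged.append((source, translated_units[source], "translated"))
        else:
            # No exact match: a real implementation would try fuzzy
            # matching here before giving up.
            merged.append((source, "", "untranslated"))
    return merged

# Example: updating one package template against one language's translations.
new_template = ["Save", "Open file", "Quit"]
existing = {"Save": "Guardar", "Quit": "Salir"}
print(merge(new_template, existing))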

My personal conclusion is that this is something that we really need to 
look at, and I am very happy that you and other people are getting into 
it... but it is not something that should be resolved now in order to 
start Guntaitas' project; there is too much at stake to rush a design 
decision that will affect the whole future of Pootle. I would very much 
prefer that we -in this list- analyse the issue much further and come to 
the right conclusion, which we will then implement, as we are as 
interested as you are in making sure that Pootle scales and can respond 
to Debian's needs, which means that it will be able to respond to the 
needs of any other FOSS project.

As Christian has proposed, I think that if we can get separation of 
front-end and back-end now, and write the API, we will be able later (or 
in parallel) to store in databases all the information that we think 
might help create a better Pootle.
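
As a sketch of what such an API could look like (the class and method 
names here are purely illustrative, not an agreed design), the front-end 
would only ever talk to an abstract back-end, and we could then plug in 
a file-based or database-based implementation behind it:

from abc import ABC, abstractmethod

class TranslationBackend(ABC):
    """Hypothetical back-end interface; names are illustrative only."""

    @abstractmethod
    def get_units(self, path):
        """Return the translation units stored for one file."""

    @abstractmethod
    def update_target(self, path, source, target):
        """Store a new translation for one source string."""

    @abstractmethod
    def export_file(self, path, fmt="xliff"):
        """Serialise the stored data back to XLIFF (or PO) for download."""

class XliffFileBackend(TranslationBackend):
    ...  # would read and write the XLIFF files on disk

class DatabaseBackend(TranslationBackend):
    ...  # would keep the same information in a database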

I also think that we should immediately start an analysis of what 
information might be interesting to have in a database and which 
information should stay in the XLIFF files. It might even be interesting 
to have the same information in both formats (every time an XLIFF file 
is created or modified, the info is stored in a database, which would 
work as a cache).
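
A minimal sketch of that cache idea, assuming a simple 'stats' table and 
using the file's modification time to decide when the cached counters 
are stale (again, all names are made up):

import os
import sqlite3

conn = sqlite3.connect("pootle-cache.db")
conn.execute("""CREATE TABLE IF NOT EXISTS stats (
    path TEXT PRIMARY KEY, mtime REAL,
    translated INTEGER, fuzzy INTEGER, untranslated INTEGER)""")

def refresh_cache(path, translated, fuzzy, untranslated):
    # Refresh the cached counters only when the XLIFF file on disk changed.
    mtime = os.path.getmtime(path)
    row = conn.execute("SELECT mtime FROM stats WHERE path = ?",
                       (path,)).fetchone()
    if row and row[0] == mtime:
        return  # cache still valid, nothing to do
    conn.execute(
        "REPLACE INTO stats (path, mtime, translated, fuzzy, untranslated) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, mtime, translated, fuzzy, untranslated))
    conn.commit()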

More comments below

Aigars Mahinovs wrote:

>In my opinion it would be quite problematic to implement the
>distributed version of this system by distributing the backend - that
>would totally bypass all the permissions and would cause all sorts of
>trust issues.
>
>It would be much more logical to have XML RPC or something like that
>and have the synchronisation processes launched by cron on a regular
>basis and have the incoming data streams processed in accordance with
>the local rules. For example, messages from a trusted localisation
>team server could be integrated directly, but messages from Rosetta
>would go via some kind of approval dependent on the localization team
>practices.
>  
>
I think that you are right, this might be a very good way of doing it.
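
As a sketch of how such a cron-driven synchronisation could look (the 
server URL and the remote method name below are invented for 
illustration, not a real interface):

import xmlrpc.client

# Hypothetical trusted team servers; anything else goes through review.
TRUSTED = {"http://l10n.example-team.org/xmlrpc"}

def sync(source_url, apply_directly, queue_for_review):
    server = xmlrpc.client.ServerProxy(source_url)
    # get_updates() is a made-up remote method name, for illustration only.
    for update in server.get_updates("debian-installer", "lv"):
        if source_url in TRUSTED:
            apply_directly(update)    # trusted team: integrate directly
        else:
            queue_for_review(update)  # e.g. strings coming from Rosetta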

>
>I imagine that the number of times we need to write one string to the file
>(making or updating a translation) outnumbers the number of times we
>need to get the full file (download of the result) in the order of
>1000:1. And I also imagine that creating a PO file from said XLIFF
>will take just as much time as making it from a database (or even
>more).
>  
>
I think that people will tend to work offline, and therefore manage 
files. The system is being developed for native use of XLIFF files, 
which makes translation editors much easier for translators to use; 
creating PO files would only be for people who still do not want to 
change, for whatever reason.
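
Generating a PO file on demand for those people is conceptually simple; 
a real implementation would use the Translate Toolkit converters, but a 
rough sketch of the idea is:

def units_to_po(units):
    # Rough sketch only: a real implementation would use the Translate
    # Toolkit converters and handle escaping, plurals, headers, etc.
    lines = []
    for source, target, comments in units:
        for comment in comments:
            lines.append("#. %s" % comment)
        lines.append('msgid "%s"' % source.replace('"', '\\"'))
        lines.append('msgstr "%s"' % target.replace('"', '\\"'))
        lines.append("")
    return "\n".join(lines)

print(units_to_po([("Save", "Guardar", ["button label"])]))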

>  
>
>>>The CPU may be more occupied in doing fuzzy matching of strings. I'm not
>>>sure the fuzzy matching algorithm can use some kind of cache in a
>>>database. (The number of fuzzy matching operations is more than
>>>proportional to the number of strings - which IMHO better reflects the
>>>size of the translation server than the number of the simultaneous users
>>>triggering write operations)
>>>      
>>>
>>The CPU is most occupied at startup, indexing and checking files.  This
>>would not change at all with a DB.  That needs to be backgrounded.
>>    
>>
>
>This would be completely eliminated by the DB, because the DB engine
>would be doing those tasks using highly optimised C and assembler
>code.
>  
>
We need to understand the processes here a little better and understand 
the need. Of course, any operation inside a database would be faster, 
there is no doubt of that, but there are a number of other things that 
need to be taken into account. If the result is that a DB is faster and 
does not make things too complicated, databases it should be...
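
For reference, the fuzzy-matching workload mentioned above looks roughly 
like this (a naive sketch using Python's difflib, not what Pootle 
actually does); the point is that the cost grows with the number of new 
strings times the number of known strings, whichever way they are stored:

import difflib

def best_fuzzy_match(source, candidates, cutoff=0.6):
    # Compare one new string against every known source string and keep
    # the closest one above a threshold.
    best, best_ratio = None, cutoff
    for candidate in candidates:
        ratio = difflib.SequenceMatcher(None, source, candidate).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best, best_ratio

print(best_fuzzy_match("Open the file",
                       ["Open a file", "Close the file", "Quit"]))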

>
>Well, we need to think of the database schema in the way to use as
>much processing as possible on the database side.
>
>One other thing about the database backend is that you can easily move
>the database to another server from the Pootle itself and also
>database software can be easily distributed to several servers if
>there is any kind of bottleneck there.
>  
>
This is definitely true, and can make the Pootle/DB server pair very 
powerful. We really need to look at this, and try to make a plan to 
export as many tasks as possible to a DB server, while making sure that 
we do not end up with an over-complicated structure that later becomes 
too complicated to use or maintain.
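
As one example of pushing work to the database side (sketch only, 
reusing the hypothetical 'files' and 'units' tables from above), the 
per-file statistics could be computed entirely by the DB engine with a 
single query:

import sqlite3

# Let the database do the counting instead of parsing every file.
conn = sqlite3.connect("pootle-cache.db")
query = """
    SELECT f.path,
           SUM(CASE WHEN u.state = 'translated' THEN 1 ELSE 0 END),
           SUM(CASE WHEN u.state != 'translated' THEN 1 ELSE 0 END)
    FROM files f JOIN units u ON u.file_id = f.id
    GROUP BY f.path
"""
for path, translated, remaining in conn.execute(query):
    print(path, translated, remaining)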

Javier


