Hi everyone

Or rather, hi programmers. This is a nice technical post for keen hackers.

Virtaal makes an appearance quite late in the show, but it's useful to 
read the entire post to understand where I'm going with this.

**

In Pootle, you can filter information in lots of interesting ways. Here 
are three scenarios:

   * Display all files in a goal,
   * Go through all units which fail a particular quality check,
   * Search for a string in the targets of all units in a directory.

Some of these can be combined, and in the future, it might be useful to 
allow the user to combine all of these to search for data.

**

Today, we store our indexing information in three places:

   * Goals and assignments are stored in the Django database,
   * Quality checks and stats are stored in the stats database,
   * Text indices are stored in Xapian/Lucene.

This leads to inefficiencies:

   * When searching for a string within a goal, Pootle gets the list of 
filename-unit pairs from the text indexer in which to search, but then, 
for each filename, it has to hit the Django database to check whether 
the filename is part of the current goal.
   * When searching for all units that fail a quality check within a 
goal, Pootle gets a list of filenames that fall within the goal, and 
then, for each file, has to hit the stats database to check whether the 
file contains any units that fail the current quality check.

**

How do we solve this? It depends which component we're focusing on.

DENORMALIZATION

The text indexing engine (i.e. Xapian, Lucene, etc.) doesn't need to be 
100% consistent with our data (we want consistency of course, but Pootle 
won't break horribly if things are a bit out of sync).

Thus, we can duplicate stats, goal and assignment data into the text 
indexing engine. This is very convenient from a search perspective, 
since the user can do very complex searches which will only hit the text 
indexing engine.
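
To make the idea concrete, here is a rough sketch of what a denormalized index entry could look like. It does not use the Xapian or Lucene APIs; ToyIndex is a made-up in-memory stand-in, and the field names ("target", "goal", "check") as well as the goal and check names are only assumptions for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory stand-in for a text indexer (not Xapian/Lucene)."""

    def __init__(self):
        # (field, term) -> set of document ids
        self._postings = defaultdict(set)
        # document id -> keys it was indexed under, so it can be replaced
        self._doc_keys = {}

    def index_unit(self, filename, unit_index, target, goals=(), failed_checks=()):
        """Index one unit, replacing any earlier version of the same unit."""
        doc_id = (filename, unit_index)
        self.remove(doc_id)
        keys = {("target", word.lower()) for word in target.split()}
        # Denormalized copies of data whose master copy lives elsewhere.
        keys |= {("goal", goal) for goal in goals}
        keys |= {("check", check) for check in failed_checks}
        for key in keys:
            self._postings[key].add(doc_id)
        self._doc_keys[doc_id] = keys

    def remove(self, doc_id):
        for key in self._doc_keys.pop(doc_id, ()):
            self._postings[key].discard(doc_id)

    def search(self, **criteria):
        """Return ids of documents that match every field=term criterion."""
        result = None
        for field, term in criteria.items():
            if field == "target":
                term = term.lower()
            matches = self._postings.get((field, term), set())
            result = set(matches) if result is None else result & matches
        return result if result is not None else set()


index = ToyIndex()
index.index_unit("gnome/fr.po", 0, "Ouvrir le fichier",
                 goals=["release-1.0"], failed_checks=["endpunc"])
index.index_unit("gnome/fr.po", 1, "Fermer le fichier", goals=["release-1.0"])
index.index_unit("kde/fr.po", 0, "Ouvrir le fichier")

# One lookup answers "string + goal + failing check" without touching
# the Django database or the stats database.
hits = index.search(target="fichier", goal="release-1.0", check="endpunc")
```

The point is only that a combined search becomes a set intersection inside one engine; a real indexer would handle tokenizing, ranking and persistence itself.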

MERGING DATABASES

By storing stats information in the database used by Django, we can do 
complicated stats queries directly in Django's database.

The current design, in which stats are associated with individual 
filenames, gets in the way of this: it's not possible to do queries over 
groups of files. It's also expensive, since we need to hit the stats 
database once per file to get stats information for multiple files.
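
As a sketch of what a merged database buys us, the following uses the standard sqlite3 module with two made-up tables, goal_file and file_stats. The table names, columns and sample numbers are assumptions for illustration and not Pootle's actual schema. The stats for a whole goal come back from a single query instead of one stats lookup per file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: goal membership and stats side by side.
    CREATE TABLE goal_file (goal TEXT NOT NULL, filename TEXT NOT NULL);
    CREATE TABLE file_stats (
        filename TEXT PRIMARY KEY,
        total_units INTEGER NOT NULL,
        translated_units INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO goal_file VALUES (?, ?)", [
    ("release-1.0", "gnome/fr.po"),
    ("release-1.0", "kde/fr.po"),
    ("later", "docs/fr.po"),
])
conn.executemany("INSERT INTO file_stats VALUES (?, ?, ?)", [
    ("gnome/fr.po", 100, 80),
    ("kde/fr.po", 50, 10),
    ("docs/fr.po", 200, 0),
])


def goal_progress(conn, goal):
    """Return (total_units, translated_units) summed over all files in a goal."""
    return conn.execute("""
        SELECT COALESCE(SUM(s.total_units), 0),
               COALESCE(SUM(s.translated_units), 0)
        FROM goal_file AS g
        JOIN file_stats AS s ON s.filename = g.filename
        WHERE g.goal = ?
    """, (goal,)).fetchone()
```

With the sample rows above, goal_progress(conn, "release-1.0") returns (150, 90), and a goal with no files returns (0, 0).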

CACHING TRANSLATION FILES IN THE DATABASE

If we go further and store parsed PO and XLIFF files in the database, we 
can associate stats information directly with units.

Thus, we'd create a subclass of TranslationStore (in storage/base.py in 
the Toolkit) which would store its units in the database. This would 
allow our existing tools to operate on database-backed translation 
stores as normal stores.

It also means that stats information for a unit will be associated using 
foreign key relations.
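
Here is a minimal sketch of such a store, assuming SQLite as the backend. DBStore and DBUnit are hypothetical names; in the real design DBStore would subclass TranslationStore, but here both are plain classes so the sketch runs without the Toolkit installed. Apart from addsourceunit, the method names and the schema are made up for illustration.

```python
import sqlite3


class DBUnit:
    """Minimal stand-in for a translation unit (a source/target pair)."""

    def __init__(self, unit_id, source, target):
        self.id = unit_id
        self.source = source
        self.target = target


class DBStore:
    """Hypothetical store that keeps its units in SQLite rather than in memory."""

    def __init__(self, conn, filename):
        self.conn = conn
        self.filename = filename
        conn.execute("PRAGMA foreign_keys = ON")
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS unit (
                id INTEGER PRIMARY KEY,
                filename TEXT NOT NULL,
                source TEXT NOT NULL,
                target TEXT NOT NULL DEFAULT ''
            );
            -- Stats hang off units through a foreign key, not off filenames.
            CREATE TABLE IF NOT EXISTS check_failure (
                unit_id INTEGER NOT NULL REFERENCES unit(id) ON DELETE CASCADE,
                check_name TEXT NOT NULL,
                PRIMARY KEY (unit_id, check_name)
            );
        """)

    def addsourceunit(self, source):
        """Add a unit with the given source text and return it."""
        cur = self.conn.execute(
            "INSERT INTO unit (filename, source) VALUES (?, ?)",
            (self.filename, source))
        return DBUnit(cur.lastrowid, source, "")

    def iter_units(self):
        """Return all units of this store, in insertion order."""
        rows = self.conn.execute(
            "SELECT id, source, target FROM unit WHERE filename = ? ORDER BY id",
            (self.filename,))
        return [DBUnit(*row) for row in rows]

    def mark_failed(self, unit, check_name):
        """Record that a unit fails the named quality check."""
        self.conn.execute(
            "INSERT OR IGNORE INTO check_failure VALUES (?, ?)",
            (unit.id, check_name))

    def failing(self, check_name):
        """Return the units of this store that fail the named check."""
        rows = self.conn.execute("""
            SELECT u.id, u.source, u.target
            FROM unit AS u
            JOIN check_failure AS f ON f.unit_id = u.id
            WHERE u.filename = ? AND f.check_name = ?
            ORDER BY u.id
        """, (self.filename, check_name))
        return [DBUnit(*row) for row in rows]
```

Because check failures reference units directly, "all units failing a check" is one join rather than a per-file stats lookup.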

**

BINDING EVERYTHING TOGETHER NEATLY

We'd have to design a query API that's built around the database design 
(something that takes the above ideas into account). This API should 
allow complicated queries including:

   * String searches,
   * Filtering by goal and assignments,
   * Filtering by quality checks.

The API should also provide unit update services. If a unit is updated, 
it should update the text indexing engine with the quality check 
information as well as the text content. If goal or assignment data is 
changed, the text indexing engine should also be updated.
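
The update side of such an API might look roughly like the sketch below. Everything in it is hypothetical: UnitService, RecordingIndex and toy_endpunc are invented names, the check is a toy and not the Toolkit's real one, and goals are attached to units purely to keep the example short. What it shows is that every change to a unit or its goals pushes a fresh document to the indexer.

```python
class RecordingIndex:
    """Stand-in for the text indexing engine: keeps one document per unit."""

    def __init__(self):
        self.documents = {}

    def replace(self, unit_id, document):
        self.documents[unit_id] = document


def _ending(text):
    """Return the final character if it is sentence punctuation, else ''."""
    return text[-1] if text and text[-1] in ".!?:" else ""


def toy_endpunc(source, target):
    """Toy quality check: passes when source and target end alike."""
    return _ending(source) == _ending(target)


class UnitService:
    """Hypothetical update service that keeps the index in step with the data."""

    def __init__(self, index, checks):
        self.index = index
        self.checks = checks  # name -> function(source, target) -> True if it passes
        self.units = {}       # unit_id -> {"source", "target", "goals"}

    def add_unit(self, unit_id, source, target="", goals=()):
        self.units[unit_id] = {
            "source": source, "target": target, "goals": set(goals)}
        self._reindex(unit_id)

    def update_target(self, unit_id, target):
        self.units[unit_id]["target"] = target
        self._reindex(unit_id)

    def set_goals(self, unit_id, goals):
        self.units[unit_id]["goals"] = set(goals)
        self._reindex(unit_id)

    def _reindex(self, unit_id):
        # Re-run the checks and send text, goals and check results together.
        unit = self.units[unit_id]
        failed = sorted(
            name for name, check in self.checks.items()
            if not check(unit["source"], unit["target"]))
        self.index.replace(unit_id, {
            "target": unit["target"],
            "goals": sorted(unit["goals"]),
            "failed_checks": failed,
        })
```

Routing all writes through one service like this is what keeps the denormalized copy in the indexer from drifting too far from the database.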

KEEP AGGREGATION IN MIND

Since the API is database-centered, it should make aggregated queries 
easy and efficient. Thus, we need to stop thinking in terms of stats 
that are associated with a single filename.

**

WHERE DOES VIRTAAL FIT IN?

If all of the above is implemented, Virtaal could directly use 
database-backed translation stores to do its work. Virtaal would 
instantly benefit from the same fast indexing code that would make 
Pootle fast. Complex queries on big files would be very fast, and 
Virtaal would use much less memory when dealing with large files.

This would mean excellent re-use of code between Virtaal & Pootle.

**

I know this was long, but I hope it was useful. Let me know what you think.

Thanks to Alaa for many of the ideas here.

Cheers
Wynand



------------------------------------------------------------------------------
_______________________________________________
Translate-pootle mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/translate-pootle
