Hi everyone. Or rather, hi programmers. This is a nice technical post for keen hackers.
Virtaal makes an appearance quite late in the show, but it's worth reading the entire post to understand where I'm going with this.

**

In Pootle, you can filter information in lots of interesting ways. Here are three scenarios:

* Display all files in a goal,
* Go through all units which fail a particular quality check,
* Search for a string in the targets of all units in a directory.

Some of these can be combined, and in the future it might be useful to let the user combine all of them in a single search.

**

Today, we store our indexing information in three places:

* Goals and assignments are stored in the Django database,
* Quality checks and stats are stored in the stats database,
* Text indices are stored in Xapian/Lucene.

This leads to inefficiencies:

* When searching for a string within a goal, Pootle gets a list of filename-unit pairs from the text indexer, but then, for each filename, it has to hit the Django database to check whether that file is part of the current goal.
* When searching for all units that fail a quality check within a goal, Pootle gets a list of filenames that fall within the goal, and then, for each file, has to hit the stats database to check whether the file contains any units that fail the current quality check.

**

How do we solve this? It depends on which component we're focusing on.

DENORMALIZATION

The text indexing engine (i.e. Xapian, Lucene, etc.) doesn't need to be 100% consistent with our data (we want consistency of course, but Pootle won't break horribly if things are a bit out of sync). Thus, we can duplicate stats, goal and assignment data into the text indexing engine. This is very convenient from a search perspective, since the user can do very complex searches which only hit the text indexing engine.

MERGING DATABASES

By storing stats information in the database used by Django, we can do complicated stats queries directly in Django's database.
The current model, where stats are associated with individual filenames, gets in the way of this: it's not possible to do queries over groups of files. It's also expensive, since we need to hit the stats database multiple times to get stats information for multiple files.

CACHING TRANSLATION FILES IN THE DATABASE

If we go further and store parsed PO and XLIFF files in the database, we can associate stats information directly with units. Thus, we'd create a subclass of TranslationStore (in storage/base.py in the Toolkit) which would store its units in the database. This would allow our existing tools to operate on database-backed translation stores just as they do on normal stores. It also means that stats information for a unit would be associated with it through foreign key relations.

**

BINDING EVERYTHING TOGETHER NEATLY

We'd have to design a query API that's very database-design centered (something that takes the above ideas into account). This API should allow complicated queries including:

* String searches,
* Filtering by goal and assignments,
* Filtering by quality checks.

The API should also provide unit update services. If a unit is updated, the API should update the text indexing engine with the quality check information as well as the text content. If goal or assignment data is changed, the text indexing engine should also be updated.

KEEP AGGREGATION IN MIND

Since the API is database-centered, it should make aggregated queries easy and efficient. Thus, we need to stop thinking in terms of stats that are associated with a single filename.

**

WHERE DOES VIRTAAL FIT IN?

If all of the above is implemented, Virtaal could always use database-backed translation stores directly to do its work. Virtaal would instantly benefit from the same fast indexing code that would make Pootle fast. Complex queries on big files would be very fast. And Virtaal would use much less memory when dealing with large files. This would mean excellent re-use of code between Virtaal & Pootle.
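
To make the combined and aggregated queries from the sections above a bit more concrete, here is a minimal sketch in Python using an in-memory SQLite database. Everything in it is an assumption for illustration only: the table and column names, the sample data, and the helpers `make_db`, `find_units` and `failures_per_goal` are hypothetical and are not Pootle's or the Toolkit's actual schema or API. A plain substring match stands in for the text indexing engine.

```python
import sqlite3

# Hypothetical schema: units live in the database and quality check
# failures point at units through foreign keys. None of these names
# come from Pootle or the Toolkit.
SCHEMA = """
CREATE TABLE store (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
CREATE TABLE unit (
    id INTEGER PRIMARY KEY,
    store_id INTEGER NOT NULL REFERENCES store(id),
    source TEXT NOT NULL,
    target TEXT NOT NULL
);
CREATE TABLE goal_store (
    goal TEXT NOT NULL,
    store_id INTEGER NOT NULL REFERENCES store(id)
);
CREATE TABLE failed_check (
    unit_id INTEGER NOT NULL REFERENCES unit(id),
    name TEXT NOT NULL
);
"""


def make_db():
    """Build an in-memory database with a little made-up sample data."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO store VALUES (?, ?)",
                   [(1, "af/browser.po"), (2, "af/mail.po")])
    db.executemany("INSERT INTO unit VALUES (?, ?, ?, ?)", [
        (1, 1, "Open file", "Open leer"),
        (2, 1, "Save file.", "Stoor leer"),
        (3, 2, "Send mail", "Stuur pos"),
        (4, 2, "Delete file", "Skrap leer."),
    ])
    db.executemany("INSERT INTO goal_store VALUES (?, ?)",
                   [("release", 1), ("release", 2), ("later", 2)])
    db.executemany("INSERT INTO failed_check VALUES (?, ?)",
                   [(2, "endpunc"), (4, "endpunc"), (3, "untranslated")])
    return db


def find_units(db, goal, check, text):
    """Units in `goal` that fail `check` and whose target contains `text`.

    One query replaces the per-file round trips between three data stores.
    """
    rows = db.execute("""
        SELECT s.path, u.id
        FROM unit u
        JOIN store s ON s.id = u.store_id
        JOIN goal_store g ON g.store_id = s.id
        JOIN failed_check c ON c.unit_id = u.id
        WHERE g.goal = ? AND c.name = ? AND instr(u.target, ?) > 0
        ORDER BY s.path, u.id
    """, (goal, check, text))
    return rows.fetchall()


def failures_per_goal(db, check):
    """Aggregate over groups of files: count units failing `check` per goal."""
    rows = db.execute("""
        SELECT g.goal, COUNT(*)
        FROM goal_store g
        JOIN unit u ON u.store_id = g.store_id
        JOIN failed_check c ON c.unit_id = u.id
        WHERE c.name = ?
        GROUP BY g.goal
        ORDER BY g.goal
    """, (check,))
    return rows.fetchall()
```

With this sample data, `find_units(make_db(), "release", "endpunc", "leer")` returns the two units that fail the check in the "release" goal, one from each file, and `failures_per_goal` gives a count per goal without ever looking at a single filename. In a real design the string match would presumably go through Xapian or Lucene rather than SQL.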
**

I know this was long, but I hope it was useful. Let me know what you think.

Thanks to Alaa for many of the ideas here.

Cheers
Wynand
