[translate-pootle] Speeding up import of big projects

Nicolas François Tue, 26 Dec 2006 19:22:35 -0800

Hello,

I'm trying to find ideas to speed up the import of big projects.



My first idea is to reduce the time needed to generate the stats files.

I implemented a C program to generate the *.po.stats and
pootle-<project>-<language>.stats files.

The C implementation is much faster. Its drawback is that the pofilter
checks (for the *.po.stats) are not run until the file is modified.
The memory footprint is also very light (although PootleServer
--refreshstats also has a low, non-growing memory footprint).

I would like to generate only the pootle-<project>-<language>.stats files,
make Pootle use only these files for the project indexes, and let
Pootle generate the .po.stats files when a user enters a
project/directory/language directory (this lowers the number of files
that need to be analyzed at a time). The *.po.stats could also be
generated/updated on a weekly basis.
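
A minimal sketch of the on-demand idea, in Python 3: a .po.stats file only
needs regenerating when the modification times it records no longer match
the files on disk. It assumes the first line of the stats file holds the
mtime of the PO file and of the .pending file, which is what the C program
attached below writes. The function name stats_are_fresh is hypothetical and
not part of Pootle.

```python
import os

def stats_are_fresh(po_path):
    """Return True if po_path + '.stats' records the current mtimes.

    Assumption: the first line of the stats file is
    '<PO mtime> <pending mtime>', as gen_simplestats writes it.
    """
    stats_path = po_path + ".stats"
    pending_path = po_path + ".pending"
    try:
        with open(stats_path) as stats:
            recorded = stats.readline().split()
        po_mtime = int(os.stat(po_path).st_mtime)
        # A missing .pending file is treated as stale in this sketch.
        pending_mtime = int(os.stat(pending_path).st_mtime)
    except OSError:
        return False
    return recorded == [str(po_mtime), str(pending_mtime)]
```

A caller would regenerate the stats for a file only when this returns False.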



Another idea was to avoid loading the PO files (just use the stats files).
This greatly reduces the time needed to display a project for the first
time after the Pootle server is started. It also greatly reduces the memory
usage of Pootle.
I added a loadpofiles argument to TranslationProject.scanpofiles() (which
defaults to False; this can be changed, I wanted to test the brutal way).
This requires loading the pofile in potimecache.__getitem__(), when needed.
(I also avoid calling scanpofiles in this __getitem__.)
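
The lazy-loading pattern can be illustrated on its own, outside Pootle. The
following is only a sketch in Python 3: LazyFileCache and fake_loader are
hypothetical names, and the loader stands in for the call to
pootlefile.pootlefile(project, key) made in the patch below.

```python
class LazyFileCache(dict):
    """Dictionary that builds a value the first time its key is read.

    `loader` is a stand-in for the real PO file constructor; it is an
    assumption of this sketch, not Pootle's API.
    """

    def __init__(self, loader):
        super().__init__()
        self.loader = loader

    def __missing__(self, key):
        # dict.__getitem__ calls this only when the key is absent.
        value = self.loader(key)
        self[key] = value
        return value


loaded = []

def fake_loader(name):
    loaded.append(name)          # record each real load
    return "parsed:" + name

cache = LazyFileCache(fake_loader)
cache["fr/app.po"]               # first access triggers the load
cache["fr/app.po"]               # second access is served from the dict
print(loaded)                    # prints ['fr/app.po']
```

Nothing is parsed at startup; the cost is paid on first access to each file.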


Another method which was amazingly slow is potree.hasgnufiles().
When the Pootle server has just started, this function is called once on
the project directory and later 4 times for each language.
One improvement was to call find instead of recursively analysing
the directory in Python.
Another was to cache the result.
The algorithm looks OK, but it may give different results from
the current one.

[Note: I just ran a test: if I request a language page for a project, the
cache is not used, so the cache could be indexed by podir and language.

Note also that I'm not convinced that the algorithm should differ between
the case where languagecode==None and the case where a languagecode is
provided.]
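
A sketch of a cache indexed by podir and language, as suggested in the note
above (Python 3). In the patch below, gnustylecache is keyed by directory
only, so a result stored for one language would be returned for any other
language asked about the same directory. The class name GnuStyleCache and
the `check` callback are hypothetical; `check` stands in for the uncached
hasgnufiles() logic.

```python
class GnuStyleCache:
    """Memoise a style check per (podir, languagecode) pair."""

    def __init__(self, check):
        self.check = check      # function(podir, languagecode) -> result
        self.results = {}

    def lookup(self, podir, languagecode=None):
        key = (podir, languagecode)
        if key not in self.results:
            self.results[key] = self.check(podir, languagecode)
        return self.results[key]
```

With this keying, the project-wide check (languagecode None) and each
per-language check are computed once each and never mixed up.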

Another function I could speed up is potree.getpofiles(). To retrieve the
list of PO files for a given language in a given project (for a GNU style
project), I also use find instead of walking through the directories with
Python.


Of course there are some drawbacks with these speedups:
 * the stats are not complete when generated with the small C program
 * the pofiles are not automatically loaded, thus they have to be loaded
   when a user asks for them.
 * using find is not portable (e.g. it won't work on Windows)
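
For the portability point, a pure-Python stand-in for the find command used
in getpofiles() could look like the sketch below (Python 3). It matches the
same file names as the find expression in the patch (<lang>.po, <lang>_XX.po
and <lang>_XXX.po) and returns paths relative to podir. The function name
find_gnu_pofiles is hypothetical, and no timing is claimed for it; os.walk
in current Python versions is built on os.scandir, which may narrow the gap
with find.

```python
import os
import re

def find_gnu_pofiles(podir, languagecode, poext="po"):
    """Portable stand-in for the find command used in getpofiles().

    Returns paths relative to podir for files named <lang>.<poext>,
    <lang>_XX.<poext> or <lang>_XXX.<poext>.
    """
    pattern = re.compile(
        r"^%s(_[A-Z]{2,3})?\.%s$"
        % (re.escape(languagecode), re.escape(poext)))
    matches = []
    for dirpath, dirnames, filenames in os.walk(podir):
        for name in filenames:
            if pattern.match(name):
                full = os.path.join(dirpath, name)
                matches.append(os.path.relpath(full, podir))
    return sorted(matches)
```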

Maybe some of these could be made options in pootle.prefs.
For example, an admin may know that a given project is really big and thus
specify in pootle.prefs the type of project (GNU style or not), and
whether she wants the files to be loaded on startup.


BTW, is there a way to tell a running Pootle server to reload files (some
specified files) without restarting Pootle?
In general, it would be nice to be able to tell Pootle "Hey, Pootle, I
will update xxx, forbid any access to it" and then "Hey Pootle, xxx was
updated, you can read it (and generate the stats)".
It would be nice when updating from CVS or svn.
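
The "forbid access, update, allow access again" exchange could be modelled
roughly as below (Python 3). This is purely illustrative: ProjectGate and
its methods are hypothetical names, and nothing like it is known to exist in
Pootle. The server would consult is_open() before serving a project, and the
update would run inside the with-block.

```python
import threading
from contextlib import contextmanager

class ProjectGate:
    """Tracks which projects are temporarily closed for an update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = set()

    def is_open(self, project):
        with self._lock:
            return project not in self._closed

    @contextmanager
    def updating(self, project):
        """Close `project` for the duration of the with-block."""
        with self._lock:
            self._closed.add(project)
        try:
            yield
        finally:
            # Reopen even if the update raised an exception.
            with self._lock:
                self._closed.discard(project)
```

Once the block exits, the caller would reload the files and regenerate the
stats for that project.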

Kind Regards,
-- 
Nekral

Index: Pootle/projects.py
===================================================================
--- Pootle/projects.py	(revision 42)
+++ Pootle/projects.py	(working copy)
@@ -61,7 +61,7 @@
     return True
 
 class potimecache(timecache.timecache):
-  """caches pootlefile objects, remembers time, and reverts back to statistics when neccessary..."""
+  """caches pootlefile objects, remembers time, and reverts back to statistics when necessary..."""
   def __init__(self, expiryperiod, project):
     """initialises the cache to keep objects for the given expiryperiod, and point back to the project"""
     timecache.timecache.__init__(self, expiryperiod)
@@ -72,8 +72,10 @@
     if key and not dict.__contains__(self, key):
       popath = os.path.join(self.project.podir, key)
       if os.path.exists(popath):
-        # update the index to pofiles...
-        self.project.scanpofiles()
+#        # update the index to pofiles...
+#        self.project.scanpofiles()
+#        Be lazy, just load what we need
+        self.project.pofiles[key] = pootlefile.pootlefile(self.project, key)
     return timecache.timecache.__getitem__(self, key)
 
   def expire(self, pofilename):
@@ -449,16 +451,17 @@
     goalnode.users = goalusers
     self.saveprefs()
 
-  def scanpofiles(self):
+  def scanpofiles(self, loadpofiles=False):
     """sets the list of pofilenames by scanning the project directory"""
     self.pofilenames = self.potree.getpofiles(self.languagecode, self.projectcode, poext=self.fileext)
-    for pofilename in self.pofilenames:
-      if not pofilename in self.pofiles:
-        self.pofiles[pofilename] = pootlefile.pootlefile(self, pofilename)
-    # remove any files that have been deleted since initialization
-    for pofilename in self.pofiles.keys():
-      if not pofilename in self.pofilenames:
-        del self.pofiles[pofilename]
+    if loadpofiles:
+      for pofilename in self.pofilenames:
+        if not pofilename in self.pofiles:
+          self.pofiles[pofilename] = pootlefile.pootlefile(self, pofilename)
+      # remove any files that have been deleted since initialization
+      for pofilename in self.pofiles.keys():
+        if not pofilename in self.pofilenames:
+          del self.pofiles[pofilename]
 
   def getuploadpath(self, dirname, pofilename):
     """gets the path of a po file being uploaded securely, creating directories as neccessary"""
@@ -1006,7 +1009,7 @@
               translatedwords, translated, fuzzywords, fuzzy, totalwords, total])
 
   def getquickstats(self, pofilenames=None):
-    """gets translated and total stats and wordcouts without doing calculations returning dictionary"""
+    """gets translated and total stats and wordcounts without doing calculations returning dictionary"""
     if pofilenames is None:
       pofilenames = self.pofilenames
     alltranslatedwords, alltranslated, allfuzzywords, allfuzzy, alltotalwords, alltotal = 0, 0, 0, 0, 0, 0
Index: Pootle/potree.py
===================================================================
--- Pootle/potree.py	(revision 42)
+++ Pootle/potree.py	(working copy)
@@ -41,6 +41,7 @@
     self.projects = instance.projects
     self.podirectory = instance.podirectory
     self.projectcache = {}
+    self.gnustylecache = {}
 
   def saveprefs(self):
     """saves any changes made to the preferences"""
@@ -305,7 +306,9 @@
   def isgnustyle(self, projectcode):
     """checks whether the whole project is a GNU-style project"""
     projectdir = os.path.join(self.podirectory, projectcode)
-    return self.hasgnufiles(projectdir)
+    if projectdir not in self.gnustylecache:
+      self.gnustylecache[projectdir] = self.hasgnufiles(projectdir)
+    return self.gnustylecache[projectdir]
 
   def addtranslationproject(self, languagecode, projectcode):
     """creates a new TranslationProject"""
@@ -355,39 +358,29 @@
 
   def hasgnufiles(self, podir, languagecode=None, depth=0, maxdepth=3, poext="po"):
     """returns whether this directory contains gnu-style PO filenames for the given language"""
-    #Let's check to see if we specifically find the correct gnu file
-    foundgnufile = False
+    if podir in self.gnustylecache:
+      return self.gnustylecache[podir]
+
     if not os.path.isdir(podir):
       return False
-    fnames = os.listdir(podir)
-    poext = os.extsep + "po"
-    subdirs = []
-    for fn in fnames:
-      if os.path.isdir(os.path.join(podir, fn)):
-        # if we have a language subdirectory, we're probably not GNU-style
-        if self.languagematch(languagecode, fn):
-          return False
-        #ignore hidden directories (like index directories)
-        if fn[0] == '.':
-          continue
-        subdirs.append(os.path.join(podir, fn))
-      elif fn.endswith(poext):
-        if self.languagematch(languagecode, fn[:-len(poext)]):
-          foundgnufile = True
-        elif not self.languagematch(None, fn[:-len(poext)]):
-          return "nongnu"
-    if depth < maxdepth:
-      for subdir in subdirs:
-        style = self.hasgnufiles(subdir, languagecode, depth+1, maxdepth)
-        if style == "nongnu":
-          return "nongnu"
-        if style == "gnu":
-          foundgnufile = True
-
-    if foundgnufile:
+    if languagecode == None:
+      languagecode = "[a-z]{2,3}"
+    # if we have a language subdirectory, we're probably not GNU-style
+    cmd='find %s -regextype posix-egrep -maxdepth %d -type d -regex ".*/%s(_[A-Z]{2,3})?" 2>/dev/null'%(podir, maxdepth, languagecode)
+    if len(os.popen(cmd).readlines()):
+      return False
+    # if we find a file with the given extension, but not named according to
+    # a language name, we're probably not GNU-style
+    cmd='find %s -regextype posix-egrep -maxdepth %d -type f -name "*.%s" -a \\( -regex ".*/[a-z]{2,3}(_[A-Z]{2,3})?\\.%s" -o -print \\) 2>/dev/null'%(podir, maxdepth, poext, poext)
+    if len(os.popen(cmd).readlines()):
+      return "nongnu"
+    # Otherwise, if we can find a file named according to a language name
+    # with the given extension, we're GNU-style
+    cmd='find %s -regextype posix-egrep -maxdepth %d -type f -regex ".*/%s(_[A-Z]{2,3})?\.%s" 2>/dev/null'%(podir, maxdepth, languagecode, poext)
+    if len(os.popen(cmd).readlines()):
       return "gnu"
-    else:
-      return ""
+    # Otherwise, we don't know
+    return ""
 
   def getcodesfordir(self, dirname):
     """returns projectcode and languagecode if dirname is a project directory"""
@@ -446,18 +439,22 @@
         basedirname = basedirname.replace(os.sep, "", 1)
       ponames = [fname for fname in fnames if fname.endswith(os.extsep+poext)]
       pofilenames.extend([os.path.join(basedirname, poname) for poname in ponames])
-    def addgnufiles(podir, dirname, fnames):
-      """adds the files to the set of files for this project"""
-      basedirname = dirname.replace(podir, "", 1)
-      while basedirname.startswith(os.sep):
-        basedirname = basedirname.replace(os.sep, "", 1)
-      ext = os.extsep + poext
-      ponames = [fn for fn in fnames if fn.endswith(ext) and self.languagematch(languagecode, fn[:-len(ext)])]
-      pofilenames.extend([os.path.join(basedirname, poname) for poname in ponames])
+#    def addgnufiles(podir, dirname, fnames):
+#      """adds the files to the set of files for this project"""
+#      basedirname = dirname.replace(podir, "", 1)
+#      while basedirname.startswith(os.sep):
+#        basedirname = basedirname.replace(os.sep, "", 1)
+#      ext = os.extsep + poext
+#      ponames = [fn for fn in fnames if fn.endswith(ext) and self.languagematch(languagecode, fn[:-len(ext)])]
+#      pofilenames.extend([os.path.join(basedirname, poname) for poname in ponames])
     pofilenames = []
     podir = self.getpodir(languagecode, projectcode)
     if self.hasgnufiles(podir, languagecode) == "gnu":
-      os.path.walk(podir, addgnufiles, podir)
+#      This is the generic way:
+#      os.path.walk(podir, addgnufiles, podir)
+#      But if we can use find, it is much much faster:
+      cmd = 'find %s \\( -name "%s.po" -o -name "%s_[A-Z][A-Z].po" -o -name "%s_[A-Z][A-Z][A-Z].po" \\) -printf "%%P\\n"'%(podir, languagecode, languagecode, languagecode)
+      pofilenames = [n[:-1] for n in os.popen(cmd).readlines()]
     else:
       os.path.walk(podir, addfiles, podir)
     return pofilenames

/**
 * Copyright (c) Nicolas François, 2006
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
 * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
 * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * Compile with:
 * gcc -o gen_simplestats gen_simplestats.c -lgettextpo -lgettextsrc
 */
#include <gettext-po.h>
#include <errno.h> /* errno */
#include <stdio.h> /* printf, fprintf, fopen */
#include <stdlib.h> /* malloc, free, exit */
#include <string.h> /* strdup, strtok, strlen, strncat */
#include <error.h> /* error */

//#define QUICKSTATS

#ifndef QUICKSTATS
/* stat */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Maximum length of stat lines */
#define BUF_LEN 16384

extern void textmode_xerror (int severity,
                             po_message_t message,
                             const char *filename, size_t lineno, size_t column,
                             int multiline_p, const char *message_text);
extern void textmode_xerror2 (int severity,
                              po_message_t message1,
                              const char *filename1, size_t lineno1, size_t column1,
                              int multiline_p1, const char *message_text1,
                              po_message_t message2,
                              const char *filename2, size_t lineno2, size_t column2,
                              int multiline_p2, const char *message_text2);
struct po_xerror_handler default_xerror_handler={textmode_xerror,
                                                 textmode_xerror2};

int wordcount(const char *msg)
{
    int cnt = 0;
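    /* strtok modifies its argument, so work on a private copy */
    char *tmp_orig = strdup(msg);
    char *tmp;
    if (NULL == tmp_orig)
    {
        perror ("strdup");
        exit (EXIT_FAILURE);
    }
    tmp = strtok(tmp_orig, " \t\n\r");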
    while (NULL != tmp)
    {
        cnt++;
        tmp = strtok(NULL, " \t\n\r");
    }
    free(tmp_orig);
    return cnt;
}

int main (int argc, char **argv)
{
    if (2 > argc)
    {
        fprintf (stderr, "Usage: %s <PO files>\n", argv[0]);
        return 1;
    }

    argv++;
    argc--;
    while (0 < argc)
    {
        int tmid = 0; /* # of translated messages */
        int fmid = 0; /* # of fuzzy messages */
        int tcnt = 0; /* # of words in the translated messages */
        int fcnt = 0; /* # of words in the fuzzy messages */
        int cnt = 0;  /* # of words */
        int n = -1;   /* index of the current message */
#ifndef QUICKSTATS
        struct stat s;
        char *tmpfilename;
        char total[BUF_LEN] = "total:";
        char blank[BUF_LEN] = "blank:";
        char translated[BUF_LEN] = "translated:";
        char fuzzy[BUF_LEN] = "fuzzy:";
        char untranslated[BUF_LEN] = "check-untranslated:";
        char msgidwordcounts[BUF_LEN] = "msgidwordcounts:";
        char msgstrwordcounts[BUF_LEN] = "msgstrwordcounts:";
        FILE *postats; /* file receiving the stats */
#endif

        const char *filename = argv[0];
#ifndef QUICKSTATS
        tmpfilename = malloc(strlen(filename)+10);
        if (NULL == tmpfilename)
        {
            perror ("malloc");
            exit (EXIT_FAILURE);
        }

        /* open the .stats file, then record the modification times of
         * the .po and .pending files on its first line */
        sprintf(tmpfilename, "%s.stats", filename);
        postats = fopen(tmpfilename, "w");
        if (NULL == postats)
        {
            perror("fopen");
            exit(EXIT_FAILURE);
        }
        /*   PO file modification time */
        if (0 != stat(filename, &s))
        {
            perror ("stat PO file");
            exit (EXIT_FAILURE);
        }
        fprintf(postats, "%ld", (long) s.st_mtime);
        /*   pending file modification time */
        sprintf(tmpfilename, "%s.pending", filename);
        if (0 != stat(tmpfilename, &s))
        {
            /* Create the pending file if it does not exist */
            FILE *pending = fopen(tmpfilename, "w");
            if (NULL == pending)
            {
                perror ("fopen pending");
                exit (EXIT_FAILURE);
            }
            fclose(pending);
            if (0 != stat(tmpfilename, &s))
            {
                perror ("stat2");
                exit (EXIT_FAILURE);
            }
        }
        fprintf(postats, " %ld\n", (long) s.st_mtime);
#endif

        /* Now read the PO file and retrieve the stats */
        po_file_t file = po_file_read (filename, &default_xerror_handler);

        if (file == NULL)
            error (EXIT_FAILURE, errno, "couldn't open the PO file %s",
                                        filename);
        else
        {
            po_message_iterator_t iterator = po_message_iterator (file, NULL);
            for (;;)
            {
                po_message_t message = po_next_message (iterator);
                if (message == NULL)
                    break;
                else
                {
                    if (po_message_is_obsolete(message))
                    {
                        /* Do not use obsolete messages */
                        continue;
                    }
                    else
                    {
                        const char *msgid = po_message_msgid (message);
                        if (msgid[0] == '\0')
                        {
                            /* Header */
                            continue;
                        }
                        else
                        {
                            const char *msgstr = po_message_msgstr (message);
#ifndef QUICKSTATS
                            const char *msgstr_pl;
                            char bufcnt[32];
                            char bufcnt_str[32];
                            /* index of the plural translated string */
                            int msgstr_n = 1;
                            /* representation of the current index */
                            char bufid[32];
                            int wcnt_str = wordcount(msgstr);
#endif
                            int wcnt = wordcount(msgid);
                            n++;
#ifndef QUICKSTATS
                            sprintf(bufid, "%d,", n);
                            /* strncat's size is the room left, not the buffer size */
                            strncat(total, bufid, BUF_LEN - strlen(total) - 1);
#endif

                            /* # of words in this msgid */
                            if (NULL != (msgid = po_message_msgid_plural (message)))
                            {
                                int wcnt_pl = wordcount(msgid);
#ifndef QUICKSTATS
                                sprintf(bufcnt, "%d/%d,", wcnt, wcnt_pl);
#endif
                                wcnt += wcnt_pl;
                            }
#ifndef QUICKSTATS
                            else
                            {
                                sprintf(bufcnt, "%d,", wcnt);
                            }
                            strncat(msgidwordcounts, bufcnt, BUF_LEN - strlen(msgidwordcounts) - 1);
#endif
                            cnt += wcnt;

#ifndef QUICKSTATS
                            /* # of words in the msgstr */
                            sprintf(bufcnt_str, "%d", wcnt_str);
                            strncat(msgstrwordcounts, bufcnt_str, BUF_LEN - strlen(msgstrwordcounts) - 1);
                            while (NULL != (msgstr_pl = po_message_msgstr_plural(message, msgstr_n)))
                            {
                                sprintf(bufcnt_str, "/%d", wordcount(msgstr_pl));
                                strncat(msgstrwordcounts, bufcnt_str, BUF_LEN - strlen(msgstrwordcounts) - 1);
                                msgstr_n++;
                            }
                            strncat(msgstrwordcounts, ",", BUF_LEN - strlen(msgstrwordcounts) - 1);
#endif

                            /* classify */
                            if (msgstr[0] == '\0')
                            {
                                /* Untranslated */
#ifndef QUICKSTATS
                                strncat(untranslated, bufid, BUF_LEN - strlen(untranslated) - 1);
                                strncat(blank, bufid, BUF_LEN - strlen(blank) - 1);
#endif
                            }
                            if (po_message_is_fuzzy(message))
                            {
                                /* Fuzzy */
                                fmid++;
                                fcnt+=wcnt;
#ifndef QUICKSTATS
                                strncat(fuzzy, bufid, BUF_LEN - strlen(fuzzy) - 1);
#endif
                            }
                            else if (msgstr[0] != '\0')
                            {
                                /* Translated */
                                tmid++;
                                tcnt+=wcnt;
#ifndef QUICKSTATS
                                strncat(translated, bufid, BUF_LEN - strlen(translated) - 1);
#endif
                            }
                        }
                    }
                }
            }
            po_message_iterator_free (iterator);
        }
        po_file_free (file);

#ifndef QUICKSTATS
        /* Remove the ending commas */
        if (total[strlen(total)-1] == ',')
            total[strlen(total)-1] = '\0';
        if (translated[strlen(translated)-1] == ',')
            translated[strlen(translated)-1] = '\0';
        if (untranslated[strlen(untranslated)-1] == ',')
            untranslated[strlen(untranslated)-1] = '\0';
        if (blank[strlen(blank)-1] == ',')
            blank[strlen(blank)-1] = '\0';
        if (fuzzy[strlen(fuzzy)-1] == ',')
            fuzzy[strlen(fuzzy)-1] = '\0';
        if (msgidwordcounts[strlen(msgidwordcounts)-1] == ',')
            msgidwordcounts[strlen(msgidwordcounts)-1] = '\0';
        if (msgstrwordcounts[strlen(msgstrwordcounts)-1] == ',')
            msgstrwordcounts[strlen(msgstrwordcounts)-1] = '\0';

        /* Print the stats */
        fprintf(postats, "%s\n", blank);
        fprintf(postats, "%s\n", total);
        fprintf(postats, "%s\n", fuzzy);
        fprintf(postats, "%s\n", untranslated);
        fprintf(postats, "%s\n", translated);
        fprintf(postats, "%s\n", msgidwordcounts);
        fprintf(postats, "%s\n", msgstrwordcounts);
        fclose(postats);
        free(tmpfilename); /* allocated once per PO file */
#endif
        /* Print the quick stats on stdout */
        printf("%s, %d, %d, %d, %d, %d, %d\n",
               filename,
               tcnt, tmid,
               fcnt, fmid,
               cnt, n+1);

        /* Next PO file */
        argv++;
        argc--;
    }

    return 0;
}
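
On the consuming side, each line the program prints on stdout could be read
back as in the sketch below (Python 3). The field order follows the printf
call above: file name, translated words, translated messages, fuzzy words,
fuzzy messages, total words, total messages. The names QuickStats and
parse_quickstats_line are hypothetical and not part of Pootle.

```python
from collections import namedtuple

QuickStats = namedtuple(
    "QuickStats",
    "filename translatedwords translated fuzzywords fuzzy totalwords total")

def parse_quickstats_line(line):
    """Parse 'name, tw, t, fw, f, totw, tot' as printed by gen_simplestats."""
    # Split from the right so a comma inside the file name is preserved.
    fields = line.rstrip("\n").rsplit(", ", 6)
    if len(fields) != 7:
        raise ValueError("malformed quick stats line: %r" % line)
    return QuickStats(fields[0], *[int(f) for f in fields[1:]])
```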

gen_simplestats.sh
Description: Bourne shell script
