Hi all,

I've been playing with replacing the two word-break algorithms in libtracker-fts (a custom one for non-CJK text and a Pango-based one for CJK) with a single one based on GNU libunistring (LGPLv3). Note that libicu (ICU license) would probably also be a good choice instead of libunistring.

http://www.gnu.org/software/libunistring
http://site.icu-project.org
The first impressions have been good so far. I did some functionality and performance tests, and at least in my tests the libunistring-based parser seems to behave better in both respects. I would be glad if someone other than me ran some performance comparisons, anyway.

You can get the "parser-libunistring-review" branch from git.gnome.org. The libunistring-based parser is enabled by configuring with --enable-libunistring, or automatically if the libunistring-dev package is installed. If you do neither of these two things, the old parser is used. I'm not sure, anyway, whether this option should stay if the libunistring work is accepted.

Performance (single-CPU Celeron)
----------------------------------

1) Recursively indexing /usr/src/linux-headers-2.6.32-21 (73 MB, 23410 processed & unprocessed files):
   - current parser: 21m (0.0538 s/file)
   - libunistring-based parser: 19m26s (0.0498 s/file)

2) Recursively indexing libunistring's sources (1425 processed files):
   - current parser: 340s (0.239 s/file)
   - libunistring-based parser: 185s (0.130 s/file)

3) Memory usage. Total memory usage reported by valgrind in tracker-store was quite similar in my tests, maybe a little higher with the libunistring-based parser, but probably because more words were processed (see Pending issues below). I tried to use stack buffers as much as possible, so for the same set of words processed, the libunistring-based parser should be friendlier to the heap. Philip, could you check anyway how the memory fragmentation looks?

Functionality issues
----------------------------------

1) https://bugzilla.gnome.org/show_bug.cgi?id=579756
The current version fails to detect word boundaries properly in NFD-normalized text, because it treats combining marks as word breakers.
Thus, when searching for "école" (NFC) or "ecole", tracker only reports files containing the NFC version of "école", not those with the NFD version. With the libunistring-based parser, all searches return the proper results, whether the text is NFC or NFD.

2) Case folding for string comparison is also broken in the current version. I did a simple test with two text files, one containing the German word "groß" and the other the case-folded version "gross". A proper search should return both results when searching for either one, but the current version fails to return the "gross" result when searching for "groß". The issue is that the non-CJK parser lowercases strings character by character, while the proper approach is to case-fold the entire word at once. With the libunistring-based parser, all searches return the proper results; even better, the method doing the case-folding can also do normalization in the same pass if desired.

3) Word breaking. With the current parser, a mixed CJK/non-CJK string is parsed entirely by either the CJK parser or the non-CJK parser (the choice between one and the other is made by looking at the first characters of the string), which is probably not a good approach. With the libunistring-based parser, word breaking always uses the same algorithm for CJK, non-CJK, and mixed CJK/non-CJK strings.

4) NFC normalization & UNAC. UNAC does not work with NFD-normalized strings, so NFC normalization must always happen before UNAC. The current parser doesn't do this, so even if the word-break algorithm were fixed to solve the "école" issue, UNAC would still not strip the accent from a word that arrives NFD-normalized. This is already fixed in the libunistring-based parser (as said before, libunistring's case-folding method can produce NFC-normalized output directly, so there is no need for g_utf8_normalize()).
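Issues 1 and 2 are easy to reproduce outside tracker. Here is a minimal Python sketch using only the stdlib unicodedata module (not the tracker or libunistring code) showing why unnormalized indexing misses the NFD spelling, and why per-character lowercasing is not full case folding:

```python
import unicodedata

# Issue 1: the same word in NFC and NFD is a different code point
# sequence, so an index that never normalizes stores them as two
# distinct terms and a search for one misses the other.
nfc = unicodedata.normalize("NFC", "école")   # precomposed U+00E9
nfd = unicodedata.normalize("NFD", "école")   # 'e' followed by U+0301
assert nfc != nfd                                  # byte-wise different
assert unicodedata.normalize("NFC", nfd) == nfc    # equal once normalized

# Issue 2: German ß lowercases to itself, but case-folds to "ss".
# Lowercasing character by character can therefore never match
# "groß" against "gross"; folding the whole word does.
assert "groß".lower() == "groß"        # character-by-character lowercasing
assert "groß".casefold() == "gross"    # full Unicode case folding
assert "GROSS".casefold() == "groß".casefold()
```

In libunistring the last two points are handled by a single call (its case-folding function can also emit normalized output in the same pass), which is what the branch relies on.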
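Issue 4 (and pending issue 2 below) can be illustrated the same way. The sketch below is a hypothetical stdlib stand-in for UNAC, not its real API: it strips accents by decomposing and dropping combining marks, which works on any input normalization, but its output is left in NFKD rather than NFC:

```python
import unicodedata

def strip_accents(word):
    # Hypothetical UNAC-style stripping: decompose to NFKD, then drop
    # combining marks (general category Mn). The result stays in NFKD.
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# Stripping works for both NFC and NFD input once we decompose first.
nfc = unicodedata.normalize("NFC", "école")
nfd = unicodedata.normalize("NFD", "école")
assert strip_accents(nfc) == strip_accents(nfd) == "ecole"
```

The real UNAC skips the internal decomposition step for NFD input, which is why tracker has to guarantee NFC input before calling it.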
Pending issues
----------------------------------

1) The current non-CJK word-break algorithm assumes that a word starts with a letter, a number, or an underscore (correct me if I'm wrong, please). I'm not sure why the underscore, but anyway, in the libunistring-based parser I also included any symbol as a valid word-starter character. This means lots of new words are being considered, especially when parsing source code (things like '+', '-' and such). Symbols should probably be removed from the list of valid word-starter characters; suggestions welcome.

2) UNAC needs NFC input, but its output is not NFC: it is the unaccented string in NFKD normalization. I avoided an extra normalization back to NFC, but I'm not sure how it should go. This applies to both the non-libunistring and libunistring versions of the parser.

3) libunistring currently finds all word breaks in the whole input string in a single function call. This could be improved so that words are found one by one, which would allow stopping the word-break operation at any time. I already asked about this on the libunistring mailing list and the author added it to his TODO list.

Comments welcome!

--
Aleksander

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
