Hi Jamie,

> A few comments I would ask (I have only looked at your unicode parsers
> in the libtracker-fts directory in your branch, so apologies if my
> assumptions are wrong):
>
> 1) I assume the glib parser in your benchmarks is the tracker parser
> unmodified?
Yes, didn't touch it.

> 2) Tracker parser ignores words that start with numbers or odd
> characters (only a..z/A..Z or underscore is allowed for the first
> character, the latter so that C function names get indexed). This
> keeps a lot of useless junk out of the FTS index and will almost
> certainly account for the discrepancies in word counts (including
> using Pango) in your benchmarks?
>
> (I see from your comments you allow words beginning with numbers in
> your unicode implementations)

Yes and no, I would say. I enabled number-only words with this bug in mind:

https://bugzilla.gnome.org/show_bug.cgi?id=503366

Of course, that could be made configurable if needed.

Some of the discrepancies in the word counts will probably come from this allowance of words starting with numbers, and some from allowing any symbol as a word starter. But there are still some wrong word breaks when the input text comes decomposed in NFD form. The glib-based parser would need to be modified so that NFC normalization is done as soon as the string is set in the parser, but that is quite difficult given that the start/end offsets of the original words need to be preserved for the offsets() and snippet() FTS methods. I didn't find any normalization method that keeps track of the original offsets.

> 3) UNAC benchmarking would also make sense, as it converts to UTF-16
> to perform accent stripping. Of course, if word breaking is faster in
> UTF-16 then it might give your unicode libs some advantage in the
> benchmarks?

Well, the libunistring-based parser uses exactly the same unaccenting method as the glib-based parser, as both have UTF-8 as input and output. UNAC processing will probably be faster with libicu, as in that case the UChars passed as input to the unaccenting method are already in UTF-16, so only a conversion to UTF-16BE (libunac always expects big-endian UTF-16) is needed. So, for the benchmarking, UNAC is not really an issue, I would say.
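One way around the offsets problem would be to normalize each extracted word on its own instead of normalizing the whole input string. A minimal sketch in Python (the function name is made up, and splitting on spaces stands in for the real word-break rules):

```python
import unicodedata

def parse_words_preserving_offsets(text):
    """Hypothetical sketch: record byte offsets into the ORIGINAL
    (possibly NFD) string, then NFC-normalize each word individually,
    so the stored offsets stay valid for FTS offsets()/snippet().
    Splitting on spaces is a stand-in for real word-break logic."""
    words = []
    byte_pos = 0
    for chunk in text.split(' '):
        nbytes = len(chunk.encode('utf-8'))
        if chunk:
            words.append((byte_pos, byte_pos + nbytes,
                          unicodedata.normalize('NFC', chunk)))
        byte_pos += nbytes + 1  # +1 for the space separator
    return words

# "café" in NFD form: 'e' followed by a combining acute accent (6 bytes).
decomposed = 'cafe\u0301 menu'
print(parse_words_preserving_offsets(decomposed))
```

The word text handed to the index is NFC, while each (start, end) pair still indexes into the untouched original buffer, which is exactly the property a whole-string normalization pass would lose.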
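The endianness point can be illustrated with a toy conversion (Python here for brevity; the real code would hand ICU's UChar buffer to libunac): whatever the host byte order, the bytes given to the unaccenting step are forced into big-endian UTF-16 with no BOM.

```python
def to_utf16be(text):
    # Force big-endian UTF-16 (no BOM), regardless of host byte order,
    # which is the form the accent-stripping step is assumed to expect.
    return text.encode('utf-16-be')

print(to_utf16be('\u00e9').hex())  # U+00E9 'é' -> '00e9'
```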
> 4) I personally feel that whatever parser we use, it should perform
> optimally for ASCII, as it's more prevalent in source code and
> indexing source code is really CPU intensive. We could of course use
> a unicode lib for the non-ASCII stuff. I note you include some ASCII
> checking in your unicode stuff, but it's not used for word breaking
> but for UNAC eligibility, and it causes an additional iteration over
> the characters in the word (the tracker one tests for ASCII whilst
> doing the word-breaking iteration).

Yes, you are right: the extra ASCII check was not needed in the original version, but it was really needed to improve the performance of the unicode-based parsers in the cases where UNAC stripping is not needed.

Apart from that, the performance difference between the glib-parser tests and the unicode-based parsers is really not comparable: if all of them processed the same number of words, it seems that both the libunistring-based and the libicu-based parsers would behave better even for ASCII, and all the normalization and case-folding issues would be solved, with a single implementation for any kind of input string (even mixed CJK and non-CJK).

Cheers!

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
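The single-pass idea described in point 4 (deciding ASCII-ness during the same iteration that finds the word boundary, so no second pass is needed for UNAC eligibility) could be sketched like this; the function name and the isalnum()-based boundary test are illustrative stand-ins, not the real tracker logic:

```python
def next_word(text, start):
    """Return (word, end_index, is_ascii) in one pass, or None at end.

    Hypothetical sketch: the ASCII flag is computed during the same
    character iteration that finds the word boundary, avoiding an
    extra walk over the word just to decide UNAC eligibility."""
    n = len(text)
    i = start
    while i < n and not text[i].isalnum():  # skip separators
        i += 1
    if i >= n:
        return None
    word_start = i
    is_ascii = True
    while i < n and text[i].isalnum():
        if ord(text[i]) > 127:
            is_ascii = False                # checked in the same loop
        i += 1
    return (text[word_start:i], i, is_ascii)

print(next_word('foo café', 0))  # ('foo', 3, True)
```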
