Hi all again,

> I've been playing with substituting the two word break algorithms in
> libtracker-fts (custom for non-CJK and pango-based for CJK) with a
> single one using GNU libunistring (LGPLv3). Note that libicu (ICU
> license) is also probably a good choice instead of libunistring.
>
> http://www.gnu.org/software/libunistring
> http://site.icu-project.org
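For anyone who hasn't looked at the libunistring API yet, the word-break
part boils down to something like the following minimal sketch. It is
illustrative only, not the actual code in the branch; the sample string,
the printing of segments and the compile command are just placeholders:

  /* Minimal sketch: word breaks over a UTF-8 string with GNU libunistring.
   * Compile with something like: gcc wbrk-sketch.c -lunistring
   * (illustrative only; error handling omitted) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniwbrk.h>   /* u8_wordbreaks() */

  int
  main (void)
  {
    const char *text = "Hello 世界 école";   /* placeholder input */
    size_t len = strlen (text);
    char *breaks = malloc (len);
    size_t start = 0, i;

    /* breaks[i] == 1 means there is a word break between text[i-1] and text[i];
     * note that libunistring works directly on the UTF-8 bytes. */
    u8_wordbreaks ((const uint8_t *) text, len, breaks);

    for (i = 1; i <= len; i++)
      {
        if (i == len || breaks[i])
          {
            /* Each segment still needs filtering: it may be whitespace or
             * punctuation rather than an actual word. */
            printf ("segment: '%.*s'\n", (int) (i - start), text + start);
            start = i;
          }
      }

    free (breaks);
    return 0;
  }

Note that u8_wordbreaks() reports breaks around whitespace and punctuation
too, so the parser still has to decide which segments are real words; that
is related to pending issue 1 re-pasted at the end of this mail.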
I developed the libicu-based parser using its unicode algorithms for
word-breaking, normalization and such, as I did for GNU libunistring last
week, and made some tests to compare all three implementations
(libunistring-based, libicu-based, glib/pango-based).

You can get the changes from the 'parser-unicode-libs-review' branch in
gnome git. I added a new option in configure to select the desired unicode
support library:

  --with-unicode-support=[glib|libunistring|libicu]

Currently it defaults to 'glib' if not specified.

I also developed a tester which uses the parser in libtracker-fts,
available in tests/libtracker-fts/tracker-parser-test.c. Once compiled,
you can use --file to specify the file to parse.

I did several tests using the new tester, which seem to be more accurate
than the first tests I did last week, as these new results only depend on
the parser implementation, and not on the miner-fs for example. Attached
is a short spreadsheet with some numbers I got using my set of test files.
I measured three different things:
 * The time it takes each parser to parse each file.
 * The number of words obtained with each parser in each file.
 * The contents of the output words.

All the result files are available at:
http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/

Some conclusions from the tests:

1) Both the libunistring-based and libicu-based parsers produce exactly
the same output in all tests: same number of words, same word contents.

2) The number of words detected by the glib (custom/pango) parser, and
their contents, are usually completely different from the other two:
 * In a Chinese-only file, for example, libunistring/libicu both detect
   1202 words, while the glib (custom/pango) parser detects only 188.
 * In a file with mixed languages, glib (custom/pango) detects 22105
   words while the others detect 33472 words.

3) GNU libunistring seems to be around 9%-10% faster than libicu,
probably because of the conversions to/from UChars, which are UTF-16
encoded strings; libunistring's API can work directly with UTF-8 (see the
sketch after this list). This comparison is fair, considering that both
parsers produce exactly the same output.

4) The glib (custom/pango) timing results are almost all better than the
libunistring/libicu ones. This is not surprising, as the glib parser
detects far fewer words, so these timings cannot really be compared.

5) Pango-based word break is really slow. In a 180k mixed-language file:
 * libunistring needed 1.01 seconds
 * libicu needed 1.10 seconds
 * glib (pango) needed 22 seconds!

6) More situations where the glib (custom/pango) parser doesn't work
properly:
 * When the input string is decomposed (NFD), as with the "école" issue
   in the testcaseNFD.txt file in the tests.
 * Special case-folding cases, as with the "groß/gross" issue in the
   gross-1.txt file in the tests.
Both libunistring and libicu behave correctly in these cases.
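To make point 3 above more concrete, this is roughly what the libicu path
looks like: the UTF-8 input first has to be converted to UChars (UTF-16),
and a UBreakIterator then pulls words out one at a time. Again a minimal
sketch, not the code in the branch; error handling is mostly omitted and
the sample string, the fixed-size word buffer and the "en" locale are
just placeholders:

  /* Minimal sketch: one-by-one word iteration with libicu's C API.
   * Compile with something like: gcc ubrk-sketch.c -licuuc
   * (illustrative only; error handling mostly omitted) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unicode/ustring.h>   /* u_strFromUTF8(), u_strToUTF8() */
  #include <unicode/ubrk.h>      /* UBreakIterator */

  int
  main (void)
  {
    const char *utf8 = "Hello 世界 école";   /* placeholder input */
    UErrorCode status = U_ZERO_ERROR;
    UChar *ustr;
    int32_t ulen = 0;
    UBreakIterator *bi;
    int32_t start, end;

    /* Extra step compared to libunistring: convert UTF-8 to UChars (UTF-16).
     * The first call only computes the required length (pre-flighting). */
    u_strFromUTF8 (NULL, 0, &ulen, utf8, -1, &status);
    status = U_ZERO_ERROR;
    ustr = malloc ((ulen + 1) * sizeof (UChar));
    u_strFromUTF8 (ustr, ulen + 1, NULL, utf8, -1, &status);

    /* Word break iterator; the "en" locale is a placeholder */
    bi = ubrk_open (UBRK_WORD, "en", ustr, ulen, &status);

    start = ubrk_first (bi);
    for (end = ubrk_next (bi); end != UBRK_DONE; start = end, end = ubrk_next (bi))
      {
        char word[256];          /* placeholder fixed-size buffer */
        int32_t wlen = 0;

        /* Skip segments that are only spaces or punctuation */
        if (ubrk_getRuleStatus (bi) < UBRK_WORD_NONE_LIMIT)
          continue;

        /* Convert the word back to UTF-8 for the rest of the pipeline */
        u_strToUTF8 (word, sizeof (word), &wlen, ustr + start, end - start, &status);
        printf ("word: '%.*s'\n", (int) wlen, word);
      }

    ubrk_close (bi);
    free (ustr);
    return 0;
  }

This is also why pending issue 3 below only remains for libunistring: with
libicu the iteration is already incremental, so the parser could stop the
word-break operation at any time.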
Finally, I re-paste the pending issues, as they are still the same:

> Pending issues
> ----------------------------------
> 1) The current non-CJK word-break algorithm assumes that a word starts
> either with a letter, a number or an underscore (correct me if wrong,
> please). Not sure why the underscore, but anyway in the
> libunistring-based parser I also included any symbol as a valid word
> starter character. This actually means that lots of new words are being
> considered, especially when parsing source code (things like '+', '-'
> and such). Probably symbols should be removed from the list of valid
> word starter characters, so suggestions welcome.

This now applies to both the libunistring-based and libicu-based parsers.

> 2) UNAC needs NFC input, but the output of UNAC is not NFC, it's the
> unaccented string in NFKD normalization. I avoided an extra
> normalization back to NFC, but I'm not sure how it should go. This
> applies to both non-libunistring and libunistring versions of the
> parser.

This now applies to all 3 parsers.

> 3) libunistring currently finds all word breaks in the whole input
> string in a single function call. This could be improved so that words
> are found one by one, which would allow stopping the word-break
> operation at any time. I already asked about this on the libunistring
> mailing list and the author added it to his TODO list.

This still applies to libunistring. libicu can already do a one-by-one
word search (with UChars), as in the sketch above.

Comments welcome,

--
Aleksander
Attachment: unicode-libraries-report.ods (application/vnd.oasis.opendocument.spreadsheet)
