Hi all again,

> 
> I've been playing with substituting the two word break algorithms in
> libtracker-fts (custom for non-CJK and pango-based for CJK) with a
> single one using GNU libunistring (LGPLv3). Note that libicu (ICU
> license) is also probably a good choice instead of libunistring.
> http://www.gnu.org/software/libunistring
> http://site.icu-project.org
> 

I have now developed the libicu-based parser, using libicu's Unicode
algorithms for word-breaking, normalization and so on, as I did for GNU
libunistring last week. I also ran some tests to compare all three
implementations (libunistring-based, libicu-based, glib/pango-based).

You can get the changes from the 'parser-unicode-libs-review' branch in
GNOME git.

I added a new configure option to select the desired Unicode support
library:
  --with-unicode-support=[glib|libunistring|libicu]
It currently defaults to 'glib' if not specified.

I also developed a tester that uses the parser in libtracker-fts,
available in tests/libtracker-fts/tracker-parser-test.c.
Once compiled, you can use --file to specify the file to parse.

I ran several tests with the new tester. They seem to be more accurate
than the first tests I did last week, because the results now depend
only on the parser implementation and not on, for example, the
miner-fs.

Attached is a short spreadsheet with some numbers I got using my set of
test files. I measured three different things:
 * The time it takes for each parser to parse each file.
 * The number of words obtained with each parser in each file.
 * The contents of the output words.

All the result files are available at:
http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/

Some conclusions from the tests:

1) The libunistring-based and libicu-based parsers produce exactly the
same output in all tests: same number of words, same word contents.

2) The number of words detected by the glib(custom/pango) parser, and
their contents, usually differ considerably from what the other two
parsers detect:
 * In a Chinese-only file, for example, libunistring/libicu both detect
1202 words, while the glib(custom/pango) parser detects only 188.
 * In a file with mixed languages, glib(custom/pango) detects 22105
words while the others detect 33472 words.

3) GNU libunistring seems to be around 9%-10% faster than libicu,
probably because of the conversions to/from UChars (UTF-16 encoded
strings) that libicu needs, whereas libunistring's API can work directly
with UTF-8. This is a fair comparison, since both parsers produce
exactly the same output.
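
To illustrate the extra conversion step mentioned in point 3, here is a
minimal Python sketch (not the parser code; the helper name
to_utf16_units is made up for this example). It only shows that a
UTF-16 based API needs the UTF-8 input transcoded into a second buffer
first; it says nothing about the actual cost of that step in libicu.

```python
def to_utf16_units(utf8_bytes: bytes) -> list[int]:
    """Transcode UTF-8 bytes into a list of UTF-16 code units.

    Hypothetical helper, for illustration only.
    """
    text = utf8_bytes.decode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
    # Every UTF-16 code unit occupies two bytes.
    return [int.from_bytes(utf16[i:i + 2], "little")
            for i in range(0, len(utf16), 2)]


sample = "école 中文".encode("utf-8")
units = to_utf16_units(sample)
print(len(sample), "UTF-8 bytes ->", len(units), "UTF-16 code units")
```

A UTF-8 based API can run directly on the original bytes and skip this
copy, which is one plausible source of the difference measured above.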

4) Almost all glib(custom/pango) timings are better than those of
libunistring/libicu. This is not surprising, as the glib parser detects
far fewer words. These timing values therefore cannot really be
compared.

5) Pango-based word break is really slow. In a 180k mixed-language file:
 * libunistring needed 1.01 seconds
 * libicu needed 1.10 seconds
 * glib(pango) needed 22 seconds!
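
As a quick check of the ratios behind points 3 and 5, the three timings
above can be compared directly. This small Python sketch uses only the
numbers quoted in this mail; the variable names are arbitrary.

```python
# Timings reported above for the 180k mixed-language file, in seconds.
timings = {"libunistring": 1.01, "libicu": 1.10, "glib(pango)": 22.0}

baseline = timings["libunistring"]
ratios = {name: seconds / baseline for name, seconds in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {timings[name]:.2f} s ({ratio:.2f}x libunistring)")
```

For this file libicu takes roughly 9% longer than libunistring, and the
pango-based word break takes roughly 22 times longer.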

6) More situations where the glib(custom/pango) parser doesn't work
properly:
 * When the input string is decomposed (NFD), as with the "école issue"
in the testcaseNFD.txt file in the tests.
 * Special case-folding cases, as with the "groß/gross issue" in the
gross-1.txt file in the tests.
Both libunistring and libicu behave correctly in these cases.
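
As a rough illustration of these two cases (in Python, not the parser
code), the sketch below shows why decomposed input and special case
folding matter for word matching. The index_key helper is a made-up
name, and normalizing to NFC before case folding is an assumption of
this example, not necessarily what any of the three parsers do.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "école")  # 'é' is one code point, U+00E9
nfd = unicodedata.normalize("NFD", "école")  # 'e' followed by U+0301


def index_key(word: str) -> str:
    """Hypothetical helper: the form of a word used for matching."""
    return unicodedata.normalize("NFC", word).casefold()


# The two spellings look identical but differ as code point sequences.
print(nfc == nfd, len(nfc), len(nfd))

# After normalization both map to the same key.
print(index_key(nfc) == index_key(nfd))

# Plain lowercasing keeps 'ß', while full case folding maps it to 'ss'.
print("groß".lower() == "gross", index_key("groß") == index_key("GROSS"))
```

A parser that compares raw code points without normalizing, or that
only lowercases, would treat these as different words.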

Finally, I'm pasting the pending issues again, as they are still the
same:

> 
> Pending issues
> ----------------------------------
> 1) The current non-CJK word-break algorithm assumes that a word starts
> either with a letter, a number or an underscore (correct me if wrong,
> please). Not sure why the underscore, but anyway in the
> libunistring-based parser I also included any symbol as a valid word
> starter character. This actually means that lots of new words are being
> considered, especially if parsing source code (like '+', '-' and such).
> Probably symbols should be removed from the list of valid word starter
> characters, so suggestions welcome.
> 

This now applies to both the libunistring-based and libicu-based
parsers.
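
To make the rule in issue 1 concrete, here is a minimal Python sketch of
a "valid word starter" check. It is an approximation based on Unicode
general categories, not the actual C code of any of the parsers; the
function name and the allow_symbols switch are made up for this example.

```python
import unicodedata


def is_word_start(ch: str, allow_symbols: bool = False) -> bool:
    """Return True if ch may start a word (hypothetical rule).

    Letters (L*), numbers (N*) and the underscore are always accepted.
    With allow_symbols=True, symbols (S*) are accepted as well.
    """
    if ch == "_":
        return True
    major = unicodedata.category(ch)[0]
    if major in ("L", "N"):
        return True
    return allow_symbols and major == "S"


for ch in ("a", "9", "_", "+", "$", " "):
    print(repr(ch), is_word_start(ch), is_word_start(ch, allow_symbols=True))
```

With allow_symbols=False, characters such as '+' no longer start a word,
which would be the effect of removing symbols from the list of valid
word starters. Note that a check based on general categories classifies
'+' as a symbol (Sm) but '-' as punctuation (Pd), so the exact set of
affected characters depends on how "symbol" is defined.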

> 2) UNAC needs NFC input, but the output of UNAC is not NFC, it's the
> unaccented string in NFKD normalization. I avoided an extra
> normalization back to NFC, but not sure how it should go. This applies
> to both non-libunistring and libunistring versions of the parser.

This applies to all three parsers.
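
The normalization question in issue 2 can be illustrated with a small
Python sketch. This is not UNAC; the unaccent function below is a
simplified stand-in that decomposes to NFKD and drops combining marks,
and both helper names are made up for this example. It only shows that
the output of an NFKD-based unaccenting step is not guaranteed to be
NFC.

```python
import unicodedata


def unaccent(nfc_text: str) -> str:
    """Simplified stand-in for an unaccenting step (not UNAC itself).

    Decomposes to NFKD and removes characters with a non-zero
    canonical combining class (accents and similar marks).
    """
    decomposed = unicodedata.normalize("NFKD", nfc_text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


print(unaccent("école"))            # accent removed
hangul = "\ud55c"                   # one precomposed Hangul syllable
stripped = unaccent(hangul)
print(len(stripped), is_nfc(stripped))
print(is_nfc(unicodedata.normalize("NFC", stripped)))
```

For plain Latin words the result happens to be NFC already, but for
text that stays decomposed after the marks are removed (the Hangul
syllable here) an extra normalization back to NFC would be needed if
NFC output is required.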

> 
> 3) libunistring currently finds all word breaks in the whole input
> string in a single function call. This could be improved so that words
> are found one by one, which allows stopping the word-break operation at
> any time. Already asked this in libunistring mailing list and the author
> added it in his TODO list.
> 

This still applies to libunistring. libicu can already search for words
one by one (with UChars).
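
The difference between the two styles in issue 3 can be sketched in
Python as follows. The regular expression is only a stand-in for a real
Unicode word-break algorithm, and the function names are made up; the
point is just that a one-by-one iterator lets the caller stop early,
while an all-at-once call always processes the whole input.

```python
import re
from itertools import islice

# Stand-in for a real word-break algorithm (assumption of this sketch).
WORD_RE = re.compile(r"\w+")


def all_words(text: str) -> list[str]:
    """Find every word in one call; the whole input is always scanned."""
    return WORD_RE.findall(text)


def iter_words(text: str):
    """Yield words one by one; scanning stops when the caller stops."""
    for match in WORD_RE.finditer(text):
        yield match.group()


text = "one two three four"
first_two = list(islice(iter_words(text), 2))
print(first_two, len(all_words(text)))
```

With the iterator form, a caller that only wants a limited number of
words never pays for breaking the rest of the string.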


Comments welcome,

-- 
Aleksander

Attachment: unicode-libraries-report.ods
Description: application/vnd.oasis.opendocument.spreadsheet
