Just to make it clear: I believe the tracker way of iterating over the
characters is the optimal solution, as it allows us to detect the
encoding, keep track of the length, perform word validation (the
aforementioned parse rules concerning the starting character) and
perform word breaking, all without any further character iterations
(unlike the Unicode implementations).
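
To illustrate the single pass, here's a rough sketch in plain C (not the
actual tracker code; the starting-character rule is simplified, and in
the real parser a non-ASCII word would be handed off to the Unicode path
rather than discarded):

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
  const char *start;
  size_t      len;        /* length in bytes */
  bool        is_ascii;   /* no bytes >= 0x80 seen in this word */
} Word;

static bool
is_word_byte (unsigned char c)
{
  return isalnum (c) || c == '_' || c >= 0x80;
}

/* Scan the next word from *pos; returns false when input is exhausted.
 * A word must start with a letter or '_' (so C identifiers get indexed);
 * tokens with an invalid starter (e.g. "2fast") are discarded whole.
 * NOTE: simplification -- the real parser would send a non-ASCII starter
 * to the Unicode path instead of discarding it. */
static bool
next_word (const char **pos, Word *out)
{
  const char *p = *pos;

  for (;;)
    {
      while (*p && !is_word_byte ((unsigned char) *p))
        p++;                    /* skip separators */
      if (!*p)
        return false;
      if (isalpha ((unsigned char) *p) || *p == '_')
        break;                  /* valid word starter found */
      while (*p && is_word_byte ((unsigned char) *p))
        p++;                    /* discard the invalid token */
    }

  out->start = p;
  out->is_ascii = true;
  while (*p && is_word_byte ((unsigned char) *p))
    {
      if ((unsigned char) *p >= 0x80)
        out->is_ascii = false;  /* the same pass detects non-ASCII */
      p++;
    }
  out->len = (size_t) (p - out->start);
  *pos = p;
  return true;
}
```

In one iteration this validates the starting character, tracks the byte
length and detects whether the word is pure ASCII.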

Therefore, one question I would ask is whether the Unicode libs provide
the ability to check if a single character is a word break?

If so, it could be added to the existing tracker parser with little
modification and overhead (and only used if the word is non-ASCII, of
course!)

Also, the pango way of allocating structs to store attributes for each
character is indeed crazy and insanely slow. Fortunately we only use
that for CJK text, but in any case we could optimize that area with some
of the work you have done.

jamie

On Tue, 2010-05-04 at 15:06 -0400, Jamie McCracken wrote:
> Thanks for your efforts on this - it's very interesting and appreciated
> 
> A few comments and questions (I have only looked at your Unicode
> parsers in the libtracker-fts directory in your branch, so apologies if
> my assumptions are wrong):
> 
> 1) I assume the glib parser in your benchmarks is the tracker parser
> unmodified?
> 
> 2) The tracker parser ignores words that start with numbers or odd
> characters (only a..z/A..Z or underscore is allowed for the first
> character - the latter so that C function names get indexed). This
> keeps a lot of useless junk out of the FTS index and will almost
> certainly account for the discrepancies in word counts (including when
> using pango) in your benchmarks?
> 
> (I see from your comments that you allow words beginning with numbers
> in your Unicode implementations)
> 
> 3) UNAC benchmarking would also make sense, as it converts to UTF-16 to
> perform accent stripping. Of course, if word breaking is faster in
> UTF-16, then it might give your Unicode libs some advantage in the
> benchmarks?
> 
> 4) I personally feel that whatever parser we use, it should perform
> optimally for ASCII, as it's more prevalent in source code and indexing
> source code is really CPU intensive. We could of course use a Unicode
> lib for the non-ASCII stuff. I note you include some ASCII checking in
> your Unicode stuff, but it's not used for word breaking, only for UNAC
> eligibility, and it causes an additional iteration over the characters
> in the word (the tracker one tests for ASCII whilst doing the
> word-breaking iteration).
> 
> jamie
> 
> 
> On Tue, 2010-05-04 at 20:11 +0200, Aleksander Morgado wrote:
> > Hi all again,
> > 
> > > 
> > > I've been playing with substituting the two word break algorithms in
> > > libtracker-fts (custom for non-CJK and pango-based for CJK) with a
> > > single one using GNU libunistring (LGPLv3). Note that libicu (ICU
> > > license) is also probably a good choice instead of libunistring.
> > > http://www.gnu.org/software/libunistring
> > > http://site.icu-project.org
> > > 
> > 
> > I developed the libicu-based parser using its Unicode algorithms for
> > word breaking, normalization and such, as I did for GNU libunistring
> > last week, and ran some tests to compare all three implementations
> > (libunistring-based, libicu-based, glib/pango-based).
> > 
> > You can get the changes from the 'parser-unicode-libs-review' branch in
> > gnome git.
> > 
> > I added a new option in configure to be able to select the desired
> > unicode support library:
> > --with-unicode-support=[glib|libunistring|libicu]
> > Currently it defaults to 'glib' if not specified.
> > 
> > I also developed a tester which uses the parser in libtracker-fts,
> > available in tests/libtracker-fts/tracker-parser-test.c.
> > Once compiled, you can use --file to specify the file to parse.
> > 
> > I ran several tests using the new tester; they should be more
> > accurate than the ones from last week, as these new results depend
> > only on the parser implementation and not, for example, on the
> > miner-fs.
> > 
> > Attached is a short spreadsheet with some numbers I got using my set of
> > test files. I measured three different things:
> >  * The time it takes for each parser to parse each file.
> >  * The number of words obtained with each parser in each file.
> >  * The contents of the output words.
> > 
> > All the result files are available at:
> > http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/
> > 
> > Some conclusions from the tests:
> > 
> > 1) Both libunistring and libicu based parsers have exactly the same
> > output in all tests: same number of words, same word contents.
> > 
> > 2) The number of words detected by the glib(custom/pango) parser and
> > their contents are usually completely different from those detected
> > by the others:
> >  * In a Chinese-only file, for example, while libunistring/libicu
> > both detect 1202 words, the glib(custom/pango) parser detects only
> > 188.
> >  * In a file with mixed languages, glib(custom/pango) detects 22105
> > words while the others detect 33472 words.
> > 
> > 3) GNU libunistring seems to be around 9%-10% faster than libicu,
> > probably because of libicu's conversions to/from UChars, which are
> > UTF-16-encoded strings; libunistring's API can work directly with
> > UTF-8. This comparison is quite fair, considering that both parsers
> > produce exactly the same output.
> > 
> > 4) The glib(custom/pango) time results are almost all better than
> > those from libunistring/libicu. This is not surprising, as the glib
> > parser detects far fewer words; thus, these timing values cannot
> > really be compared.
> > 
> > 5) Pango-based word break is really slow. In a 180k mixed-language file:
> >  * libunistring needed 1.01 seconds
> >  * libicu needed 1.10 seconds
> >  * glib(pango) needed 22 seconds!
> > 
> > 6) More situations where the glib(custom/pango) parser doesn't work
> > properly:
> >  * When the input string is decomposed (NFD) (as with the "école"
> > issue in the testcaseNFD.txt file in the tests)
> >  * Special case-folding cases (as with the "groß/gross" issue in the
> > gross-1.txt file in the tests)
> > Both libunistring and libicu behave correctly in these cases.
> > 
> > Finally, I re-paste the pending issues, as they still are the same:
> > 
> > > 
> > > Pending issues
> > > ----------------------------------
> > > 1) The current non-CJK word-break algorithm assumes that a word
> > > starts with either a letter, a number or an underscore (correct me
> > > if wrong, please). Not sure why the underscore, but anyway in the
> > > libunistring-based parser I also included any symbol as a valid
> > > word-starter character. This actually means that lots of new words
> > > are being considered, especially when parsing source code (like
> > > '+', '-' and such). Probably symbols should be removed from the
> > > list of valid word-starter characters, so suggestions welcome.
> > > 
> > 
> > Now applies to both libunistring and libicu based parsers.
> > 
> > > 2) UNAC needs NFC input, but the output of UNAC is not NFC; it's
> > > the unaccented string in NFKD normalization. I avoided an extra
> > > normalization back to NFC, but I'm not sure how it should go. This
> > > applies to both the non-libunistring and libunistring versions of
> > > the parser.
> > 
> > Applies to all 3 parsers.
> > 
> > > 
> > > 3) libunistring currently finds all word breaks in the whole input
> > > string in a single function call. This could be improved so that
> > > words are found one by one, which would allow stopping the
> > > word-break operation at any time. I already asked about this on the
> > > libunistring mailing list and the author added it to his TODO list.
> > > 
> > 
> > This still applies to libunistring. libicu can already do a
> > one-by-one word search (with UChars).
> > 
> > 
> > Comments welcome,
> > 
> > _______________________________________________
> > tracker-list mailing list
> > [email protected]
> > http://mail.gnome.org/mailman/listinfo/tracker-list
