Hi Jamie,

> A few comments I would ask (I have only looked at your unicode parsers
> in the libtracker-fts directory in your branch, so apologies if my
> assumptions are wrong):
>
> 1) I assume the glib parser in your benchmarks is the tracker parser
> unmodified?
Yes, didn't touch it.

> 2) Tracker parser ignores words that start with numbers or odd
> characters (only a..z/A..Z or underscore is allowed for the first
> character, the latter so that C function names get indexed). This
> keeps a lot of useless junk out of the FTS index and will almost
> certainly account for the discrepancies in word counts (including
> using Pango) in your benchmarks?
>
> (I see from your comments you allow words beginning with numbers in
> your unicode implementations)

Yes and no, I would say. I enabled number-only words with this bug in mind:

https://bugzilla.gnome.org/show_bug.cgi?id=503366

Of course, that could be made configurable if needed.

Some of the discrepancies in the word counts will probably come from this allowance of words starting with numbers, and some from allowing any symbol as a word starter. But there are still some wrong word breaks when the input text comes decomposed in NFD form. The glib-based parser would need to be modified so that NFC normalization is done as soon as the string is set in the parser, but that is quite difficult given that the start/end offsets of the original words need to be preserved for the offsets() and snippet() FTS methods. I didn't find any normalization method that keeps track of the original offsets.

> 3) UNAC benchmarking would also make sense, as it converts to UTF-16
> to perform accent stripping. Of course, if word breaking is faster in
> UTF-16 then it might give your unicode libs some advantage in the
> benchmarks?

Well, the libunistring-based parser uses exactly the same unaccenting method as the glib-based parser, as both have UTF-8 as input and output. UNAC processing will probably be faster with libicu, as in that case the UChars passed as input to the unaccenting method are already in UTF-16, so only a conversion to UTF-16BE (libunac always expects big-endian UTF-16) is needed. So, for the benchmarking, UNAC is not really an issue, I would say.
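One way around the offsets problem would be to normalize each extracted word on its own instead of normalizing the whole input string. A minimal sketch in Python (the function name is made up, and splitting on spaces stands in for the real word-break rules):

```python
import unicodedata

def parse_words_preserving_offsets(text):
    """Hypothetical sketch: record byte offsets into the ORIGINAL
    (possibly NFD) string, then NFC-normalize each word individually,
    so the stored offsets stay valid for FTS offsets()/snippet().
    Splitting on spaces is a stand-in for real word-break logic."""
    words = []
    byte_pos = 0
    for chunk in text.split(' '):
        nbytes = len(chunk.encode('utf-8'))
        if chunk:
            words.append((byte_pos, byte_pos + nbytes,
                          unicodedata.normalize('NFC', chunk)))
        byte_pos += nbytes + 1  # +1 for the space separator
    return words

# "café" in NFD form: 'e' followed by a combining acute accent (6 bytes).
decomposed = 'cafe\u0301 menu'
print(parse_words_preserving_offsets(decomposed))
```

The word text handed to the index is NFC, while each (start, end) pair still indexes into the untouched original buffer, which is exactly the property a whole-string normalization pass would lose.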
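The endianness point can be illustrated with a toy conversion (Python here for brevity; the real code would hand ICU's UChar buffer to libunac): whatever the host byte order, the bytes given to the unaccenting step are forced into big-endian UTF-16 with no BOM.

```python
def to_utf16be(text):
    # Force big-endian UTF-16 (no BOM), regardless of host byte order,
    # which is the form the accent-stripping step is assumed to expect.
    return text.encode('utf-16-be')

print(to_utf16be('\u00e9').hex())  # U+00E9 'é' -> '00e9'
```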
> 4) I personally feel that whatever parser we use, it should perform
> optimally for ASCII, as it's more prevalent in source code and
> indexing source code is really CPU intensive. We could of course use
> a unicode lib for the non-ASCII stuff. I note you include some ASCII
> checking in your unicode stuff, but it's not used for word breaking
> but for UNAC eligibility, and it causes an additional iteration over
> the characters in the word (the tracker one tests for ASCII whilst
> doing the word-breaking iteration).

Yes, you are right: the extra ASCII check was not needed in the original version, but it was really needed to improve the performance of the unicode-based parsers in the cases where UNAC stripping is not needed.

Apart from that, the performance difference between the glib-parser tests and the unicode-based parsers is really not comparable: if all of them processed the same number of words, it seems that both the libunistring-based and the libicu-based parsers would behave better even for ASCII, and all the normalization and case-folding issues would be solved, with a single implementation for any kind of input string (even mixed CJK and non-CJK).

Cheers!

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
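The single-pass idea described in point 4 (deciding ASCII-ness during the same iteration that finds the word boundary, so no second pass is needed for UNAC eligibility) could be sketched like this; the function name and the isalnum()-based boundary test are illustrative stand-ins, not the real tracker logic:

```python
def next_word(text, start):
    """Return (word, end_index, is_ascii) in one pass, or None at end.

    Hypothetical sketch: the ASCII flag is computed during the same
    character iteration that finds the word boundary, avoiding an
    extra walk over the word just to decide UNAC eligibility."""
    n = len(text)
    i = start
    while i < n and not text[i].isalnum():  # skip separators
        i += 1
    if i >= n:
        return None
    word_start = i
    is_ascii = True
    while i < n and text[i].isalnum():
        if ord(text[i]) > 127:
            is_ascii = False                # checked in the same loop
        i += 1
    return (text[word_start:i], i, is_ascii)

print(next_word('foo café', 0))  # ('foo', 3, True)
```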
