On Tue, 2010-05-04 at 23:08 +0200, Aleksander Morgado wrote:

> Anyway, I agree that the fastest and ideal solution would be one that
> does everything needed in a single iteration: NFC normalization,
> word-break detection, proper case-folding (not
> character-per-character!)... even accent stripping and stemming could be
> done if we were to develop such a function (and that would actually
> be a great performance improvement, btw), but that is probably a huge
> amount of work only useful for the Tracker case, and very difficult to
> maintain.

True, but as it's likely to be the most CPU-intensive part of Tracker,
even a small gain will have a significant effect.
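
To make the list of steps concrete, here is a rough Python sketch of
them applied to each word in one function. It is not Tracker code: the
name process_text, the strip_accents flag and the \w+ regex (a crude
stand-in for real Unicode word-break rules) are all assumptions for
illustration, and the steps still run as separate passes per word rather
than being fused into a true single iteration. Stemming is left out.

```python
import re
import unicodedata

# Assumption: \w+ is only a crude stand-in for proper word-break detection.
_WORD_RE = re.compile(r"\w+")


def process_text(text, strip_accents=True):
    """Hypothetical helper: normalize, split, case-fold and unaccent words."""
    words = []
    # NFC first, so a base letter plus combining mark becomes one code point
    # and is not cut off from its word by the regex.
    text = unicodedata.normalize("NFC", text)
    for match in _WORD_RE.finditer(text):
        # Full case folding works on the whole word: "ß" folds to "ss",
        # which a character-per-character lowercase cannot produce.
        word = match.group().casefold()
        if strip_accents:
            # Decompose, drop combining marks, then recompose.
            decomposed = unicodedata.normalize("NFD", word)
            word = "".join(
                c for c in decomposed if not unicodedata.combining(c)
            )
            word = unicodedata.normalize("NFC", word)
        words.append(word)
    return words
```

Calling process_text("Straße CAFÉ") gives the words "strasse" and
"cafe"; with strip_accents=False the second word keeps its accent.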

[snip]
> 
> I really wouldn't split between non-CJK and CJK, if the performance of
> ASCII is comparable using libunistring/libicu (which it seems it is).

We can't be sure of that until you add the extra word discrimination to
your Unicode versions so that the output of all the parsers is equal
(barring bugs with normalization!). Also try benchmarking with the
encoding precheck removed from Tracker, as it's very likely we will
ditch Pango, and by doing so we could be much more dynamic in how we
deal with words. I would be very surprised if those Unicode libs could
match Tracker on straight ASCII without the precheck!
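
For anyone following along, this is roughly what I mean by a precheck,
as a Python sketch rather than the actual Tracker code. The function
names and the two regexes are made up for illustration. The point is
that the cheap ASCII path and the generic Unicode path must return the
same words for ASCII input, otherwise timing them against each other
tells us nothing.

```python
import re
import unicodedata

# Assumption: both regexes are simplified stand-ins for real word breaking.
_ASCII_WORD_RE = re.compile(r"[A-Za-z0-9_]+")
_UNICODE_WORD_RE = re.compile(r"\w+")


def words_ascii(text):
    """Cheap path: byte-wise word split and plain lowercasing."""
    return [w.lower() for w in _ASCII_WORD_RE.findall(text)]


def words_unicode(text):
    """Generic path: NFC normalization plus full case folding."""
    text = unicodedata.normalize("NFC", text)
    return [w.casefold() for w in _UNICODE_WORD_RE.findall(text)]


def words_with_precheck(text):
    """Hypothetical dispatcher: test the encoding once, then pick a path."""
    if text.isascii():  # str.isascii() needs Python 3.7 or later
        return words_ascii(text)
    return words_unicode(text)


# Sanity check to run before any benchmark: equal output on ASCII input.
sample = "Hello tracker_list 2010"
same_output = words_ascii(sample) == words_unicode(sample)
```

Benchmarking words_unicode alone against words_with_precheck on
ASCII-only input is the comparison I am asking for.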


> The best thing about libunistring/libicu based parsers is really that
> there is a single algorithm for any string, whatever characters it
> contains, and maintaining such algorithms should be trivial compared to
> the glib/pango case.

> Also, the split algorithm for non-CJK and CJK would again be faulty for
> documents with strings in both English and Chinese, for example. Probably
> not the case on my computer or yours, but there is a really high chance
> of it on a Japanese or Chinese user's computer.
> 
> Anyway, tomorrow I will spend some time doing additional tests for the
> ASCII-only case, and will try to compare the three parsers in this
> specific situation.

Great, looking forward to it!

thanks

jamie
