> > I really wouldn't split between non-CJK and CJK, if the performance of
> > ASCII is comparable using libunistring/libicu (which it seems it is).
>
> We can't be sure of that until you add the extra word discrimination to
> your Unicode versions so that the output of all is equal (barring bugs with
> normalization!). Also try benchmarking with removal of the precheck for
> encoding from tracker, as it's very likely we will ditch Pango, and by
> doing so we could be much more dynamic in how we deal with words. I
> would be very surprised if those Unicode libs could match tracker on
> straight ASCII without the precheck!
Oh, wait, but the current glib/pango parser doesn't split between ASCII and non-ASCII; it splits between CJK and non-CJK. I agree that some ASCII-only improvements could be really useful in our case, but ASCII-only, not non-CJK.

The initial NFC normalization fix for the glib/pango parser is really not trivial if we need to keep the byte offsets into the original string, but I will try to think about it. And then there remains the issue of case-folding, which shouldn't be done unichar by unichar on non-ASCII text (including the Latin encodings). Thus, a real comparison of all cases between the three parsers would need time. But I just did an ASCII-only comparison, as all three parsers return the same output in this case.

> > The best thing about the libunistring/libicu based parsers is really that
> > there is a single algorithm for any string, whatever characters it has,
> > and maintaining such an algorithm should be trivial compared to the
> > glib/pango case.
> >
> > Also, the split algorithm for non-CJK and CJK would again be faulty for
> > documents with strings in both English and Chinese, for example. Probably
> > not the case on my computer or yours, but very likely on a Japanese or
> > Chinese user's computer.
> >
> > Anyway, tomorrow I will spend some time doing additional tests for the
> > ASCII-only case, and will try to compare the three parsers in this
> > specific situation.
>
> Great, look forward to it!

Using a 50k lorem-ipsum file, plain ASCII with whitespace separators and other punctuation marks, and with the g_prints I had before removed, I got the following results (averages of several runs):

* libicu --> 0.140 seconds
* libunistring --> 0.136 seconds
* glib (custom) --> 0.135 seconds

With a 200k lorem-ipsum file:

* libicu --> 0.384 seconds
* libunistring --> 0.358 seconds
* glib (custom) --> 0.345 seconds

So for the ASCII-7-only case, the custom algorithm performs a little bit better.
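The ASCII-only fast path mentioned above could look something like the sketch below: detect whether the input is pure 7-bit ASCII, and if so, skip Unicode normalization and case-folding entirely and just tolower() each byte. This is only an illustration of the idea, not tracker's actual code; the helper names (`is_ascii7`, `ascii_casefold`) are mine.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical precheck: returns 1 if the buffer contains only 7-bit
 * ASCII bytes, in which case full Unicode normalization and
 * case-folding can be skipped. */
static int
is_ascii7 (const char *str, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if ((unsigned char) str[i] & 0x80)
      return 0;
  return 1;
}

/* ASCII-only fast path: lowercase with tolower(). Byte offsets into
 * the original string are preserved, because no character ever changes
 * length (unlike full Unicode case-folding, where e.g. U+00DF folds
 * to "ss"). */
static char *
ascii_casefold (const char *str, size_t len)
{
  char *out = malloc (len + 1);
  if (!out)
    return NULL;
  for (size_t i = 0; i < len; i++)
    out[i] = (char) tolower ((unsigned char) str[i]);
  out[len] = '\0';
  return out;
}
```

Note that the precheck itself costs one pass over the input, which is why it only pays off when ASCII-only text is common.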
I will modify the libunistring and libicu based algorithms tomorrow so that if the string is ASCII-7 only, normalization and case-folding are not done, just a tolower() of each character. That should bring their numbers closer to those of the glib/custom parser. But again, this would be an improvement for the "ASCII-only" case (equivalent to not doing UNAC stripping for ASCII), not for "non-CJK", as any other Latin encoding needs proper normalization and case-folding.

More tomorrow :-)

Cheers!

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
