Hi Jamie,

> word break detection is done in
> http://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-parser.c
>
> This is highly optimised and does checks for Plain ASCII/Latin/CJK
> encodings to determine which word breaking algorithm to use
>
> For CJK we always use pango to word break as this is believed to be
> correct (although too slow to use for non-CJK)
>
> I don't know why tracker_text_normalize() exists or why it's used instead
> of the above but clearly if the tracker-parser one is correct then it
> should be using that one. (the parser also does NFC normalization)

tracker_text_normalize() (in libtracker-extractor) is not actually doing
any Unicode normalization, so sorry for the confusion (the method name is
quite confusing as well). Currently, it's doing these two things:

 * Performing a simple word-break algorithm that only works properly with
   ASCII/Latin encodings. This is used to count the number of words being
   extracted from the document, so that it can be limited to the
   MaxWordsToIndex conf parameter in tracker-fts.cfg.
 * Removing almost all formatting from the incoming text, leaving the
   extracted text as a whitespace-separated list of words.

> Of course I can't understand why normalization needs to be done prior to
> the parsing - surely only utf8 validation needs doing there (re
> normalizing just wastes cpu)

Yes, of course normalizing twice is not a good idea.

Regarding normalization, I just saw that if the original text comes in
decomposed form, the current tracker_text_normalize() actually removes
all combining characters. For example, following the bug report, if the
incoming string has the word "école" in decomposed form:

  "école" (U+0065 U+0301 U+0063 U+006F U+006C U+0065)

the output of tracker_text_normalize() will be that it incorrectly found
2 words:

  "e" (U+0065) and "cole" (U+0063 U+006F U+006C U+0065)

because the U+0301 combining character was taken as a word-break and
substituted by a whitespace.

During extraction it makes sense to limit the incoming text size, either
by counting the number of words coming in (thus, using some algorithm
that makes word-breaks properly) or just by the number of bytes of the
incoming text. Both limits are currently being applied to most text
extractors. Maybe it's just a matter of removing the word-count limit
during extraction, if it's also being applied afterwards?

But then there's the second issue with tracker_text_normalize() removing
all formatting from the input text.
Shouldn't it then avoid that, and just insert the contents as they
originally came in the document? That is, with commas, semicolons,
question marks, newline characters...

Cheers,
-Aleksander
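
P.S. To make the decomposed-input case above concrete, here is a minimal
Python sketch. It is not the tracker code: naive_words() is a hypothetical
stand-in for an ASCII/Latin-only word breaker that turns every
non-alphanumeric character into a space, and mark_aware_words() is an
equally hypothetical variant that keeps combining marks attached to their
base letter. Only the standard unicodedata module is used.

```python
import unicodedata


def naive_words(text):
    """Hypothetical naive breaker: anything that is not a letter or a
    digit is replaced by a space, then the result is split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def mark_aware_words(text):
    """Hypothetical variant: combining marks (Unicode general categories
    Mn, Mc, Me) are kept as part of the current word."""
    cleaned = "".join(
        ch if ch.isalnum() or unicodedata.category(ch).startswith("M") else " "
        for ch in text
    )
    return cleaned.split()


# "école" in decomposed form: U+0065 U+0301 U+0063 U+006F U+006C U+0065
decomposed = "e\u0301cole"

# The combining acute accent is treated as a separator: "e" and "cole".
print(len(naive_words(decomposed)))       # 2

# Composing to NFC first yields U+00E9, which counts as a letter.
composed = unicodedata.normalize("NFC", decomposed)
print(len(naive_words(composed)))         # 1

# Keeping combining marks in the word also avoids the split, and leaves
# the input in its original (decomposed) form.
print(len(mark_aware_words(decomposed)))  # 1
```

Either normalizing before counting or treating combining marks as word
characters avoids the spurious split in this sketch; which one fits the
extractor depends on whether the parser normalizes again afterwards.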
