> > I think it is silly to use 2 different libraries to do the same thing
> > and if one does things better than another...
>
> It's way too slow to use CJK breaking on non-CJK text - really the parser
> checks the language before using the appropriate algorithm. The
> extractor lacks the intelligence to do it efficiently
>
It's probably wrong to just assume CJK-word-breaking and
non-CJK-word-breaking. What if the input string has mixed CJK and Latin
characters?

> > >>> IMHO, The tracker_text_normalize() in the extractor should just do UTF-8
> > >>> validation. It should not attempt word breaking as that's CPU expensive
> > >>> and being done by the parser already
> >
> > Well, extraction already is pretty expensive. I see your point there but
> > also, it doesn't make sense to send n bytes over D-Bus that won't be
> > used either. So really it is the lesser of two evils. Currently we do
> > push a lot of data over D-Bus.
>
> sure it's a trade-off
>
> I just think word limits should be estimated or ignored in the
> extractors (we have a byte limit as well as a word limit in any event)
>

Regarding the word-break in the extraction, it was agreed not to do it
and apply just a max-bytes limit in the extractors:
https://bugzilla.gnome.org/show_bug.cgi?id=616845

Cheers!

--
Aleksander
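
P.S. As a rough illustration of the max-bytes approach mentioned above,
here is a minimal sketch in plain C. It is not Tracker's code:
truncate_utf8_bytes() is a hypothetical helper, and it assumes the input
is already valid UTF-8. It does no word breaking at all; it only trims
the byte count back so that a multi-byte character is not cut in half.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Tracker or GLib.
 *
 * Returns how many bytes of s (len bytes long, assumed to be valid
 * UTF-8) can be kept without exceeding max_bytes and without
 * splitting a multi-byte character. */
static size_t
truncate_utf8_bytes (const char *s, size_t len, size_t max_bytes)
{
	size_t n;

	assert (s != NULL);

	/* Already within the limit: keep everything. */
	if (len <= max_bytes)
		return len;

	n = max_bytes;

	/* s[n] is the first byte that would be dropped. If it is a
	 * continuation byte (bit pattern 10xxxxxx), the cut falls inside
	 * a character, so step back to the start of that character. */
	while (n > 0 && (((unsigned char) s[n]) & 0xC0) == 0x80)
		n--;

	return n;
}
```

The extractor would then send only the first n bytes over D-Bus. In
GLib-based code, g_utf8_validate() with its "end" out-parameter could
probably serve the same purpose, since it reports where the valid prefix
of the buffer ends.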
