On Sun, 2010-04-25 at 22:34 +0200, Aleksander Morgado wrote:
> Hi Jamie,
>
> > > I think it makes sense to fix this. Just to be clear, does this mean we
> > > don't need Pango in libtracker-fts/tracker-parser.c to determine word
> > > breaks for CJK?
> >
> > That's not broken, so I would not recommend trying to "fix" it.
> >
> > IMHO, tracker_text_normalize() in the extractor should just do UTF-8
> > validation. It should not attempt word breaking, as that is CPU expensive
> > and is already being done by the parser.
>
> But then how can we limit the extracted text based on the number of
> words?
Well, IMHO it should be limited by bytes in the extractor, not words (as
per 0.6.x) - this is cheap and works well. The parser will apply the word
limits when it breaks/normalizes the text.

So we really just need to guesstimate how many bytes to extract when a
word limit is specified - the extractor does not need to be precise here,
and if you assume an average word size of, say, 20 bytes, you will
probably be OK. If the extractor extracts too many words, the parser will
still limit the text to the precise number of words, so no harm is done.

Of course, others may have other ideas, but it does sound daft to me to
word-break everything twice.

jamie

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
