On Mon, 2010-04-26 at 09:54 +0100, Martyn Russell wrote:
> On 25/04/10 21:59, Jamie McCracken wrote:
> > On Sun, 2010-04-25 at 22:34 +0200, Aleksander Morgado wrote:
> >> Hi Jamie,
> >>
> >>>> I think it makes sense to fix this. Just to be clear, does this mean we
> >>>> don't need Pango in libtracker-fts/tracker-parser.c to determine word
> >>>> breaks for CJK?
> >>>
> >>> Thats not broken so would not recommend trying to "fix" that
>
> Well, given the details Aleksander demonstrated previously in this
> thread, word breaking for Chinese symbols is broken and yes that should
> be fixed.

It's not broken in the parser AFAIK. The parser is heavily optimised for
word breaking and works well with CJK (via Pango).

>
> I think it is silly to use 2 different libraries to do the same thing
> and if one does things better than another...

It's way too slow to use CJK breaking on non-CJK text. The parser checks
the language before using the appropriate algorithm; the extractor lacks
the intelligence to do that efficiently.

>
> >>> IMHO, The tracker_text_normalize() in the extractor should just do utf8
> >>> validation. It should not attempt word breaking as thats cpu expensive
> >>> and being done by the parser already
>
> Well, extraction already is pretty expensive. I see your point there but
> also, it doesn't make sense to send n bytes over d-bus that won't be
> used either. So really it is the lesser of two evils. Currently we do
> push a lot of data over d-bus.

Sure, it's a trade-off. I just think word limits should be estimated or
ignored in the extractors (we have a byte limit as well as a word limit
in any event).
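
To make the "validate only, cap by bytes" idea for the extractor side
concrete, here is a rough sketch. It is plain C with no GLib dependency.
The names extractor_clamp_utf8() and utf8_seq_len() are hypothetical and
not part of Tracker, and this is not what tracker_text_normalize()
currently does. It ignores the word limit completely and leaves word
breaking to the parser.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tracker function): length in bytes of the
 * well-formed UTF-8 sequence starting at p, or 0 if it is malformed or
 * does not fit in the avail bytes that remain. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char c = p[0];
    unsigned long cp;
    size_t n, i;

    if (c < 0x80)
        return 1;                                   /* plain ASCII */
    else if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; }
    else
        return 0;                                   /* stray continuation or invalid lead byte */

    if (avail < n)
        return 0;                                   /* sequence would cross the limit */

    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (unsigned long) (p[i] & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800) || (n == 4 && cp < 0x10000))
        return 0;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;

    return n;
}

/* Hypothetical extractor-side clamp: returns how many leading bytes of
 * text are valid UTF-8 and fit within max_bytes. No word breaking is
 * attempted; it is a single pass over the bytes. */
size_t extractor_clamp_utf8(const char *text, size_t len, size_t max_bytes)
{
    const unsigned char *p = (const unsigned char *) text;
    size_t limit = len < max_bytes ? len : max_bytes;
    size_t pos = 0;

    assert(text != NULL);

    while (pos < limit) {
        size_t n = utf8_seq_len(p + pos, limit - pos);
        if (n == 0)
            break;                                  /* stop at first invalid or cut-off sequence */
        pos += n;
    }
    return pos;
}
```

The extractor would then send only the first extractor_clamp_utf8(...)
bytes over D-Bus. The byte limit never cuts a multi-byte character in
half, so CJK text stays valid, and the language-aware word breaking is
left to the parser.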
