Some small comments here.

> >>
> >>>> I think it makes sense to fix this. Just to be clear, does this mean we
> >>>> don't need Pango in libtracker-fts/tracker-parser.c to determine word
> >>>> breaks for CJK?
> >>>
> >>> Thats not broken so would not recommend trying to "fix" that
>
> Well, given the details Aleksander demonstrated previously in this
> thread, word breaking for Chinese symbols is broken and yes that should
> be fixed.
>
Word breaking is currently broken in the extractor; I don't really know
about the parser (right now it is being done twice). The word-break
examples I gave earlier in this thread were produced with the algorithm
used in the extractors.

In the parser, I saw that Pango is used for word breaking if the text is
CJK (pango_next()), and a custom word breaker otherwise (tracker_next()).
The custom word breaker doesn't seem to be based on any Unicode rule for
word breaking, so it will probably fail in lots of corner cases that a
Unicode-standard-based breaker would handle. The Pango version, on the
other hand, really does seem to be Unicode-standard-based, and so is GNU
libunistring.

What I don't quite see right now is the point of using the custom
word-breaking algorithm when there are no CJK characters. CJK is a
special case, but there are lots of other non-CJK special cases that
should also be considered...

As Jamie said, the Pango version of word breaking is quite slow compared
to the custom one... but the custom one gets it wrong compared to proper
Unicode-standard-based word breaking like the one in Pango. Maybe it's
worth using the correct method even if it is slower...

> I think it is silly to use 2 different libraries to do the same thing
> and if one does things better than another...
>

Right now, I can't say whether libunistring will be faster than Pango for
proper Unicode-based word breaking. I would need to look into that.

Cheers,

-- 
Aleksander
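
For illustration only, here is a minimal Python sketch of the kind of
corner case discussed above. It is not Tracker's tracker_next() and not
Pango's algorithm; naive_words() and han_aware_words() are hypothetical
names made up for this example. The first treats any run of alphanumeric
characters as one word, so a run of Chinese characters comes out as a
single "word". The second additionally splits at every Han ideograph,
which is only a rough approximation of how Unicode-based segmentation
treats ideographs by default.

```python
import unicodedata


def is_han(ch):
    # Rough test: Python names these code points "CJK UNIFIED IDEOGRAPH-XXXX".
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def naive_words(text):
    """Hypothetical custom breaker: a word is any run of alphanumerics."""
    words, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # A non-alphanumeric character closes the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def han_aware_words(text):
    """Same as naive_words, but each Han ideograph becomes its own word."""
    words, current = [], []
    for ch in text:
        if is_han(ch):
            # Close any pending word, then emit the ideograph on its own.
            if current:
                words.append("".join(current))
                current = []
            words.append(ch)
        elif ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sample = "tracker 中文 test"
print(naive_words(sample))      # ['tracker', '中文', 'test']
print(han_aware_words(sample))  # ['tracker', '中', '文', 'test']
```

A real fix would rely on a library that implements the Unicode rules
(Pango or libunistring, as discussed above) rather than on hand-written
special cases like this one.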
