Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

Aleksander Morgado Mon, 26 Apr 2010 02:11:12 -0700

Some small comments here.

> >>
> >>>> I think it makes sense to fix this. Just to be clear, does this mean we
> >>>> don't need Pango in libtracker-fts/tracker-parser.c to determine word
> >>>> breaks for CJK?
> >>>
> >>> Thats not broken so would not recommend trying to "fix" that
> 
> Well, given the details Aleksander demonstrated previously in this 
> thread, word breaking for Chinese symbols is broken and yes that should 
> be fixed.
>


Word breaking is currently broken in the extractors; I don't really know
about the parser (right now word breaking is being done twice). The
word-break examples I gave earlier in this thread were produced with the
algorithm used in the extractors.

In the parser, I saw that pango is used for word breaking for CJK
(pango_next()), and a custom word breaker otherwise (tracker_next()).
The custom word breaker doesn't seem to be based on any Unicode
word-breaking rule, so it will probably fail in lots of corner cases
that a Unicode-standard-based algorithm would handle. The pango word
breaking, on the other hand, really does seem to be
Unicode-standard-based, and so is the one in GNU libunistring.
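
To make the dispatch described above concrete, here is a rough Python
sketch of the idea. It is not the actual tracker-parser.c code: the
code point ranges, the function names and the "unicode"/"custom" labels
are assumptions chosen only for illustration.

```python
# Hypothetical sketch of "Unicode-based breaker for CJK, custom otherwise".
# The ranges below are an assumption, not the ones Tracker really checks:
# Hiragana + Katakana, CJK Unified Ideographs, Hangul Syllables.
CJK_RANGES = (
    (0x3040, 0x30FF),
    (0x4E00, 0x9FFF),
    (0xAC00, 0xD7A3),
)


def contains_cjk(text):
    """Return True if any character falls inside one of the assumed ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def choose_breaker(text):
    """Pick which word breaker would handle this text in the sketch."""
    return "unicode" if contains_cjk(text) else "custom"


print(choose_breaker("hello world"))
print(choose_breaker("中文 text"))
```

With a dispatch like this, text without CJK characters never reaches
the Unicode-standard-based path, whatever other special cases it
contains.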

What I don't quite see right now is why the custom word-breaking
algorithm should be used when there are no CJK characters. CJK is a
special case, but there are lots of other non-CJK special cases that
should also be considered...
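
As an illustration of such non-CJK corner cases, here is a small Python
sketch. The naive_words() splitter is hypothetical (it is not
tracker_next()); it just breaks on every character that is not
alphanumeric. Under the Unicode word-breaking rules (UAX #29), an
apostrophe between letters and a combining mark after a letter do not
cause a word break, but the naive splitter breaks in both places.

```python
import unicodedata


def naive_words(text):
    """Hypothetical naive breaker: a word is a run of alphanumeric chars."""
    words, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            # Any non-alphanumeric character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Apostrophe inside a word: split in two by the naive breaker.
print(naive_words("can't"))

# Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301s"
print(naive_words(decomposed))

# After NFC normalization the accent is part of a single letter.
print(naive_words(unicodedata.normalize("NFC", decomposed)))
```

The last two calls also show how the result of a non-standard breaker
can depend on whether the text was normalized first.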

As Jamie said, the pango word breaking is quite slow compared to the
custom one... but the custom one does it wrong compared to a proper
Unicode-standard-based word breaking like the one in pango. Maybe it's
worth using the correct method even if it is slower...


> I think it is silly to use 2 different libraries to do the same thing 
> and if one does things better than another...
> 

Right now I can't say whether libunistring will be faster than pango
for proper Unicode-based word breaking. I would need to look at that.

Cheers,
-- 
Aleksander




_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
