Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

Aleksander Morgado Mon, 26 Apr 2010 07:03:38 -0700

> > 
> > I think it is silly to use 2 different libraries to do the same thing 
> > and if one does things better than another...
> 
> It's way too slow to use CJK breaking on non-CJK text - really the parser
> checks the language before using the appropriate algorithm. The
> extractor lacks the intelligence to do it efficiently.
>


It's probably wrong to assume a strict split between CJK word breaking
and non-CJK word breaking. What if the input string has mixed CJK and
Latin characters?
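
To make the mixed-input case concrete, here is a minimal Python sketch
that splits a string into CJK and non-CJK runs, so that each run could be
handed to a different word-breaking algorithm. It is an illustration only,
not how the Tracker parser works: the names is_cjk() and split_runs() are
hypothetical, and the code point ranges are a rough, non-exhaustive
assumption.

```python
def is_cjk(ch):
    """Rough check against a few common CJK blocks (assumption: not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def split_runs(text):
    """Split text into a list of (is_cjk, substring) runs.

    Whitespace and punctuation outside the ranges above are treated
    as non-CJK, so they stay attached to the neighbouring Latin run.
    """
    runs = []
    for ch in text:
        flag = is_cjk(ch)
        if runs and runs[-1][0] == flag:
            # Same kind as the previous character: extend the current run.
            runs[-1] = (flag, runs[-1][1] + ch)
        else:
            # Kind changed (or first character): start a new run.
            runs.append((flag, ch))
    return runs


if __name__ == "__main__":
    for cjk, chunk in split_runs("Tracker 索引 files"):
        print("CJK" if cjk else "non-CJK", repr(chunk))
```

A single "is this text CJK?" decision for the whole string would pick one
algorithm for all three runs here, which is the problem raised above.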

> > >>> IMHO, the tracker_text_normalize() in the extractor should just do UTF-8
> > >>> validation. It should not attempt word breaking, as that's CPU-expensive
> > >>> and already being done by the parser.
> > 
> > Well, extraction is already pretty expensive. I see your point there, but
> > it also doesn't make sense to send n bytes over D-Bus that won't be
> > used. So really it is the lesser of two evils. Currently we do
> > push a lot of data over D-Bus.
> 
> Sure, it's a trade-off.
> 
> I just think word limits should be estimated or ignored in the
> extractors (we have a byte limit as well as a word limit in any event)
> 

Regarding word breaking in the extractors, it was agreed not to do it
and to apply just a max-bytes limit instead:
https://bugzilla.gnome.org/show_bug.cgi?id=616845
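
For illustration, a byte limit that does no word breaking still has to
avoid cutting a multibyte UTF-8 sequence in half. The Python sketch below
shows one way to do that. It is not the code from the bug report: the
function names are hypothetical, and it assumes the input is valid UTF-8
and that max_bytes is non-negative.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a UTF-8 sequence.

    Assumes data is valid UTF-8 and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # character, so step back until the cut sits on a character boundary.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    raw = "héllo".encode("utf-8")      # 'é' takes two bytes
    cut = truncate_utf8(raw, 2)        # a naive raw[:2] would split 'é'
    print(cut, is_valid_utf8(cut))
```

The cost is a short backward scan of at most three bytes, which is far
cheaper than running a word breaker over the whole text.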

Cheers!
-- 
Aleksander


