Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

Jamie McCracken Mon, 26 Apr 2010 06:47:52 -0700

On Mon, 2010-04-26 at 09:54 +0100, Martyn Russell wrote:
> On 25/04/10 21:59, Jamie McCracken wrote:
> > On Sun, 2010-04-25 at 22:34 +0200, Aleksander Morgado wrote:
> >> Hi Jamie,
> >>
> >>>> I think it makes sense to fix this. Just to be clear, does this mean we
> >>>> don't need Pango in libtracker-fts/tracker-parser.c to determine word
> >>>> breaks for CJK?
> >>>
> >>> That's not broken so would not recommend trying to "fix" that
> 
> Well, given the details Aleksander demonstrated previously in this 
> thread, word breaking for Chinese symbols is broken and yes that should 
> be fixed.


It's not broken in the parser AFAIK - the parser is heavily optimised for
word breaking and works well with CJK (via Pango).

> 
> I think it is silly to use 2 different libraries to do the same thing 
> and if one does things better than another...

It's way too slow to use CJK breaking on non-CJK text - the parser
checks the language first and then uses the appropriate algorithm. The
extractor lacks the intelligence to do this efficiently.
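
To illustrate the kind of check I mean, here is a rough sketch. It is not
the actual tracker-parser.c code: the names (is_cjk_codepoint,
choose_break_algo, break_algo) are made up for this mail, the code point
ranges are only a subset of the CJK blocks, and the decoder is a minimal
one that does not reject overlong forms. The idea is simply to pay for
the expensive breaker only when CJK characters are actually present.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not taken from Tracker. */
typedef enum { BREAK_SIMPLE, BREAK_CJK } break_algo;

/* True for a few common CJK blocks. Deliberately not exhaustive. */
static bool is_cjk_codepoint(uint32_t cp)
{
    return (cp >= 0x3040 && cp <= 0x30FF) ||  /* Hiragana, Katakana */
           (cp >= 0x3400 && cp <= 0x4DBF) ||  /* CJK Extension A */
           (cp >= 0x4E00 && cp <= 0x9FFF) ||  /* CJK Unified Ideographs */
           (cp >= 0xAC00 && cp <= 0xD7AF);    /* Hangul Syllables */
}

/* Minimal UTF-8 decoder. Returns the number of bytes consumed (>= 1).
 * Malformed input yields U+FFFD and consumes one byte. */
static size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 3; }
    else { *cp = 0xFFFD; return 1; }

    if (need >= len) { *cp = 0xFFFD; return 1; }  /* truncated sequence */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need + 1;
}

/* Pick the cheap breaker unless a CJK code point is found.
 * Stops scanning at the first CJK character. */
static break_algo choose_break_algo(const char *text, size_t len)
{
    const unsigned char *s = (const unsigned char *) text;
    size_t pos = 0;

    while (pos < len) {
        uint32_t cp;
        pos += decode_utf8(s + pos, len - pos, &cp);
        if (is_cjk_codepoint(cp))
            return BREAK_CJK;
    }
    return BREAK_SIMPLE;
}
```

Pure non-CJK text never reaches the slow path here, which is the point.
Whether the check is done per document or per word is a separate design
choice that this sketch does not settle.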



> 
> >>> IMHO, the tracker_text_normalize() in the extractor should just do UTF-8
> >>> validation. It should not attempt word breaking as that's CPU expensive
> >>> and being done by the parser already
> 
> Well, extraction already is pretty expensive. I see your point there but 
> also, it doesn't make sense to send n bytes over d-bus that won't be 
> used either. So really it is the lesser of two evils. Currently we do 
> push a lot of data over d-bus.

Sure, it's a trade-off.

I just think word limits should be estimated or ignored in the
extractors (we have a byte limit as well as a word limit in any event).
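
As a rough idea of what "estimated" could look like, here is a sketch.
The function name and parameters are hypothetical, not existing Tracker
API. It counts words by ASCII whitespace transitions only, with no real
word breaking, and stops at whichever of the word limit or the byte
limit is hit first.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of `text` to keep.
 * Words are estimated as runs of non-whitespace bytes, which is cheap
 * but only approximate. */
static size_t truncate_by_estimate(const char *text, size_t len,
                                   size_t max_bytes, size_t max_words)
{
    size_t limit = len < max_bytes ? len : max_bytes;
    size_t words = 0;
    size_t i;
    int in_word = 0;

    for (i = 0; i < limit; i++) {
        unsigned char c = (unsigned char) text[i];
        int is_space = (c == ' ' || c == '\t' || c == '\n' ||
                        c == '\r' || c == '\f' || c == '\v');

        if (is_space) {
            in_word = 0;
        } else if (!in_word) {
            if (words == max_words)
                return i;          /* the next word would exceed the limit */
            in_word = 1;
            words++;
        }
    }

    /* If the byte limit cut the text, back up so that a multi-byte
     * UTF-8 sequence is not split in the middle. */
    if (i < len) {
        while (i > 0 && ((unsigned char) text[i] & 0xC0) == 0x80)
            i--;
    }
    return i;
}
```

Text without spaces (CJK, for instance) counts as a single word in this
estimate, so the byte limit is what bounds it in that case.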


