Hi Martyn,

> 
> > I added a new option in configure to be able to select the desired
> > unicode support library:
> > --with-unicode-support=[glib|libunistring|libicu]
> > Currently it defaults to 'glib' if not specified.
> 
> Makes sense.
> 
> Allowing build time configuration of libglib/libicu/libunistring seems
> fine given the circumstances with the licensing. I would make it
> optional using libunistring, then libicu, then glib as a fallback (using
> automatic detection in configure).

Done. The new detection order is libunistring, then libicu, then glib.

> 
> > Also developed a tester which uses the parser in libtracker-fts,
> > available in tests/libtracker-fts/tracker-parser-test.c
> > Once compiled, you can use --file to specify the file to parse.
> 
> Perfect. Testing is quite important here. If we have a test case that is
> automated, you get extra marks :) since we run that before each release
> to sanity check our code base.
> 

Yes, I will add some automated unit tests, at least to check the
parser output.

> > Attached is a short spreadsheet with some numbers I got using my set of
> > test files. I measured three different things:
> >  * The time it takes for each parser to parse each file.
> >  * The number of words obtained with each parser in each file.
> >  * The contents of the output words.
> > 
> > All the result files are available at:
> > http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/
> 
> I think we should put the results in docs/ so people can see why we have
> decided to use these libraries in that order and what tests have been
> done.
> 

You mean in the wiki, right? Will prepare some text.

> > 6) More situations where glib(custom/pango) parser doesn't work
> > properly:
> >  * When input string is decomposed (NFD) (as with the "école issue" in
> > the testcaseNFD.txt file in the tests)
> >  * Special case-folding cases (as with the "groß/gross issue" in the
> > gross-1.txt file in the tests)
> > Both libunistring and libicu behave perfectly in the previous cases.
> 
> These cases are really what we need to fix.
> 

Both cases are already handled correctly by the libunistring and libicu
implementations, with no further changes needed.
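
To illustrate the two cases, here is a minimal sketch in Python, using
only the standard library. It is not the Tracker parser and does not use
the libunistring or libicu APIs; the helper name normalize_word is
hypothetical. It assumes that composing to NFC and then applying full
case folding is enough to make the "école" and "groß/gross" inputs
compare equal.

```python
import unicodedata


def normalize_word(word):
    """Hypothetical helper: compose to NFC, then apply full case folding."""
    return unicodedata.normalize("NFC", word).casefold()


decomposed = "e\u0301cole"  # 'e' followed by COMBINING ACUTE ACCENT (NFD form)
composed = "\u00e9cole"     # precomposed 'é' (NFC form)

# The raw strings differ, although they render identically.
print(decomposed == composed)                                  # False
# After normalization both forms map to the same word.
print(normalize_word(decomposed) == normalize_word(composed))  # True
# Full case folding expands 'ß' to 'ss', so both spellings match.
print(normalize_word("Gro\u00df") == normalize_word("GROSS"))  # True
```

A parser that only lowercases character by character, without
normalizing first, would treat each of these pairs as different words.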


> > > Pending issues
> > > ----------------------------------
> > > 1) The current non-CJK word-break algorithm assumes that a word starts
> > > either with a letter, a number or an underscore (correct me if wrong,
> > > please). Not sure why the underscore, but anyway in the
> > > libunistring-based parser I also included any symbol as a valid word
> > > starter character. This actually means that lots of new words are being
> > > considered, especially if parsing source code (like '+', '-' and such).
> > > Probably symbols should be removed from the list of valid word starter
> > > characters, so suggestions welcome.
> 
> Jamie mentions this is for functions in source code. Personally, I
> wouldn't mind ignoring those. They are usually private functions and
> less interesting. As for numbers, I am sitting on the fence with that
> one. It is quite hard to predict useful numbers without context. Mikael
> will have an opinion here I would think.
> 

That's exactly right. Without the proper context, it's very difficult to
tell whether numbers are useful information or not. Phone numbers are a
clear case of useful info we shouldn't be filtering, I guess.
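
The effect of the word-starter rule can be sketched as follows. This is
an illustration in Python, not the libunistring-based parser: the names
is_word_start and filter_words are hypothetical, whitespace splitting
stands in for the real word-break algorithm, and "symbol" is assumed to
mean the Unicode general categories starting with S.

```python
import unicodedata


def is_word_start(ch, allow_symbols=False):
    """Hypothetical predicate: may this character begin a word?"""
    category = unicodedata.category(ch)
    # Letters (L*), numbers (N*) and the underscore always qualify.
    if category[0] in ("L", "N") or ch == "_":
        return True
    # Symbols (S*, e.g. '+' and '=') qualify only when explicitly allowed.
    return allow_symbols and category[0] == "S"


def filter_words(text, allow_symbols=False):
    """Keep whitespace-separated tokens whose first character is a valid starter."""
    return [tok for tok in text.split() if is_word_start(tok[0], allow_symbols)]


line = "x + y_1 = _tmp"
print(filter_words(line))                      # ['x', 'y_1', '_tmp']
print(filter_words(line, allow_symbols=True))  # ['x', '+', 'y_1', '=', '_tmp']
print(filter_words("call 555 0100"))           # ['call', '555', '0100']
```

With symbols allowed, operators in source code become extra words; with
numbers allowed, digit-only tokens such as the parts of a phone number
are kept.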


Cheers!

-- 
Aleksander
