Hi Martyn,

> > I added a new option in configure to be able to select the desired
> > unicode support library:
> >   --with-unicode-support=[glib|libunistring|libicu]
> > Currently it defaults to 'glib' if not specified.
>
> Makes sense.
>
> Allowing build time configuration of libglib/libicu/libunistring seems
> fine given the circumstances with the licensing. I would make it
> optional using libunistring, then libicu, then glib as a fallback (using
> automatic detection in configure).

Done. New order is libunistring/libicu/glib.

> > Also developed a tester which uses the parser in libtracker-fts,
> > available in tests/libtracker-fts/tracker-parser-test.c
> > Once compiled, you can use --file to specify the file to parse.
>
> Perfect. Testing is quite important here. If we have a test case that is
> automated, you get extra marks :) since we run that before each release
> to sanity check our code base.

Yeah, will add some automated unit tests, at least checking the outputs
of the parsing.

> > Attached is a short spreadsheet with some numbers I got using my set of
> > test files. I measured three different things:
> >  * The time it takes for each parser to parse each file.
> >  * The number of words obtained with each parser in each file.
> >  * The contents of the output words.
> >
> > All the result files are available at:
> > http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/
>
> I think we should put the results in docs/ so people can see why we have
> decided to use these libraries in that order and what tests have been
> done.

You mean in the wiki, right? Will prepare some text.

> > 6) More situations where glib(custom/pango) parser doesn't work
> > properly:
> >  * When input string is decomposed (NFD) (as with the "école issue" in
> >    the testcaseNFD.txt file in the tests)
> >  * Special case-folding cases (as with the "groß/gross issue" in the
> >    gross-1.txt file in the tests)
> > Both libunistring and libicu behave perfectly in the previous cases.
>
> These cases are really what we need to fix.

Those are already fixed by the libunistring/libicu implementations; no
further changes are needed.

> > > Pending issues
> > > ----------------------------------
> > > 1) The current non-CJK word-break algorithm assumes that a word starts
> > > either with a letter, a number or an underscore (correct me if wrong,
> > > please).
> > > Not sure why the underscore, but anyway in the
> > > libunistring-based parser I also included any symbol as a valid word
> > > starter character. This actually means that lots of new words are being
> > > considered, especially if parsing source code (like '+', '-' and such).
> > > Probably symbols should be removed from the list of valid word starter
> > > characters, so suggestions welcome.
>
> Jamie mentions this is for functions in source code. Personally, I
> wouldn't mind ignoring those. They are usually private functions and
> less interesting. As for numbers, I am sitting on the fence with that
> one. It is quite hard to predict useful numbers without context. Mikael
> will have an opinion here I would think.

That's exactly right. Without the proper context, it's very difficult to
see if numbers are really useful information or not. Phone numbers are a
clear case of useful info we shouldn't be filtering, I guess.

Cheers!

-- 
Aleksander
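
For readers following the thread, the NFD and case-folding cases and the
word-start question discussed above can be illustrated with a minimal
sketch. This is not Tracker code: it uses Python's standard unicodedata
module instead of libunistring or libicu, the helper names normalize_word
and is_word_start are hypothetical, and treating "symbol" as the Unicode
general categories starting with "S" is an assumption made only for this
example.

```python
import unicodedata


def normalize_word(word: str) -> str:
    """Return a comparison key: compose to NFC, then apply full case folding."""
    composed = unicodedata.normalize("NFC", word)
    # Case folding may expand characters (e.g. 'ß' becomes 'ss'),
    # so normalize once more after folding.
    return unicodedata.normalize("NFC", composed.casefold())


def is_word_start(ch: str, allow_symbols: bool = False) -> bool:
    """Letters, numbers and underscore may start a word; symbols only on request."""
    if ch == "_":
        return True
    major = unicodedata.category(ch)[0]  # first letter of the general category
    if major in ("L", "N"):
        return True
    # Assumption: "symbol" means general category S* (Sm, Sc, Sk, So).
    return allow_symbols and major == "S"


decomposed = "e\u0301cole"   # 'e' followed by COMBINING ACUTE ACCENT (NFD)
precomposed = "\u00e9cole"   # 'é' as a single code point (NFC)

# The raw strings differ, but their normalized keys match.
assert decomposed != precomposed
assert normalize_word(decomposed) == normalize_word(precomposed)

# Full case folding maps 'ß' to 'ss', so "groß" and "GROSS" compare equal.
assert normalize_word("gro\u00df") == normalize_word("GROSS")

# '+' starts a word only when symbols are allowed.
assert not is_word_start("+")
assert is_word_start("+", allow_symbols=True)
```

With allow_symbols left at its default, operators found in source code
are not treated as word starters, while digits still are, so the
phone-number case mentioned above is unaffected.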
