On Tue, 2010-05-04 at 20:11 +0200, Aleksander Morgado wrote:
> Hi all again,
Hi,

> I've been playing with substituting the two word break algorithms in
> libtracker-fts (custom for non-CJK and Pango-based for CJK) with a
> single one using GNU libunistring (LGPLv3). Note that libicu (ICU
> license) is also probably a good choice instead of libunistring.
>
> http://www.gnu.org/software/libunistring
> http://site.icu-project.org

I have been following this thread closely. I am quite pleased with the
work done here and I would like to get it into master at some point.

We are planning a code camp in Helsinki, hopefully in the next month or
so. If that goes to plan, this is likely to get discussed in more detail
there (if it isn't already committed). I would like this thread and
branch to get more comments and review before we go ahead. Generally, I
am all for it, but more eyes won't hurt :)

My general thoughts are that we should use libunistring if we can, but I
don't think that's going to happen for MeeGo due to licensing issues
(LGPLv3 is not favourable there). However, libicu seems the next best
thing for now.

> You can get the changes from the 'parser-unicode-libs-review' branch in
> GNOME git.

I will take time to review this soon; right now there are a bunch of
other things to take care of first. I am guessing the code changes are
all in a place which doesn't change that often anyway (libtracker-fts?),
though I haven't looked yet.

> I added a new option in configure to be able to select the desired
> Unicode support library:
>
>   --with-unicode-support=[glib|libunistring|libicu]
>
> Currently it defaults to 'glib' if not specified.

Makes sense. Allowing build-time selection of glib/libicu/libunistring
seems fine given the licensing circumstances. I would make it automatic:
try libunistring first, then libicu, then fall back to glib (using
automatic detection in configure).
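Something along these lines in configure.ac, perhaps (a rough sketch
only; the macro names and probes here are illustrative, not what the
branch actually does):

```m4
dnl Sketch: honour an explicit --with-unicode-support choice, otherwise
dnl probe libunistring, then libicu, then fall back to glib.
AC_ARG_WITH([unicode-support],
            [AS_HELP_STRING([--with-unicode-support],
                            [glib|libunistring|libicu (default: auto)])],
            [], [with_unicode_support=auto])

AS_IF([test "x$with_unicode_support" = "xauto"],
      [AC_CHECK_LIB([unistring], [u8_wordbreaks],
                    [with_unicode_support=libunistring],
                    [PKG_CHECK_EXISTS([icu-uc],
                                      [with_unicode_support=libicu],
                                      [with_unicode_support=glib])])])
```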
> Also developed a tester which uses the parser in libtracker-fts,
> available in tests/libtracker-fts/tracker-parser-test.c. Once
> compiled, you can use --file to specify the file to parse.

Perfect. Testing is quite important here. If we have a test case that is
automated, you get extra marks :) since we run those before each release
to sanity check our code base.

> I did several tests using the new tester, which seem to be more
> accurate than the first tests I did last week, as in these new ones
> the results only depend on the parser implementation, and not on the
> miner-fs for example.

Great.

> Attached is a short spreadsheet with some numbers I got using my set
> of test files. I measured three different things:
>  * The time it takes for each parser to parse each file.
>  * The number of words obtained with each parser in each file.
>  * The contents of the output words.
>
> All the result files are available at:
> http://www.lanedo.com/~aleksander/gnome-tracker/tracker-parser-unicode-libraries/

I think we should put the results in docs/ so people can see why we
decided to use these libraries in that order and what tests have been
done.

> Some conclusions from the tests:
>
> 1) Both the libunistring- and libicu-based parsers have exactly the
> same output in all tests: same number of words, same word contents.

That's good to see too.

> 2) The number of words detected by the glib (custom/Pango) parser and
> their contents are usually completely different from those detected
> by the others:
>  * In a Chinese-only file, for example, while libunistring/libicu
>    both detect 1202 words, the glib (custom/Pango) parser detects
>    only 188.
>  * In a file with mixed languages, glib (custom/Pango) detects 22105
>    words while the others detect 33472 words.

It is really amazing to see glib isn't picking up word breaks properly
for some Asian languages. For me, this is more reason to shift away from
glib where possible.
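For anyone wondering why the counts diverge so wildly on Chinese: CJK
text carries no spaces, so any scanner that looks for whitespace or
letter-run boundaries sees a whole run as one "word", while a UAX #29
segmenter (what libunistring/libicu implement) places break
opportunities between ideographs. A quick illustration in Python, just
for brevity (the parsers themselves are C and the sample string is made
up):

```python
import unicodedata

# Naive whitespace-based scan: the whole Chinese run is a single token.
naive_tokens = "全文检索很有用".split()
print(len(naive_tokens))  # -> 1

# A UAX #29 word-break pass can instead break between ideographs, so
# every character is a potential word start -- hence the much higher
# word counts libunistring/libicu report on Chinese-only files.
ideographs = [c for c in "全文检索很有用"
              if unicodedata.name(c).startswith("CJK UNIFIED IDEOGRAPH")]
print(len(ideographs))  # -> 7
```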
> 3) GNU libunistring seems to be around 9%-10% faster than libicu
> (probably because of the conversions to/from UChars, which are UTF-16
> encoded strings; libunistring's API can work directly with UTF-8).
> This comparison is very realistic considering that both parsers have
> exactly the same output results.
>
> 4) The glib (custom/Pango) time results are almost all better than
> the ones from libunistring/libicu. This is not surprising, as the
> number of words detected by the glib parser is much lower. Thus,
> these timing values cannot really be compared.

I would imagine that's a contributing factor, yes.

> 5) Pango-based word break is really slow. In a 180k mixed-language
> file:
>  * libunistring needed 1.01 seconds
>  * libicu needed 1.10 seconds
>  * glib (Pango) needed 22 seconds!

22 seconds is really bad; if we can avoid that, it would be great.

> 6) More situations where the glib (custom/Pango) parser doesn't work
> properly:
>  * When the input string is decomposed (NFD) (as with the "école"
>    issue in the testcaseNFD.txt file in the tests)
>  * Special case-folding cases (as with the "groß/gross" issue in the
>    gross-1.txt file in the tests)
>
> Both libunistring and libicu behave perfectly in the previous cases.

These cases are really what we need to fix.

> Finally, I re-paste the pending issues, as they are still the same:
>
> > Pending issues
> > ----------------------------------
> > 1) The current non-CJK word-break algorithm assumes that a word
> > starts either with a letter, a number or an underscore (correct me
> > if wrong, please). Not sure why the underscore, but anyway in the
> > libunistring-based parser I also included any symbol as a valid
> > word starter character. This actually means that lots of new words
> > are being considered, especially when parsing source code (like
> > '+', '-' and such). Probably symbols should be removed from the
> > list of valid word starter characters, so suggestions welcome.
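To spell out the two failure cases in point 6 for the archives (Python
used purely as a convenient illustration; the behaviour is defined by
Unicode, not by any one library):

```python
import unicodedata

# NFD issue: precomposed "école" (NFC, U+00E9) and its decomposed form
# ('e' + U+0301 combining acute) look identical to a reader but compare
# unequal byte-wise. A parser that never normalizes indexes them as two
# different terms, so searches miss.
nfc = "école"
nfd = unicodedata.normalize("NFD", nfc)
print(nfc == nfd)                                # -> False
print(unicodedata.normalize("NFC", nfd) == nfc)  # -> True once normalized

# Case-folding issue: simple lowercasing misses full folds like ß -> ss,
# so "groß" can never match "gross". Full case folding (which
# libunistring and libicu both provide) handles it.
print("groß".lower())     # -> 'groß'  (simple lowercase: no match)
print("groß".casefold())  # -> 'gross' (full case folding: matches)
```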
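On pending issue 1, the distinction being debated maps directly onto
Unicode general categories: letters (L*), numbers (N*), underscore (Pc)
versus symbols (S*). A small sketch of how the candidate word-starter
characters classify (again Python for brevity; the sample characters
are my own picks):

```python
import unicodedata

# Which general category does each candidate word-starter fall into?
# Dropping the S* categories would exclude '+' and '€' while keeping
# letters, digits and the underscore.
for ch in ("a", "9", "_", "+", "€"):
    print(ch, unicodedata.category(ch))
# a Ll  (lowercase letter)
# 9 Nd  (decimal number)
# _ Pc  (connector punctuation -- hence the underscore special case)
# + Sm  (math symbol)
# € Sc  (currency symbol)
```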
Jamie mentions this is for functions in source code. Personally, I
wouldn't mind ignoring those; they are usually private functions and
less interesting.

As for numbers, I am sitting on the fence with that one. It is quite
hard to predict useful numbers without context. Mikael will have an
opinion here, I would think. We should consider benchmarking some data
to see if this is really worth it or not; we do get a few bugs or
comments on IRC occasionally asking for FTS matching by numbers or by
criteria starting with numbers.

Great work Aleksander, thanks again!

-- 
Regards,
Martyn

_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list
