Hi all! I'm currently analyzing the issue reported at GB#579756 (Unicode Normalization is broken in Indexer and/or Search): https://bugzilla.gnome.org/show_bug.cgi?id=579756
All my comments below apply to the contents of nie:plainTextContent. They are not directly related to the bug report, which may still turn out to be an issue in the FTS algorithm.

Normalization:
Shouldn't tracker use a single Unicode normalization form for the list of words stored in nie:plainTextContent? For text search, a decomposed form such as NFD would probably be preferred. This would mean calling g_utf8_normalize() with the G_NORMALIZE_NFD argument for each string to be added to nie:plainTextContent.

Word breaks:
When text content is extracted from several document types (msoffice, oasis, pdf...), a simple word-break algorithm is used, which basically looks for letters. This algorithm is far from perfect, as it doesn't follow the common word-breaking rules in UAX#29:
http://unicode.org/reports/tr29/#Word_Boundaries

As an example, take a file containing the following 3 strings (English first, Chinese second, Japanese katakana last):

"Simple english text\n 本州最主流的风味,使用日本酱油、鸡肉和蔬菜。可隨個人喜好加入油辣和胡椒。 \n ホモ・サピエンス"

With the current algorithm (tracker_text_normalize() in libtracker-extract), only 10 words are found, separated by whitespace as follows:

"Simple english text 本州最主流的风味 使用日本酱油 鸡肉和蔬菜 可隨個人喜 好加入油辣和胡椒 ホモ サピエンス"

With a proper word-break detection algorithm, you would instead find 37 correct words:

"Simple english text 本 州 最 主 流 的 风 味 使 用 日 本 酱 油 鸡 肉 和 蔬 菜 可 隨 個 人 喜 好 加 入 油 辣 和 胡 椒 ホモ サピエンス"

Each Chinese character is considered a separate word, while runs of katakana characters are kept together. This is just an example of how proper word detection should be done.

I already have a custom version of tracker_text_normalize() which does the word-break detection properly, using GNU libunistring. Now, if it is applied, should libunistring be a mandatory dependency for tracker? Another option would probably be using pango, but I doubt pango is a good dependency for libtracker-extract.

Comments welcome...
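To make the two points above concrete, here is a small Python sketch that uses only the standard unicodedata module. It is not the tracker code, not the libunistring-based patch, and not a full UAX#29 implementation: to_nfd(), is_han() and split_words() are hypothetical helper names, and the "one word per Han ideograph" rule is a deliberate simplification of the real word-boundary rules. In tracker itself, the normalization step would be the g_utf8_normalize() call with G_NORMALIZE_NFD mentioned above.

```python
import unicodedata

# The three sample strings from the example above.
SAMPLE = ("Simple english text\n"
          " 本州最主流的风味,使用日本酱油、鸡肉和蔬菜。可隨個人喜好加入油辣和胡椒。 \n"
          " ホモ・サピエンス")


def to_nfd(text):
    """Return text in canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text)


def is_han(ch):
    """Rough stand-in for Script=Han, based on the character name (assumption)."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def split_words(text):
    """Simplified splitter: one word per Han ideograph, runs of letters/digits otherwise."""
    words, current = [], []

    def flush():
        # Emit the word collected so far, if any.
        if current:
            words.append("".join(current))
            current.clear()

    for ch in text:
        major = unicodedata.category(ch)[0]
        if is_han(ch):
            # Each ideograph stands alone as a word.
            flush()
            words.append(ch)
        elif major in ("L", "N") or (major == "M" and current):
            # Letters and digits extend the word; combining marks stay attached,
            # so NFD-decomposed characters do not split a word in two.
            current.append(ch)
        else:
            # Whitespace, punctuation and everything else separate words.
            flush()
    flush()
    return words


words = split_words(to_nfd(SAMPLE))
print(len(words))
print(" ".join(words))
```

On the sample text this sketch produces the 37 words listed above. Note that combining marks have to be treated as part of a word once NFD is applied: the katakana "ピ", for instance, decomposes into a base character plus a combining mark, and a splitter that only looks for letters would break the word at that point.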
--
Aleksander
