On 22/04/10 17:34, Aleksander Morgado wrote:
Hi all!
Word breaks:
When text content is extracted from several document types (MS Office, OASIS, PDF...), a simple word-break algorithm is used, basically looking for runs of letters. This algorithm is far from perfect, as it doesn't follow the word-boundary rules in UAX #29:
http://unicode.org/reports/tr29/#Word_Boundaries
As an example, take a file containing the following three strings (English first, Chinese second, Japanese katakana last):
"Simple english text\n
本州最主流的风味,使用日本酱油、鸡肉和蔬菜。可隨個人喜好加入油辣和胡椒。
\n
ホモ・サピエンス"
With the current algorithm (tracker_text_normalize() in
libtracker-extract), only 10 words are found, separated by whitespace in
the following way:
"Simple english text 本州最主流的风味 使用日本酱油 鸡肉和蔬菜 可隨個人喜
好加入油辣和胡椒 ホモ サピエンス"
With a proper word-break detection algorithm, you would instead find 37
correct words:
"Simple english text 本 州 最 主 流 的 风 味 使 用 日 本 酱 油 鸡 肉 和
蔬 菜 可 隨 個 人 喜 好 加 入 油 辣 和 胡 椒 ホモ サピエンス"
Note that each Chinese ideograph is treated as a separate word, while
runs of katakana symbols are not split. This is just one example of how
proper word detection should behave.
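Just to make the intended behaviour concrete, here is a minimal,
purely illustrative sketch (in Python, not the actual C code in
tracker_text_normalize(), and using hand-picked Unicode ranges rather
than the full UAX #29 rules) of the simplified break rules the example
above implies:

```python
import re

# Hypothetical sketch of the word-break behaviour described above:
#  - a run of Latin letters/digits is one word,
#  - each CJK ideograph is a word on its own,
#  - a run of katakana (excluding the U+30FB middle dot) is one word.
WORD_RE = re.compile(
    r"[A-Za-z0-9]+"             # Latin/ASCII word
    r"|[\u4e00-\u9fff]"         # one CJK ideograph = one word
    r"|[\u30a1-\u30fa\u30fc]+"  # katakana run = one word
)

def break_words(text):
    """Return the list of words found in text, punctuation dropped."""
    return WORD_RE.findall(text)

sample = ("Simple english text\n"
          "本州最主流的风味,使用日本酱油、鸡肉和蔬菜。可隨個人喜好加入油辣和胡椒。\n"
          "ホモ・サピエンス")
print(len(break_words(sample)))  # → 37
```

A real implementation should of course delegate to a proper UAX #29
library instead of ad-hoc ranges; if I remember right, libunistring
exposes this as u8_wordbreaks() in <uniwbrk.h>.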
I already have a custom version of tracker_text_normalize() which does
the word-break detection properly, using GNU libunistring. Now, if this
is applied, should libunistring become a mandatory dependency of
Tracker? Another option would probably be Pango, but I doubt Pango is a
good dependency for libtracker-extract.
Thanks, Aleksander.
I think it makes sense to fix this. Just to be clear, does this mean we
don't need Pango in libtracker-fts/tracker-parser.c to determine word
breaks for CJK?
I have no idea what libunistring is like; we should probably evaluate it
quickly before adopting it. It sounds like you have experience with it,
though.
--
Regards,
Martyn
_______________________________________________
tracker-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/tracker-list