Hi Watanabe-san,

In the current implementation, search runs on the browser side, not on the server, so the browser has to download the dictionary. However, an n-gram dictionary may become fairly big and hard to download.
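As a rough illustration of why an n-gram dictionary grows, here is a minimal Python sketch of character bi-gram indexing. It is not Sphinx's search code: `char_bigrams` and `build_index` are hypothetical names, and the sample strings are the two airport names from the quoted message.

```python
import unicodedata


def char_bigrams(text):
    """Return overlapping character bi-grams of text, ignoring whitespace."""
    # Normalize so that precomposed characters (e.g. Hangul syllables)
    # count as one character each.
    text = unicodedata.normalize("NFC", text)
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bi-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in char_bigrams(text):
            index.setdefault(term, set()).add(doc_id)
    return index


docs = ["北京首都国际机场", "인천국제공항"]
index = build_index(docs)

print(char_bigrams(docs[0]))
# ['北京', '京首', '首都', '都国', '国际', '际机', '机场']
print(len(index))
# 12
```

A text of n characters contributes up to n-1 terms, including pairs such as 京首 and 都国 that are not words, so the number of distinct terms tends to grow much faster than with a word-based dictionary.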
How about using oktavia (http://oktavia.info/)? It can search any language very fast with a small dictionary by using an FM-index, and it can be used from Sphinx.

WAKAYAMA Shirou

2014/1/7 Kevin Horn <[email protected]>:
> This isn't necessarily about Asian languages, but if you're interested in
> FTS for Sphinx, you may want to take a look at the whoosh builder extension,
> in the sphinx-contrib repo:
> https://bitbucket.org/birkenfeld/sphinx-contrib
>
> On Mon, Jan 6, 2014 at 4:41 AM, Hiroki Watanabe <[email protected]>
> wrote:
>>
>> Hello,
>>
>> > Takayuki SHIMIZUKAWA wrote:
>> > FYI, the sphinx built-in search feature provides 2 language mode:'en'
>> > and 'ja'.
>>
>> Does Sphinx have a plan to introduce a language-independent tokenizer
>> to support not only Japanese but also Chinese, Korean and Thai?
>> Like Japanese, these Asian languages are not separated by white-space.
>>
>> TinySegmenter, which is Sphinx's tokenizer for Japanese, does not work
>> well for Chinese/Korean/Thai.
>>
>> I tested TinySegmenter on Chinese and Korean with the TinySegmenter Online Demo.
>>
>> TinySegmenter Online Demo:
>> http://chasen.org/~taku/software/TinySegmenter/
>>
>> The following are the results:
>>
>> 北京首都国际机场 (Beijing Capital International Airport)
>> TinySegmenter: 北京首 | 都国 | 际机 | 场
>> Expected: 北京 | 首都 | 国际 | 机场
>>
>> 인천국제공항 (Incheon International Airport)
>> TinySegmenter: 인 | 천 | 국제 | 공 | 항
>> Expected: 인천 | 국제 | 공항
>>
>> As you see, TinySegmenter does not work well for these languages.
>>
>> I think the Mozilla Thunderbird team's approach can be adapted to Sphinx as well.
>> The following thread describes the problem that their full-text search
>> did not work for CJK and how they solved it.
>>
>> Thunderbird 3.0 global / full-text search support for CJK languages
>> landed, will show up in nightlies tomorrow, requires a new database.
>>
>> https://groups.google.com/forum/#!topic/mozilla.dev.apps.thunderbird/v0_gbw4LIKo
>>
>> They solved it by enhancing SQLite's porter tokenizer with a bi-gram
>> algorithm.
>>
>> SQLite fts3_porter.c, enhanced with the bi-gram algorithm by the Mozilla
>> Thunderbird team:
>>
>> http://hg.mozilla.org/comm-central/file/tip/mailnews/extensions/fts3/src/fts3_porter.c
>>
>> I think introducing SQLite FTS into Sphinx may be difficult and not
>> appropriate, but their approach itself is worth considering as a way to
>> support a multi-language search function.
>>
>> Best regards,
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "sphinx-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/sphinx-users.
>> For more options, visit https://groups.google.com/groups/opt_out.
>
> --
> Kevin Horn
