Hi Watanabe-san,

In the current implementation, search runs in the browser, not on the
server, so the browser has to download the search dictionary.
However, an n-gram dictionary may become fairly large and hard to download.
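
To make the size concern concrete, here is a minimal Python sketch of
character bi-gram indexing. It is only an illustration under my own
assumptions: char_bigrams and build_index are hypothetical names, not part
of Sphinx, Oktavia, or the Thunderbird tokenizer.

```python
def char_bigrams(text):
    """Return the overlapping two-character tokens of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character is kept as its own token; empty input gives no tokens.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bi-gram to the set of document names that contain it."""
    index = {}
    for name, text in docs.items():
        for gram in char_bigrams(text):
            index.setdefault(gram, set()).add(name)
    return index


# Hypothetical two-document corpus, reusing the sample strings from this thread.
docs = {"pek": "北京首都国际机场", "icn": "인천국제공항"}
index = build_index(docs)

print(char_bigrams(docs["pek"]))
print(len(index), "distinct index terms")
```

A string of n characters yields n-1 overlapping terms, and every distinct
pair becomes a dictionary entry, which is why the dictionary for a large
document set can grow quickly.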

How about using Oktavia (http://oktavia.info/)?
It uses an FM-index, so it can search text in any language very fast with
a small dictionary, and it can be used from Sphinx.

WAKAYAMA Shirou


2014/1/7 Kevin Horn <[email protected]>:
> This isn't necessarily about Asian languages, but if you're interested in
> FTS for Sphinx, you may want to take a look at the whoosh builder extension,
> in the sphinx-contrib repo:
> https://bitbucket.org/birkenfeld/sphinx-contrib
>
>
>
> On Mon, Jan 6, 2014 at 4:41 AM, Hiroki Watanabe <[email protected]>
> wrote:
>>
>> Hello,
>>
>> > Takayuki SHIMIZUKAWA wrote:
>> > FYI, the Sphinx built-in search feature provides two language modes:
>> > 'en' and 'ja'.
>>
>> Does Sphinx have a plan to introduce a language-independent tokenizer
>> to support not only Japanese but also Chinese, Korean, and Thai?
>> Like Japanese, these Asian languages are not separated by white space.
>>
>> TinySegmenter, which is Sphinx's tokenizer for Japanese, does not work
>> well for Chinese/Korean/Thai.
>>
>> I tested TinySegmenter on Chinese and Korean using the TinySegmenter
>> Online Demo.
>>
>> TinySegmenter Online Demo:
>> http://chasen.org/~taku/software/TinySegmenter/
>>
>> The results are as follows:
>>
>> 北京首都国际机场 (Beijing Capital International Airport)
>> TinySegmenter: 北京首 | 都国 | 际机 | 场
>> Expected: 北京 | 首都 | 国际 | 机场
>>
>> 인천국제공항 (Incheon International Airport)
>> TinySegmenter: 인 | 천 | 국제 | 공 | 항
>> Expected: 인천 | 국제 | 공항
>>
>> As you can see, TinySegmenter does not work well for these languages.
>>
>> I think the Mozilla Thunderbird team's approach could be adapted to
>> Sphinx as well. The following post describes the problem that their
>> full-text search did not work for CJK and how they solved it.
>>
>> Thunderbird 3.0 global / full-text search support for CJK languages
>> landed,
>> will show up in nightlies tomorrow, requires a new database.
>>
>> https://groups.google.com/forum/#!topic/mozilla.dev.apps.thunderbird/v0_gbw4LIKo
>>
>> They solved it by enhancing SQLite's porter tokenizer with a bi-gram
>> algorithm.
>>
>> SQLite's fts3_porter.c, enhanced with a bi-gram algorithm by the Mozilla
>> Thunderbird team:
>>
>> http://hg.mozilla.org/comm-central/file/tip/mailnews/extensions/fts3/src/fts3_porter.c
>>
>> I think introducing SQLite FTS into Sphinx may be difficult and not
>> appropriate, but their approach itself is worth considering as a way to
>> support multi-language search.
>>
>> Best regards,
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "sphinx-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/sphinx-users.
>> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
>
> --
> --
> Kevin Horn
