This isn't necessarily about Asian languages, but if you're interested in
full-text search (FTS) for Sphinx, you may want to take a look at the Whoosh
builder extension in the sphinx-contrib repo:
https://bitbucket.org/birkenfeld/sphinx-contrib



On Mon, Jan 6, 2014 at 4:41 AM, Hiroki Watanabe
<[email protected]> wrote:

> Hello,
>
> > Takayuki SHIMIZUKAWA wrote:
> > FYI, the Sphinx built-in search feature provides two language modes: 'en'
> > and 'ja'.
>
> Does Sphinx have a plan to introduce a language-independent tokenizer to
> support not only Japanese but also Chinese, Korean, and Thai? Like
> Japanese, these Asian languages do not separate words with white space.
>
> TinySegmenter, which is Sphinx's tokenizer for Japanese, does not work
> well for Chinese/Korean/Thai.
>
> I tested TinySegmenter on Chinese and Korean using the TinySegmenter
> online demo.
>
> TinySegmenter Online Demo:
> http://chasen.org/~taku/software/TinySegmenter/
>
> The results are as follows:
>
> 北京首都国际机场 (Beijing Capital International Airport)
> TinySegmenter: 北京首 | 都国 | 际机 | 场
> Expected: 北京 | 首都 | 国际 | 机场
>
> 인천국제공항 (Incheon International Airport)
> TinySegmenter: 인 | 천 | 국제 | 공 | 항
> Expected: 인천 | 국제 | 공항
>
> As you can see, TinySegmenter does not work well for these languages.
>
> I think the Mozilla Thunderbird team's approach could also be adapted to
> Sphinx. The following post describes how their full-text search did not
> work for CJK languages and how they solved the problem.
>
> Thunderbird 3.0 global / full-text search support for CJK languages landed,
> will show up in nightlies tomorrow, requires a new database.
>
> https://groups.google.com/forum/#!topic/mozilla.dev.apps.thunderbird/v0_gbw4LIKo
>
> They solved it by enhancing SQLite's Porter tokenizer with a bigram
> algorithm.
>
> SQLite's fts3_porter.c, enhanced with a bigram algorithm by the Mozilla
> Thunderbird team:
>
> http://hg.mozilla.org/comm-central/file/tip/mailnews/extensions/fts3/src/fts3_porter.c
>
> I think introducing SQLite FTS into Sphinx may be difficult and not
> appropriate, but their approach itself is worth considering as a way to
> support multi-language search.
>
> Best regards,
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "sphinx-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/sphinx-users.
> For more options, visit https://groups.google.com/groups/opt_out.
>
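
For illustration, the bigram idea described in the quoted message can be
sketched in a few lines of Python. This is only a minimal sketch under stated
assumptions, not Thunderbird's fts3_porter.c logic and not anything Sphinx
ships: the function name bigram_tokenize and the character ranges used to
detect CJK text are hypothetical choices made for this example, and Thai
would need its own range.

```python
import re

# Assumption: these ranges approximate "CJK" for this sketch only
# (Hiragana/Katakana, common Han ideographs, Hangul syllables).
_CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3"

# A token is either a run of CJK characters or a run of other word characters.
_TOKEN_RE = re.compile(r"[{0}]+|[^\W{0}]+".format(_CJK))
_CJK_RE = re.compile(r"[{0}]".format(_CJK))


def bigram_tokenize(text):
    """Return index terms: overlapping bigrams for CJK runs,
    lowercased whole words for everything else."""
    terms = []
    for run in _TOKEN_RE.findall(text):
        if _CJK_RE.match(run):
            if len(run) == 1:
                # A lone CJK character is kept as a single term.
                terms.append(run)
            else:
                # Emit every pair of adjacent characters in the run.
                terms.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            terms.append(run.lower())
    return terms


print(bigram_tokenize("北京首都国际机场"))
# ['北京', '京首', '首都', '都国', '国际', '际机', '机场']
print(bigram_tokenize("인천국제공항"))
# ['인천', '천국', '국제', '제공', '공항']
```

Because a query is tokenized the same way, a search for 首都 or 공항 yields a
bigram that is present in the indexed terms, with no dictionary or
language-specific model. The cost is extra terms such as 京首 that are not
real words, so the index is larger and some false matches are possible.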



-- 
--
Kevin Horn
