UTF-8 handles most cases, but I still have to deal with emails in
ISO-2022-JP, Shift_JIS and EUC-JP. After some research, it seems Xapian has
no support for Asian languages. I will run some tests and open an
issue if I cannot make it work.
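
To illustrate the encoding side of the problem, here is a minimal sketch of
normalizing a message body from one of these legacy Japanese charsets to
UTF-8 before it reaches an indexer. It uses only Python's standard codecs and
is not how sup handles this; the function name `to_utf8` and the UTF-8
fallback for unknown labels are assumptions made for the example.

```python
import codecs


def to_utf8(raw, charset):
    """Decode raw bytes using the declared charset and re-encode as UTF-8.

    `charset` is the label declared by the message, for example
    "iso-2022-jp", "shift_jis" or "euc-jp". Python's codec lookup already
    accepts these spellings as aliases.
    """
    try:
        codec = codecs.lookup(charset).name
    except LookupError:
        # Assumption for this sketch: unknown labels are treated as UTF-8.
        codec = "utf-8"
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode(codec, errors="replace")
    return text.encode("utf-8")
```

Replacing undecodable bytes rather than failing is a choice made for this
sketch; mail with a wrong charset declaration would still come out garbled.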

I can see in the sup configuration file that the stemming language can be
configured, but I cannot find any CJK stemmers for Xapian.
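
Since Japanese and Chinese text has no spaces between words, one common
workaround when no segmenter is available is to index overlapping character
bigrams instead of stemmed words. The following is a rough sketch of that
idea in plain Python. It does not use Xapian's API, the helper names
`is_cjk` and `cjk_bigrams` are hypothetical, and the code point ranges cover
only kana and the basic CJK ideograph block.

```python
def is_cjk(ch):
    """Return True for kana and basic CJK ideographs (simplified ranges)."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x30FF      # Hiragana and Katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
    )


def cjk_bigrams(text):
    """Return overlapping 2-character terms for each run of CJK characters."""
    terms = []
    run = ""
    for ch in text + " ":           # the trailing space flushes the last run
        if is_cjk(ch):
            run += ch
            continue
        if len(run) == 1:
            # A lone CJK character is kept as a single-character term.
            terms.append(run)
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
        run = ""
    return terms
```

A query would have to be split the same way so that its bigrams match the
indexed ones. Non-CJK text is ignored here and would still go through the
normal word tokenizer.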


On Thu, May 2, 2013 at 5:17 PM, Gaute Hope <e...@gaute.vetsj.com> wrote:

>
>
> On 30. april 2013 11:44, Horacio Sanson wrote:
> > Great to see Sup getting back on track again.
> >
> > I submitted some patches for the Gmail dumper of Heliotrope some time ago,
> > but the lack of support for non-alphabetic languages (Japanese, Chinese)
> > made it impossible for me to keep using heliotrope/turnsole.
> >
> > The main issue in supporting Japanese/Chinese with heliotrope was that
> > whistlepig (the indexer) lacked the ability to tokenize these languages.
> > Also, the half-baked UTF-8 support caused several issues with these
> > languages.
> >
> > I would like to help in testing/implementing support for these languages,
> > starting with Japanese, but I would need some guidance. First, I would
> > like to know if there is a way to configure the Xapian tokenizer
> > (segmenter) within sup. Please consider that I am new to both sup and
> > Xapian.
>
> Hi Horacio,
>
> Consider opening an issue at
> https://github.com/sup-heliotrope/sup/issues to make sure this doesn't
> disappear. Some changes will probably be made to the indexer when going
> to Mail (from RMail), but I hope to be able to migrate the existing
> index. Perhaps it's time to get it right for arbitrary languages as well.
> I am unfamiliar with Japanese/Chinese - does UTF-8 cover the needs?
>
> Mail is better at handling UTF-8 and I think there was some fork that
> had some extra support for Japanese.
>
> Regards, Gaute
>
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel