Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-30 Thread Charlie Hull

On 30/09/2015 04:09, Zheng Lin Edwin Yeo wrote:

Hi Charlie,


Hi,


I've checked that Paoding's code is written for Solr 3 and Solr 4. It is
not written for Solr 5, so I was unable to use it with my Solr 5.x
version.


I'm pretty sure we had to recompile it for v4.6 as well... it has been a 
little painful.


Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?


I don't think so.

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-30 Thread Charlie Hull

On 30/09/2015 10:13, Zheng Lin Edwin Yeo wrote:

Hi Charlie,


Hi Edwin,


Thanks for your reply. It seems quite a number of the Chinese tokenizers
are not really compatible with the newer versions of Solr.

I'm also looking at HMMChineseTokenizer and JiebaTokenizer to see if they
are suitable for Solr 5.x too.


I think there is a general lack of knowledge (at least in the 
non-Chinese-speaking community) about the best way to analyze Chinese 
content with Lucene/Solr - so if you can write up your experiences that 
would be great!


Cheers

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-30 Thread Zheng Lin Edwin Yeo
Hi Charlie,

Thanks for your reply. It seems quite a number of the Chinese tokenizers
are not really compatible with the newer versions of Solr.

I'm also looking at HMMChineseTokenizer and JiebaTokenizer to see if they
are suitable for Solr 5.x too.

Regards,
Edwin


On 30 September 2015 at 16:20, Charlie Hull  wrote:



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-30 Thread Zheng Lin Edwin Yeo
Hi Charlie,

Yes, sure. I'm now finalising my testing with all the different tokenizers,
and trying to understand how each tokenizer actually works. Hopefully I
will be able to share something useful about my experience once I'm done
with it.

Regards,
Edwin


On 30 September 2015 at 17:25, Charlie Hull  wrote:



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-29 Thread Zheng Lin Edwin Yeo
Hi Charlie,

I've checked that Paoding's code is written for Solr 3 and Solr 4. It is
not written for Solr 5, so I was unable to use it with my Solr 5.x
version.

Have you tried to use HMMChineseTokenizer and JiebaTokenizer as well?

Regards,
Edwin


On 25 September 2015 at 18:46, Charlie Hull  wrote:



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-25 Thread Charlie Hull

On 23/09/2015 16:23, Alexandre Rafalovitch wrote:

You may find the following articles interesting:
http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
(a whole epic journey)
https://dzone.com/articles/indexing-chinese-solr


The latter article is great and we drew on it when helping a recent 
client with Chinese indexing. However, if you do use Paoding bear in 
mind that it has few if any tests and all the comments are in Chinese. 
We found a problem with it recently (it breaks the Lucene highlighters) 
and have submitted a patch: 
http://git.oschina.net/zhzhenqin/paoding-analysis/issues/1


Cheers

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-25 Thread Zheng Lin Edwin Yeo
Hi Charlie,

Thanks for your comment. I faced compatibility issues with Paoding when I
tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code was
optimised for Solr 3.6.

Which version of Solr were you using when you tried Paoding?

Regards,
Edwin


On 25 September 2015 at 16:43, Charlie Hull  wrote:



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-25 Thread Charlie Hull

On 25/09/2015 11:43, Zheng Lin Edwin Yeo wrote:

Hi Charlie,

Thanks for your comment. I faced compatibility issues with Paoding when I
tried it in Solr 5.1.0 and Solr 5.2.1, and I found out that the code was
optimised for Solr 3.6.

Which version of Solr were you using when you tried Paoding?


Solr v4.6 I believe.

Charlie



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Erick Erickson
In a word, no. The CJK languages in general don't
necessarily tokenize on whitespace, so a tokenizer
that splits on whitespace by default simply won't
work.
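
As a rough illustration of the alternative, here is a minimal schema.xml
fieldType sketch using the smartcn tokenizer. This assumes the
analysis-extras contrib (with the lucene-analyzers-smartcn jar) is on the
classpath, and the type name is made up:

  <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- segments Chinese text into words with a hidden Markov model,
           rather than relying on whitespace (which Chinese doesn't use) -->
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>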

Have you tried it? It seems a simple test would get you
an answer faster.

Best,
Erick

On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo 
wrote:



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Rich Cariens
For what it's worth, we've had good luck using the ICUTokenizer and
associated filters. A native Chinese speaker here at the office gave us an
enthusiastic thumbs up on our Chinese search results. Your mileage may vary
of course.
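
For anyone wanting to try the same, a minimal field type sketch; this
assumes the analysis-extras contrib with the ICU jars is installed, and
the names are illustrative:

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- applies Unicode UAX#29 word-break rules with per-script handling:
           Latin splits on spaces/punctuation, CJK gets script-aware treatment -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- folds case, accents and width variants in one pass -->
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>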

On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson 
wrote:

>


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Alexandre Rafalovitch
You may find the following articles interesting:
http://discovery-grindstone.blogspot.ca/2014/01/searching-in-solr-analyzing-results-and.html
(a whole epic journey)
https://dzone.com/articles/indexing-chinese-solr

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 September 2015 at 10:41, Zheng Lin Edwin Yeo  wrote:


Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Zheng Lin Edwin Yeo
Hi,

I would like to check: will StandardTokenizerFactory work well for indexing
both English and Chinese (bilingual) documents, or do we need tokenizers
that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
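
For reference, the kind of analysis chain I have in mind is along the
lines of the stock text_general type in Solr's example schema; a rough
sketch from memory, details may differ:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Unicode-aware word segmentation; note it emits each CJK
           ideograph as its own single-character token -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>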


Regards,
Edwin


Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Zheng Lin Edwin Yeo
Thanks Rich and Alexandre,

I'll probably test out the CJKTokenizer as well.
Previously I had some issues with Paoding in Solr 5.2.1, but I haven't
tested it on 5.3.0 yet.
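
As far as I know, in recent Lucene/Solr versions the old CJKTokenizer has
been superseded by StandardTokenizer plus CJKBigramFilter, roughly like
the sketch below (the shipped example schemas have a text_cjk type along
these lines):

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- normalises half-width/full-width character forms -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- joins adjacent CJK characters into overlapping bigrams: a
           dictionary-free middle ground between single characters and
           true word segmentation -->
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>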

Regards,
Edwin


On 23 September 2015 at 23:23, Alexandre Rafalovitch 
wrote:



Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Zheng Lin Edwin Yeo
Hi Erick,

Yes, I did try the StandardTokenizer, and it seems to work well for
both English and Chinese words. It also gives faster indexing and
response times during querying. The drawback is that StandardTokenizer
cuts Chinese text into individual characters, instead of the phrases
that only a custom Chinese tokenizer can produce.

I tried HMMChineseTokenizer too, and the indexing and querying times
are slower than StandardTokenizer's. It works well for the Chinese
words, but there are a lot of mismatches for the English words.
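
A possible workaround might be to index the same content twice and query
across both fields, so that English terms match in one field and Chinese
phrases in the other. A sketch with hypothetical field names, assuming a
text_zh type backed by a Chinese tokenizer:

  <!-- content_en keeps the stored copy; content_zh only needs the index -->
  <field name="content_en" type="text_general" indexed="true" stored="true"/>
  <field name="content_zh" type="text_zh" indexed="true" stored="false"/>
  <copyField source="content_en" dest="content_zh"/>

Queries could then search both fields, e.g. with edismax and
qf=content_en content_zh.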

Regards,
Edwin


On 23 September 2015 at 23:04, Erick Erickson 
wrote:
