> Will try to see if there is any way to manage it with only a single field?

Of course you can try to create a custom Tokenizer or TokenFilter that
perfectly meets your needs.
I would copy the source code of EdgeNGramTokenFilter and modify the
incrementToken() method; that seems like a reasonable way to me.
Note that incrementToken() of EdgeNGramTokenFilter cannot be overridden: it
is declared "final" in Solr 5, so subclassing will not work.
A corresponding custom TokenFilterFactory class is also needed. (See
EdgeNGramFilterFactory.)
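As a starting point, here is a plain-Java sketch of the core logic such a modified incrementToken() could implement in a single field: emit edge n-grams for Latin-script tokens, but pass Chinese (Han) tokens through unchanged. This is not an actual Lucene TokenFilter, and the class/method names are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not a real Lucene TokenFilter): the per-token decision
// a modified incrementToken() could make in a single field -- edge n-grams
// for Latin-script tokens, CJK tokens passed through unchanged.
public class ScriptAwareEdgeNGrams {

    // Returns true if the token contains any Han (Chinese) character.
    static boolean containsCjk(String token) {
        return token.codePoints().anyMatch(cp ->
                Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Expands a token stream: CJK tokens are kept as-is; other tokens are
    // replaced by their edge n-grams of length minGram..maxGram.
    static List<String> expand(List<String> tokens, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (containsCjk(token)) {
                out.add(token);
            } else {
                for (int len = minGram; len <= maxGram && len <= token.length(); len++) {
                    out.add(token.substring(0, len));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("search", "搜索");
        // The Latin token is broken into prefixes; the Chinese token is untouched.
        System.out.println(expand(tokens, 2, 4));
    }
}
```

In a real TokenFilter the same per-token decision would operate on Lucene's CharTermAttribute rather than plain strings, but the branching logic is the same.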

If you are not familiar with both Java and the internal architecture of
Lucene/Solr,
custom classes can bring intricate bugs/problems into your system. Be
sure to keep them under control.

Anyway, check out and look into the Java sources of the TokenFilters included
in Solr if you have not yet.
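For comparison, the two-field approach from the earlier mail (FIELD_2 with PatternTokenizerFactory) boils down to extracting only the English tokens from the mixed text. A rough plain-Java sketch of that extraction follows; the class name and the pattern are assumptions, and you would tune the regex for your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough plain-Java approximation of a PatternTokenizerFactory rule for
// FIELD_2: pull out only the ASCII-letter runs (the English words) from
// mixed English/Chinese text.
public class EnglishTokenExtractor {
    private static final Pattern ENGLISH_WORD = Pattern.compile("[A-Za-z]+");

    static List<String> englishTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ENGLISH_WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Chinese characters are skipped; only the English runs survive.
        System.out.println(englishTokens("Solr搜索engine测试"));
    }
}
```

EdgeNGramFilterFactory applied on top of such a field would then see only the English tokens.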

Thanks,
Tomoko

2015-10-26 11:19 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:

> Hi Tomoko,
>
> Thank you for your recommendation.
>
> I wasn't in favour of using copyField at first to have 2 separate fields
> for English and Chinese tokens, as it not only increases the index size,
> but also slows down the performance of both indexing and querying.
>
> Will try to see if there is any way to manage it with only a single field?
>
> Regards.
> Edwin
>
>
> On 25 October 2015 at 22:59, Tomoko Uchida <tomoko.uchida.1...@gmail.com>
> wrote:
>
> > Hi, Edwin,
> >
> > > This means it is better to have 2 separate fields for English and
> > > Chinese words?
> >
> > Yes. I mean,
> > 1. Define FIELD_1, which uses HMMChineseTokenizerFactory to extract English
> > and Chinese tokens.
> > 2. Define FIELD_2, which uses PatternTokenizerFactory to extract English
> > tokens and EdgeNGramFilter to break the tokens into sub-strings.
> >     There might be several possible tokenizer/filter chains to extract English
> > tokens; please try and find the best way ;)
> > 3. Index the original text to FIELD_1 to search tokens as they are (for both
> > English and Chinese words).
> > 4. Index the original text to FIELD_2 to perform prefix match (for English
> > words).
> > 5. Search FIELD_1 and FIELD_2 by using edismax query parser, etc.
> >
> > You can use copyField to index the original text data to FIELD_1 and FIELD_2.
> > The downside of this method is that it increases the index size, as you know.
> >
> > If you want to manage that *by one field*, I think you can create a custom
> > token filter on your own... but it may be slightly advanced.
> >
> > Thanks,
> > Tomoko
> >
> > 2015-10-25 22:48 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> >
> > > Hi Tomoko,
> > >
> > > Thank you for your reply.
> > >
> > > If you need to perform partial (prefix) match for **only English words**,
> > > you can create a separate field that keeps only English words (I've never
> > > tried that, but might be possible by PatternTokenizerFactory or other
> > > tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to the
> > > field.
> > >
> > > This means it is better to have 2 separate fields for English and
> > > Chinese words?
> > > Not quite sure what you mean by that.
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > > On 25 October 2015 at 11:42, Tomoko Uchida <tomoko.uchida.1...@gmail.com>
> > > wrote:
> > >
> > > > > I have rich-text documents that are in both English and Chinese, and
> > > > > currently I have EdgeNGramFilterFactory enabled during indexing, as I
> > > > > need it for partial matching for English words. But this means it will
> > > > > also break up each of the Chinese characters into different tokens.
> > > >
> > > > EdgeNGramFilterFactory creates sub-strings (prefixes) from each token.
> > > > Its behavior is independent of language.
> > > > If you need to perform partial (prefix) match for **only English words**,
> > > > you can create a separate field that keeps only English words (I've never
> > > > tried that, but might be possible by PatternTokenizerFactory or other
> > > > tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to the
> > > > field.
> > > >
> > > > Hope it helps,
> > > > Tomoko
> > > >
> > > > 2015-10-23 13:04 GMT+09:00 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
> > > >
> > > > > Hi,
> > > > >
> > > > > Would like to check, is it good to use EdgeNGramFilterFactory for
> > > > > indexes that contains Chinese characters?
> > > > > Will it affect the accuracy of the search for Chinese words?
> > > > >
> > > > > I have rich-text documents that are in both English and Chinese, and
> > > > > currently I have EdgeNGramFilterFactory enabled during indexing, as I
> > > > > need it for partial matching for English words. But this means it will
> > > > > also break up each of the Chinese characters into different tokens.
> > > > >
> > > > > I'm using the HMMChineseTokenizerFactory for my tokenizer.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Edwin
> > > > >
> > > >
> > >
> >
>
