Ah, thanks. I'll tentatively set one in the future, but definitely not 2.9.x.

More just to show you the idea: you can do different things depending on
different runs of writing systems in the text.
But it doesn't solve everything: you only know it's Latin script, not English,
so you can't safely do anything automatic like stemming.

Say your content is only Chinese and English:

The analyzer won't know from the Unicode whether your Latin-script text is
English or, say, French, so it won't stem it, but it will lowercase it.
It won't know whether your ideographs are Chinese or Japanese, but it will
use n-gram tokenization. You get the drift.
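
To make that concrete, here's a rough standalone sketch of the run-splitting
idea (not the LUCENE-1488 code, which uses ICU and is more careful). It leans
on java.lang.Character.UnicodeScript, so it assumes Java 7+, and the per-script
handling is just stubbed out:

import java.lang.Character.UnicodeScript;
import java.util.Locale;

public class ScriptRunDemo {
  public static void main(String[] args) {
    splitRuns("Tokyo 東京 is written in two scripts");
  }

  // Walk the text codepoint by codepoint, emitting a run whenever the script changes.
  static void splitRuns(String text) {
    UnicodeScript current = null;
    int start = 0;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      UnicodeScript script = UnicodeScript.of(cp);
      // Spaces and punctuation are COMMON/INHERITED; fold them into the current run.
      if (script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED) {
        script = (current == null) ? script : current;
      }
      if (current != null && script != current) {
        handleRun(text.substring(start, i), current);
        start = i;
      }
      current = script;
      i += Character.charCount(cp);
    }
    if (start < text.length()) {
      handleRun(text.substring(start), current);
    }
  }

  // Per-script treatment: this is where you'd pick a tokenizer per run.
  static void handleRun(String run, UnicodeScript script) {
    if (script == UnicodeScript.LATIN) {
      System.out.println("LATIN, lowercased: " + run.toLowerCase(Locale.ROOT));
    } else if (script == UnicodeScript.HAN) {
      System.out.println("HAN, would n-gram: " + run);
    } else {
      System.out.println(script + ": " + run);
    }
  }
}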

In that implementation, it puts the script code in the token flags, so
downstream you could do something like stemming if you happen to know more
than is evident from the Unicode.
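
A downstream filter could look roughly like this. It's a sketch against the
newer attribute-based TokenStream API (not the 2.9-era Token API), LATIN_FLAG
is a made-up constant standing in for whatever flag value your tokenizer
actually writes, and you'd only add the filter to fields you know are English:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;

/**
 * Stems only tokens whose flags mark them as Latin script, for fields where
 * you know out of band that the Latin-script text is English.
 */
public final class LatinOnlyStemFilter extends TokenFilter {
  // Hypothetical flag bit; the real value depends on what the tokenizer writes.
  public static final int LATIN_FLAG = 1;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);
  private final EnglishStemmer stemmer = new EnglishStemmer();

  public LatinOnlyStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Stem only the tokens the tokenizer flagged as Latin script.
    if ((flagsAtt.getFlags() & LATIN_FLAG) != 0) {
      stemmer.setCurrent(termAtt.toString());
      if (stemmer.stem()) {
        termAtt.setEmpty().append(stemmer.getCurrent());
      }
    }
    return true;
  }
}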

On Fri, Nov 13, 2009 at 6:23 PM, Peter Wolanin <peter.wola...@acquia.com> wrote:

> Thanks for the link - there doesn't seem to be a fix version specified,
> so I guess this will not officially ship with Lucene 2.9?
>
> -Peter
>
> On Wed, Nov 11, 2009 at 10:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> > Peter, here is a project that does this:
> > http://issues.apache.org/jira/browse/LUCENE-1488
> >
> >
> >> That's kind of interesting - in general, can I build a custom tokenizer
> >> from existing tokenizers that treats different parts of the input
> >> differently based on the Unicode range of the characters?  E.g., use a
> >> Porter stemmer for stretches of Latin text and n-grams or something
> >> else for CJK?
> >>
> >> -Peter
> >>
> >> On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic
> >> <otis_gospodne...@yahoo.com> wrote:
> >> > Yes, that's the n-gram one.  I believe the existing CJK one in Lucene
> >> > is really just an n-gram tokenizer, so no different than the normal
> >> > n-gram tokenizer.
> >> >
> >> > Otis
> >> > --
> >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >> >
> >> >
> >> >
> >> > ----- Original Message ----
> >> >> From: Peter Wolanin <peter.wola...@acquia.com>
> >> >> To: solr-user@lucene.apache.org
> >> >> Sent: Tue, November 10, 2009 7:34:37 PM
> >> >> Subject: Re: any docs on solr.EdgeNGramFilterFactory?
> >> >>
> >> >> So, this is the normal N-gram one?  NGramTokenizerFactory
> >> >>
> >> >> Digging deeper - there are actually CJK and Chinese tokenizers in the
> >> >> Solr codebase:
> >> >>
> >> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
> >> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html
> >> >>
> >> >> The CJK one uses the Lucene CJKTokenizer:
> >> >>
> >> >> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> >> >>
> >> >> and there even seems to be another one that no one has wrapped into
> >> >> Solr:
> >> >>
> >> >> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html
> >> >>
> >> >> So it seems like the existing options are a little better than I
> >> >> thought, though it would be nice to have some docs on properly
> >> >> configuring these.
> >> >>
> >> >> -Peter
> >> >>
> >> >> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic
> >> >> wrote:
> >> >> > Peter,
> >> >> >
> >> >> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but
> >> >> > just n-grams.
> >> >> > Before you take the n-gram route, you may want to look at the smart
> >> >> > Chinese analyzer in Lucene contrib (I think it works only for
> >> >> > Simplified Chinese) and Sen (on java.net).  I also spotted a Korean
> >> >> > analyzer in the wild a few months back.
> >> >> >
> >> >> > Otis
> >> >> > --
> >> >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >> >> >
> >> >> >
> >> >> >
> >> >> > ----- Original Message ----
> >> >> >> From: Peter Wolanin
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Sent: Tue, November 10, 2009 4:06:52 PM
> >> >> >> Subject: any docs on solr.EdgeNGramFilterFactory?
> >> >> >>
> >> >> >> This fairly recent blog post:
> >> >> >>
> >> >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> >> >> >>
> >> >> >> describes the use of the solr.EdgeNGramFilterFactory as the
> >> >> >> tokenizer for the index.  I don't see any mention of that tokenizer
> >> >> >> on the Solr wiki - is it just waiting to be added, or is there any
> >> >> >> other documentation in addition to the blog post?  In particular,
> >> >> >> there was a thread last year about using an N-gram tokenizer to
> >> >> >> enable reasonable (if not ideal) searching of CJK text, so I'd be
> >> >> >> curious to know how people are configuring their schema (with this
> >> >> >> tokenizer?) for that use case.
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >> Peter
> >> >> >>
> >> >> >> --
> >> >> >> Peter M. Wolanin, Ph.D.
> >> >> >> Momentum Specialist,  Acquia. Inc.
> >> >> >> peter.wola...@acquia.com
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Peter M. Wolanin, Ph.D.
> >> >> Momentum Specialist,  Acquia. Inc.
> >> >> peter.wola...@acquia.com
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Peter M. Wolanin, Ph.D.
> >> Momentum Specialist,  Acquia. Inc.
> >> peter.wola...@acquia.com
> >>
> >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>
>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist,  Acquia. Inc.
> peter.wola...@acquia.com
>



-- 
Robert Muir
rcm...@gmail.com
