What are the associated Analyzers for your Gene and Token?
Because if they're NOT something akin to KeywordAnalyzer, you
have a problem. Specifically, most of the "regular" tokenizers will
break this stream up into three separate terms,
"brain", "natriuretic", and "peptide". If that's the case, the
I won't go into much code-level detail, but I'd index the phrases
both as tri-gram shingles and as uni-grams. I think this will give
you the results you're looking for: you'll be able to quickly look up
the count of a given phrase (i.e. a tri-gram) such as
"blue_shorts_burough".
On Fri, J
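To make the shingle suggestion above a bit more concrete, here is a minimal sketch in plain Java. It does not use the Lucene API at all; the class ShingleSketch and its methods are hypothetical names made up for illustration, and the sample sentence is invented. In Lucene itself, a ShingleFilter in the analysis chain would presumably play this role at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: models shingling outside of Lucene.
class ShingleSketch {

    // Build word n-grams ("shingles") of the given size, joined with '_'.
    static List<String> shingles(String text, int size) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return out; // no words, no shingles
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join("_", Arrays.copyOfRange(words, i, i + size)));
        }
        return out;
    }

    // Count uni-grams and tri-grams in one map, as if both were indexed terms.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : shingles(text, 1)) {
            counts.merge(term, 1, Integer::sum);
        }
        for (String term : shingles(text, 3)) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample text, not taken from the poster's data.
        String text = "brain natriuretic peptide was measured and brain natriuretic peptide rose";
        Map<String, Integer> counts = termCounts(text);
        System.out.println(counts.get("brain_natriuretic_peptide"));
        System.out.println(counts.get("brain"));
    }
}
```

With both term types in the same map, the count of a three-word phrase is a single lookup on its joined tri-gram, while single words remain searchable as before.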
@All: Elaborating on the problem.
The phrase is being indexed as a single token ...
I have a Gene tag in the XML document, which looks like
brain natriuretic peptide
This phrase is present in the abstract text for the given document.
The code is:
doc.add(new Field("Gene", geneName, Field.Store.YES
When do you detect that they are phrases? During indexing or during search?
On Jan 8, 2010, at 5:16 AM, hrishim wrote:
>
> Hi .
> I have phrases like brain natriuretic peptide indexed as a single token
> using Lucene.
> When I calculate the term frequency for the same the count is 0 since the
On a quick read, your statements are contradictory.
Either "brain natriuretic peptide" is a single token/term or it's not.
Are you sure you're not confusing indexing and storing? What
analyzer are you using at index time?
Erick
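As a rough illustration of why the analyzer question matters, the sketch below models two tokenization styles in plain Java. It is not Lucene code: TokenizationSketch, keywordTerms and wordTerms are hypothetical names, and the abstract text is invented. The assumption being illustrated is that the Gene field is kept as one term while the abstract is split into words, which is one possible reading of why the reported frequency is 0.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: a toy model of index terms, not the Lucene API.
class TokenizationSketch {

    // Keyword-style analysis: the whole field value becomes a single term.
    static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    // Word-style analysis: the value is split on whitespace into separate terms.
    static List<String> wordTerms(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    // How many times does this exact term occur in the term list?
    static int termFrequency(List<String> terms, String term) {
        return Collections.frequency(terms, term);
    }

    public static void main(String[] args) {
        String gene = "brain natriuretic peptide";
        // Invented abstract text for the example.
        String abstractText = "plasma brain natriuretic peptide was elevated";

        // The phrase is one term in the keyword-style field ...
        System.out.println(termFrequency(keywordTerms(gene), gene));
        // ... but it never occurs as a single term in the word-split abstract.
        System.out.println(termFrequency(wordTerms(abstractText), gene));
        // The individual words do occur there.
        System.out.println(termFrequency(wordTerms(abstractText), "peptide"));
    }
}
```

If the two fields really are analyzed differently, looking up the whole phrase as one term in the abstract can never match, however often the three words appear next to each other.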
On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:
You could issue a PhraseQuery and count how many hits come back. Is
that too slow? If so, you could detect all phrases during indexing and
add them as tokens to the index.
Mike
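For readers unfamiliar with what "count the hits of a phrase query" amounts to, here is a small plain-Java model of it. It is only a sketch and does not call Lucene: PhraseCountSketch and its methods are hypothetical names, the documents are invented, and real phrase matching in Lucene works on indexed term positions rather than by scanning text like this.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a linear-scan model of phrase matching.
class PhraseCountSketch {

    // Lowercase and split on whitespace, as a stand-in for an analyzer.
    static String[] tokenize(String text) {
        return text.trim().toLowerCase().split("\\s+");
    }

    // True if the phrase tokens occur consecutively somewhere in the document tokens.
    static boolean containsPhrase(String[] tokens, String[] phrase) {
        if (phrase.length == 0) {
            return false;
        }
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens[i + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    // Number of documents ("hits") that contain the phrase at least once.
    static int countHits(List<String> docs, String phrase) {
        String[] phraseTokens = tokenize(phrase);
        int hits = 0;
        for (String doc : docs) {
            if (containsPhrase(tokenize(doc), phraseTokens)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented documents for the example.
        List<String> docs = Arrays.asList(
            "elevated brain natriuretic peptide in plasma",
            "peptide found in brain tissue",
            "Brain natriuretic peptide and brain natriuretic peptide receptors");
        System.out.println(countHits(docs, "brain natriuretic peptide"));
    }
}
```

Note that this counts matching documents, not occurrences: the third document contains the phrase twice but contributes one hit.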
On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:
>
> Hi .
> I have phrases like brain natriuretic peptide indexed as a single tok