Re: Term Frequency for phrases

2010-01-08 Thread Erick Erickson
What are the associated Analyzers for your Gene and Token? Because if they're NOT something akin to KeywordAnalyzer, you have a problem. Specifically, most of the "regular" tokenizers will break this stream up into three separate terms, "brain", "natriuretic", and "peptide". If that's the case, the
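
To make the distinction above concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: `TokenizeSketch`, `keywordTerms`, and `whitespaceTerms` are hypothetical names invented for illustration, and they only approximate what a keyword-style analyzer versus a whitespace-splitting tokenizer would emit for the same field value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: hypothetical helpers, not Lucene classes.
class TokenizeSketch {
    // Keyword-style analysis: the whole field value becomes exactly one term.
    public static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    // Whitespace-style analysis: each whitespace-separated word becomes its own term.
    public static List<String> whitespaceTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String gene = "brain natriuretic peptide";
        System.out.println(keywordTerms(gene));    // one term
        System.out.println(whitespaceTerms(gene)); // three terms
    }
}
```

If the field is analyzed the second way, a lookup for the full phrase as a single term has nothing to match, which would be consistent with a reported frequency of 0.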

Re: Term Frequency for phrases

2010-01-08 Thread Jason Rutherglen
I'm not going to go into too much code-level detail; however, I'd index the phrases both as tri-gram shingles and as uni-grams. I think this'll give you the results you're looking for. You'll be able to quickly recall the count of a given phrase, i.e. a tri-gram such as "blue_shorts_burough" On Fri, J
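
A rough plain-Java sketch of the shingle idea, under stated assumptions: `ShingleSketch`, `shingles`, and `termCounts` are hypothetical names, the `_` separator simply mirrors the example above, and a real index would produce these terms through an analyzer chain rather than a hand-rolled loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical helpers, not Lucene classes.
class ShingleSketch {
    // Join every run of n consecutive tokens with '_' to form word n-grams (shingles).
    public static List<String> shingles(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join("_", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Count uni-grams and tri-grams together, as if both were indexed as terms.
    public static Map<String, Integer> termCounts(List<String> tokens) {
        List<String> terms = new ArrayList<>(shingles(tokens, 1));
        terms.addAll(shingles(tokens, 3));
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "brain", "natriuretic", "peptide", "levels",
                "and", "brain", "natriuretic", "peptide");
        Map<String, Integer> counts = termCounts(tokens);
        System.out.println(counts.get("brain_natriuretic_peptide"));
    }
}
```

With both forms counted, the frequency of a three-word phrase becomes a single term lookup instead of a positional match.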

Re: Term Frequency for phrases

2010-01-08 Thread hrishim
@All: Elaborating the problem. The phrase is being indexed as a single token ... I have a Gene tag in the XML document which is like brain natriuretic peptide. This phrase is present in the abstract text for the given document. Code is as follows: doc.add(new Field("Gene", geneName, Field.Store.YES

Re: Term Frequency for phrases

2010-01-08 Thread Grant Ingersoll
When do you detect that they are phrases? During indexing or during search? On Jan 8, 2010, at 5:16 AM, hrishim wrote: > > Hi . > I have phrases like brain natriuretic peptide indexed as a single token > using Lucene. > When I calculate the term frequency for the same the count is 0 since the

Re: Term Frequency for phrases

2010-01-08 Thread Erick Erickson
On a quick read, your statements are contradictory <<>> <<>> Either "brain natriuretic peptide" is a single token/term or it's not. Are you sure you're not confusing indexing and storing? What analyzer are you using at index time? Erick On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:

Re: Term Frequency for phrases

2010-01-08 Thread Michael McCandless
Issue a PhraseQuery and count how many hits came back? Is that too slow? If so, you could detect all phrases during indexing and add them as tokens to the index? Mike On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote: > > Hi . > I have phrases like brain natriuretic peptide indexed as a single tok
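
For readers unsure what "count how many hits came back" measures, a small plain-Java sketch may help. It is not Lucene's PhraseQuery: `PhraseCountSketch`, `occurrences`, and `hits` are hypothetical names, and documents are assumed to be already split into tokens. It separates the number of matching documents (hits) from the number of times the phrase occurs inside one document.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helpers, not Lucene classes.
class PhraseCountSketch {
    // Number of positions in one document where the phrase tokens appear consecutively.
    public static int occurrences(List<String> doc, List<String> phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }

    // Number of documents that contain the phrase at least once (the "hits").
    public static int hits(List<List<String>> docs, List<String> phrase) {
        int matching = 0;
        for (List<String> doc : docs) {
            if (occurrences(doc, phrase) > 0) {
                matching++;
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("brain", "natriuretic", "peptide");
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("brain", "natriuretic", "peptide", "rises"),
                Arrays.asList("peptide", "brain", "natriuretic"),
                Arrays.asList("brain", "natriuretic", "peptide", "and",
                        "brain", "natriuretic", "peptide"));
        System.out.println(hits(docs, phrase));
    }
}
```

The two numbers differ whenever a document repeats the phrase, so which one is wanted decides whether a hit count alone is enough.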