Re: [lucy-user] C library - Phrase Searches

2017-12-04 Thread serkanmula...@gmail.com
Hi Nick,

Thank you very much for your response. I think you are right that it should be 
very similar to an ANDQuery. I looked at the PhraseMatcher and saw that there is 
an additional for loop (over the query terms) that checks whether the positions 
of the terms are consecutive (to ensure that it is a phrase). I was concerned 
about the cost of this loop, but on reflection it only multiplies the 
complexity of an ANDQuery by a small factor (the number of terms in the phrase 
query).
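
To make the consecutive-position check concrete, here is a minimal sketch in C. 
It is not the code from PhraseMatcher.c: the TermPositions struct and both 
function names are hypothetical, and it assumes each term's positions within 
one document are available as a sorted array. For every occurrence of the first 
term, it loops over the remaining query terms and checks that term i occurs 
exactly i positions later.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: sorted positions of one term in one document. */
typedef struct {
    const uint32_t *positions;
    size_t          count;
} TermPositions;

/* Return 1 if `target` occurs in the sorted position array, else 0. */
static int
has_position(const TermPositions *tp, uint32_t target) {
    size_t lo = 0;
    size_t hi = tp->count;
    while (lo < hi) {                      /* binary search for lower bound */
        size_t mid = lo + (hi - lo) / 2;
        if (tp->positions[mid] < target) { lo = mid + 1; }
        else                             { hi = mid; }
    }
    return lo < tp->count && tp->positions[lo] == target;
}

/* Count phrase occurrences in one document. terms[i] holds the positions
 * of the i-th phrase term; a phrase starts at position p when term i
 * occurs at position p + i for every i. */
size_t
count_phrase_occurrences(const TermPositions *terms, size_t num_terms) {
    size_t matches = 0;
    if (num_terms == 0) { return 0; }
    for (size_t a = 0; a < terms[0].count; a++) {
        uint32_t start = terms[0].positions[a];
        size_t   i     = 1;
        /* The extra loop over the query terms. */
        while (i < num_terms
               && has_position(&terms[i], start + (uint32_t)i)) {
            i++;
        }
        if (i == num_terms) { matches++; }
    }
    return matches;
}
```

In this sketch the inner loop runs at most once per query term for each 
candidate start position, which is where the small multiplicative factor comes 
from.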

Thanks again,
Serkan



On 2017-12-04 07:20, Nick Wellnhofer wrote:
> On 28/11/2017 18:55, serkanmula...@gmail.com wrote:
> > My question is how such queries are handled in the library. Is it by 
> > looking at the consecutive term positions in documents?
> 
> Yes.
> 
> > What is the performance impact for such queries?
> 
> This depends on how you quantify "performance impact", but in general, 
> performance should be similar to an ANDQuery of all terms in the phrase.
> 
> > Secondly, how are they scored? Is it still tf/idf? If so, what are the 
> > definitions of tf and idf for these queries?
> 
> It's still tf/idf. For idf, the sum of each term's idf is used. For tf, it's 
> the number of phrases in a document.
> 
> For more details, see PhraseQuery.c and PhraseMatcher.c in core/Lucy/Search.
> 
> Nick
> 


Re: [lucy-user] C library - Phrase Searches

2017-12-04 Thread Nick Wellnhofer

On 28/11/2017 18:55, serkanmula...@gmail.com wrote:

> My question is how such queries are handled in the library. Is it by 
> looking at the consecutive term positions in documents?


Yes.


> What is the performance impact for such queries?


This depends on how you quantify "performance impact", but in general, 
performance should be similar to an ANDQuery of all terms in the phrase.



> Secondly, how are they scored? Is it still tf/idf? If so, what are the 
> definitions of tf and idf for these queries?


It's still tf/idf. For idf, the sum of each term's idf is used. For tf, it's 
the number of phrases in a document.
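
As an illustration of those two definitions only, here is a minimal sketch in 
C. The function names are hypothetical, the per-term idf values are taken as 
given inputs, and the plain product of tf and idf is an assumption; any 
damping or length normalization applied by the library is not shown. See 
PhraseQuery.c and PhraseMatcher.c for the real code.

```c
#include <stddef.h>

/* Phrase idf as described above: the sum of each term's idf.
 * term_idfs holds precomputed per-term idf values (assumed input). */
double
phrase_idf(const double *term_idfs, size_t num_terms) {
    double sum = 0.0;
    for (size_t i = 0; i < num_terms; i++) {
        sum += term_idfs[i];
    }
    return sum;
}

/* Raw tf * idf weight for one document, where tf is the number of
 * times the whole phrase occurs in that document. The plain product
 * is an assumption made for this sketch. */
double
phrase_weight(size_t phrase_count, double idf_sum) {
    return (double)phrase_count * idf_sum;
}
```

For example, with per-term idf values of 1.5, 2.0 and 0.5, the phrase idf is 
4.0, and a document containing the phrase twice gets a raw weight of 8.0 under 
these assumptions.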


For more details, see PhraseQuery.c and PhraseMatcher.c in core/Lucy/Search.

Nick


RE: [lucy-user] C library - Phrase Searches

2017-11-28 Thread Zebrowski, Zak
You want to look at how the documents are being analyzed. Look at the 
Lucy::Analysis set of Perl modules (for hints as to how this is handled in C).  
You can write your own analyzer to handle special cases for you.  $0.02

Zachary Zebrowski

-----Original Message-----
From: serkanmula...@gmail.com [mailto:serkanmula...@gmail.com] 
Sent: Tuesday, November 28, 2017 12:56 PM
To: user@lucy.apache.org
Subject: [lucy-user] C library - Phrase Searches

Hi guys again :)

I have a question regarding phrase searches and their scoring. As far as I can 
see, when we search for a phrase in quotation marks, e.g. "the united states", 
only messages that contain "the united states" are returned. (To be more exact, 
messages containing "the unite state" would be returned as well.)

My question is how such queries are handled in the library. Is it by 
looking at the consecutive term positions in documents? What is the performance 
impact for such queries?

Secondly, how are they scored? Is it still tf/idf? If so, what are the 
definitions of tf and idf for these queries?

Thanks as always,
Serkan