Re: Query madness with NOTs...

2004-01-23 Thread Jim Hargrave
Thanks Otis,
 
(We are developing our own customized query parser - so I thought the dev group
was more appropriate.)
 
 Sorry. I forgot a + on the second query. Should be:
 
+A -B

+A -(-B)
 
My friend explained this as: -B by itself matches nothing, and -(nothing) matches
everything. When you AND A with everything, you get a different result than "A AND NOT B".
Is this right?
 
But I think you have already answered my main question. We already have a
customized version of QueryParser so we will probably need to fix these problems
in order to generate the right query. You say these things have been brought up
before, but not fixed. Is the solution considered difficult? I'll take a look
myself - if I make progress I'll repost with the code.
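
In case it helps, the query we actually want for "A AND NOT B" would look roughly like
this when built by hand (just a sketch, using the BooleanQuery.add(query, required,
prohibited) form; "field" is a placeholder field name):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AndNotSketch
{
    // "+A -B": the A clause is required, the B clause is prohibited.
    public static BooleanQuery build()
    {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term("field", "A")), true, false);   // +A (required)
        bq.add(new TermQuery(new Term("field", "B")), false, true);   // -B (prohibited)
        return bq;
    }
}

Presumably this is what a fixed parser would emit instead of "+A +(-B)".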
 
Jim
 
--- Jim Hargrave <[EMAIL PROTECTED]> wrote:
> Can anyone tell me why these two queries would produce different
> results:
>  
> +A -B
>  
> A -(-B) 

A and +A are not the same thing when you have multiple terms in a
query.

> Also, we are having a hard time understanding why the Query parser
> takes this
> query: "A AND NOT B" and returns this "+A +(-B)". Shouldn't this be
> "+A -B"?

Maybe it should.  QueryParser is not the smartest piece of code,
unfortunately, and this issue has been discussed several times before. 
It looks like QP is just translating things 'nicely' left to right and
not looking for 'AND NOT' and turning that into '-'.

Otis

> The first gives incorrect results, the latter works as expected.
>  
>  
> Jim 
>  
>  
> 



Re: Are deleted words allowed in a sloppy phrase query?

2004-01-09 Thread Jim Hargrave
Thanks Erik. Sorry about the personal post. GroupWise must not be posting as it should - 
I see the message locally, but it must not have gone out to the mailing list.
 
From your description I may have no choice but to hack a custom version of Lucene. I do 
think that a "string edit distance" version of PhraseQuery would be beneficial. If you 
break your words into character n-grams, it would allow you to search languages which 
have no easy stemming algorithms or word boundaries (like Thai, Cambodian, Laotian, 
etc.). There are some n-gram-based IR systems out there that show this works pretty 
well, for English at least. Since we are only interested in keyword matching, it does a 
fair job for the languages we have tried.
 
If anybody else has an idea that would allow me to modify PhraseQuery to do a full 
"string edit distance" search, I would appreciate it.
 
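What I have in mind is the usual dynamic-programming edit distance, but computed over
token (or n-gram) sequences rather than characters - roughly the sketch below (plain
Java; the class and method names are only for illustration, and it is not tied to any
Lucene internals):

public class EditDistance
{
    // Classic Levenshtein distance over token sequences: the number of
    // insertions, deletions and substitutions needed to turn 'a' into 'b'.
    public static int editDistance(String[] a, String[] b)
    {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++)
            {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length];
    }

    public static void main(String[] args)
    {
        String[] query = { "the", "quick", "brown", "fox" };
        String[] doc   = { "the", "quick", "red", "fox" };
        System.out.println(editDistance(query, doc));   // prints 1 (one substitution)
    }
}
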
Jim Hargrave

>>> "Erik Hatcher" <[EMAIL PROTECTED]> 01/08/04 01:43PM >>>
On Jan 7, 2004, at 3:54 PM, Jim Hargrave wrote:
> Looks like I will have to implement my own PhraseQuery that uses a 
> standard string edit distance measure. What is the easiest way to do 
> this? Should I override PhraseQuery - then override the 
> SloppyPhraseScorer? I have my own query parser so I can make any 
> adjustments needed when building a query.

Probably best to keep this on the lucene-user e-mail list, but it is 
non-trivial to implement a custom Query.   While PhraseQuery itself can 
be extended, there are several pieces it uses which are currently 
scoped at package visibility level only.

Even if you are using the built-in QueryParser, you can override the 
method that constructs the PhraseQuery.

>  BTW: We have implemented a multilingual key word in context 
> application that provides exact, stemmed and fuzzy search for ANY 
> language. Well we will have fuzzy search when I finish these 
> modifications. Lucene rules!
>

Nice!

Erik






Re: Getting exact term positions for each document inside a collect method...

2003-07-01 Thread Jim Hargrave
Our application indexes and retrieves sentences from a large database. Our terms are 
overlapping character n-grams. In order to calculate our custom score we need to know the 
(relative) position of each n-gram in the matched sentences. I'm currently using a boolean 
query (each n-gram in a big 'OR' statement). I will investigate customizing the query as 
you suggest.
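
The query I'm building looks roughly like this (just a sketch; the field name "sentence" 
and the n-gram size are placeholders):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class NgramQuerySketch
{
    // One optional (OR) clause per overlapping character n-gram of the sentence.
    public static BooleanQuery build(String sentence, int n)
    {
        BooleanQuery bq = new BooleanQuery();
        for (int i = 0; i + n <= sentence.length(); i++)
        {
            String gram = sentence.substring(i, i + n);
            bq.add(new TermQuery(new Term("sentence", gram)), false, false);  // not required, not prohibited
        }
        return bq;
    }
}

(Depending on the Lucene version, BooleanQuery also caps the number of clauses per query, 
so very long sentences may need that limit raised.)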

Basically we are using Lucene as a Translation Memory tool! Pretty cool. Lucene is 
wonderful and I think we can use it in many of our linguistic projects (terminology, 
concordance, TM, etc.).

Jim

>>> [EMAIL PROTECTED] 06/30/03 10:56 AM >>>
Jim Hargrave wrote:
> I've defined my own collector (I want the raw score before it is normalized between 
> 1.0 and 0.0). For each document I need to know the matching term positions in 
> the document.  I've seen the methods in IndexReader, but how can I access them 
> inside my collect method? Are there other methods I am missing? 

No, this information is not available to the hit collector.

Why do you need this?  If it is only for summaries, then you're probably 
better off re-tokenizing those few documents that you wish to summarize. 
If it is for query evaluation, then you're probably better off writing 
a new class of query (which is non-trivial).

Doug
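
For reference, the term-position information mentioned above can at least be read outside 
the collector through IndexReader.termPositions(Term). A minimal sketch; details may vary 
between Lucene versions:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PositionDump
{
    // Print every (document, position) pair for a single term.
    public static void dump(IndexReader reader, Term term) throws java.io.IOException
    {
        TermPositions tp = reader.termPositions(term);
        try
        {
            while (tp.next())                        // advance to the next matching document
            {
                int doc = tp.doc();
                int freq = tp.freq();                // number of occurrences in this document
                for (int i = 0; i < freq; i++)
                {
                    int pos = tp.nextPosition();     // token position within the document
                    System.out.println(doc + " -> " + pos);
                }
            }
        }
        finally
        {
            tp.close();
        }
    }
}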





Getting exact term positions for each document inside a collect method...

2003-06-19 Thread Jim Hargrave
I've defined my own collector (I want the raw score before it is normalized between 
1.0 and 0.0). For each document I need to know the matching term positions in the 
document.  I've seen the methods in IndexReader, but how can I access them inside my 
collect method? Are there other methods I am missing? 
 

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.TermQuery;

Term a = new Term("field", "a");
Term b = new Term("field", "b");
Term c = new Term("field", "c");

class MyCollector extends HitCollector
{
    public final void collect(int doc, float score)
    {
        // Need to know all matching term positions for 'doc' here,
        // to build a bit vector marking the position of each matched term.
    }
}

// OR the three terms together (neither required nor prohibited).
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(a), false, false);
bq.add(new TermQuery(b), false, false);
bq.add(new TermQuery(c), false, false);

HitCollector col = new MyCollector();

// 'searcher' is an already-open Searcher (e.g. an IndexSearcher).
searcher.search(bq, col);
Hits hits = searcher.search(bq);




Re: String similarity search vs. typical IR application...

2003-06-06 Thread Jim Hargrave
Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But 
DASG+Lev does look interesting.
 
Our app is a linguistic application. We want to search for sentences which have many 
n-grams in common and rank them based on the score below. It is similar to the TELLTALE 
system (do a Google search for TELLTALE + n-grams), but we are not interested in IR per 
se - we want to compute a score based on pure string similarity. Sentences are docs, 
n-grams are terms.
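
Concretely, the score is just Dice's coefficient over the two n-gram sets - something 
like the sketch below (plain Java; the class and method names are only for illustration):

import java.util.HashSet;
import java.util.Set;

public class DiceScore
{
    // Unique overlapping character n-grams of a string.
    static Set<String> ngrams(String s, int n)
    {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + n <= s.length(); i++)
        {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Dice's coefficient 2C/(Q+D): C = shared n-grams, Q and D = unique
    // n-gram counts of the query and the document (sentence).
    public static double dice(String query, String doc, int n)
    {
        Set<String> q = ngrams(query, n);
        Set<String> d = ngrams(doc, n);
        int qSize = q.size();
        int dSize = d.size();
        q.retainAll(d);                       // q now holds the shared n-grams (C)
        return (2.0 * q.size()) / (qSize + dSize);
    }

    public static void main(String[] args)
    {
        System.out.println(dice("the cat sat", "the cat spat", 3));   // 2*7/(9+10), about 0.74
    }
}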
 
Jim

>>> [EMAIL PROTECTED] 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look up DNA strings effectively. You would 
use DASG+Lev (see my previous post from 05/30/2003, 19:16 CEST).

-g-

Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input string 
>and we want to find all "fuzzy" variants of the input string in the DB. The score is 
>basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in 
>common, Q is the number of unique query terms and D is the number of unique document 
>terms. Our documents will be sentences.
> 
>I know Lucene has a fuzzy search capability - but I assume this would be very slow 
>since it must search through the entire term list to find candidates.
> 
>In order to do the calculation I will need to have 'C' - the number of terms in 
>common between query and document. Is there an API that I can call to get this info? 
>Any hints on what it will take to modify Lucene to handle these kinds of queries? 
>  
>





String similarity search vs. typical IR application...

2003-06-06 Thread Jim Hargrave
Our application is a string similarity searcher where the query is an input string and 
we want to find all "fuzzy" variants of the input string in the DB. The score is 
basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in 
common, Q is the number of unique query terms and D is the number of unique document 
terms. Our documents will be sentences.
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow 
since it must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in common 
between query and document. Is there an API that I can call to get this info? Any 
hints on what it will take to modify Lucene to handle these kinds of queries? 
 
BTW: 
Ever consider using Lucene for DNA searching? - this technique could also be used to 
search large DNA databases.
 
Thanks!
 
Jim Hargrave

