Re: Query madness with NOTs...
Thanks Otis. (We are developing our own customized query parser, so I thought the dev group was more appropriate.)

Sorry, I forgot a + on the second query. It should be:

  +A -B
  +A -(-B)

My friend explained this as: -B matches nothing, and -(nothing) matches everything. When you AND A with everything, you get a different result than "A AND NOT B". Is this right?

But I think you have already answered my main question. We already have a customized version of QueryParser, so we will probably need to fix these problems in order to generate the right query. You say these things have been brought up before but not fixed. Is the solution considered difficult? I'll take a look myself - if I make progress I'll repost with the code.

Jim

--- Jim Hargrave <[EMAIL PROTECTED]> wrote:

> Can anyone tell me why these two queries would produce different
> results:
>
> +A -B
>
> A -(-B)

A and +A are not the same thing when you have multiple terms in a query.

> Also, we are having a hard time understanding why the Query parser
> takes this query: "A AND NOT B" and returns this "+A +(-B)".
> Shouldn't this be "+A -B"?

Maybe it should. QueryParser is not the smartest piece of code, unfortunately, and this issue has been discussed several times before. It looks like QP is just translating things 'nicely' left to right and not looking for 'AND NOT' and turning that into '-'.

Otis

> The first gives incorrect results, the latter works as expected.
>
> Jim
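The "-B matches nothing" behavior can be illustrated without Lucene at all. The sketch below is plain Java over sets of terms, not the Lucene API; it only assumes the rule the thread describes, namely that a boolean query whose clauses are all prohibited matches no documents. Under that rule, the inner "(-B)" never matches, so prohibiting it excludes nothing, and "A -(-B)" degenerates to plain "A":

```java
import java.util.Set;

public class NotSemantics {

    /** "+A -B": must contain A, must not contain B. */
    public static boolean plusAMinusB(Set<String> doc) {
        return doc.contains("A") && !doc.contains("B");
    }

    /** The inner "(-B)" subquery: prohibited-only, so it matches no document. */
    public static boolean minusBAlone(Set<String> doc) {
        return false;
    }

    /** "A -(-B)": optional A, minus a subquery that never matches,
     *  so this is effectively just "A". */
    public static boolean aMinusMinusB(Set<String> doc) {
        return doc.contains("A") && !minusBAlone(doc);
    }

    public static void main(String[] args) {
        Set<String> onlyA = Set.of("A");
        Set<String> aAndB = Set.of("A", "B");
        System.out.println(plusAMinusB(onlyA));   // true
        System.out.println(plusAMinusB(aAndB));   // false: B is prohibited
        System.out.println(aMinusMinusB(onlyA));  // true
        System.out.println(aMinusMinusB(aAndB));  // true: differs from "+A -B"
    }
}
```

A document containing both A and B matches the second query but not the first, which is exactly the discrepancy Jim observed.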
Re: Are deleted words allowed in a sloppy phrase query?
Thanks Erik. Sorry about the personal post - GroupWise must not be posting as it should. I see the message locally, but it must not have gone out to the mailing list.

From your description I may have no choice but to hack a custom version of Lucene. I do think that a "string edit distance" version of PhraseQuery would be beneficial. If you break your words into character n-grams, it would allow you to search languages which have no easy stemming algorithms or word boundaries (like Thai, Cambodian, Laotian, etc.). There are some n-gram-based IR systems out there that show this works pretty well, for English at least. Since we are only interested in keyword matching, it does a fair job for the languages we have tried.

If anybody else has an idea that would allow me to modify PhraseQuery to do a full "string edit distance" search, I would appreciate it.

Jim Hargrave

>>> "Erik Hatcher" <[EMAIL PROTECTED]> 01/08/04 01:43PM >>>
On Jan 7, 2004, at 3:54 PM, Jim Hargrave wrote:

> Looks like I will have to implement my own PhraseQuery that uses a
> standard string edit distance measure. What is the easiest way to do
> this? Should I override PhraseQuery - then override the
> SloppyPhraseScorer? I have my own query parser so I can make any
> adjustments needed when building a query.

Probably best to keep this on the lucene-user e-mail list, but it is non-trivial to implement a custom Query. While PhraseQuery itself can be extended, there are several pieces it uses which are currently scoped at package visibility level only. Even if you are using the built-in QueryParser, you can override the method that constructs the PhraseQuery.

> BTW: We have implemented a multilingual key-word-in-context
> application that provides exact, stemmed and fuzzy search for ANY
> language. Well, we will have fuzzy search when I finish these
> modifications. Lucene rules!

Nice!

Erik
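The "standard string edit distance" Jim refers to is typically Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions turning one string into another. A minimal self-contained dynamic-programming version (plain Java, not Lucene code) looks like:

```java
public class EditDistance {

    /** Levenshtein distance via the classic two-row DP:
     *  prev[j] holds distances for the previous prefix of s,
     *  curr[j] is filled in for the current prefix. */
    public static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

A sloppy-phrase-style matcher built on this would score candidates by distance rather than by Lucene's position-based slop, which is the substitution Jim is proposing.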
Re: Getting exact term positions for each document inside a collect method...
Our application indexes and retrieves sentences from a large database. Our terms are overlapping characters (n-grams). In order to calculate our custom score we need to know the (relative) position of each n-gram in the matched sentences. I'm currently using a boolean query (each n-gram in a big 'OR' statement). I will investigate customizing the query as you suggest.

Basically we are using Lucene as a Translation Memory tool! Pretty cool. Lucene is wonderful and I think we can use it in many of our linguistic projects (terminology, concordance, TM, etc.).

Jim

>>> [EMAIL PROTECTED] 06/30/03 10:56 AM >>>
Jim Hargrave wrote:

> I've defined my own collector (I want the raw score before it is
> normalized between 1.0 and 0.0). For each document I need to know the
> matching term positions in the document. I've seen the methods in
> IndexReader, but how can I access them inside my collect method? Are
> there other methods I am missing?

No, this information is not available to the hit collector.

Why do you need this? If it is only for summaries, then you're probably better off re-tokenizing those few documents that you wish to summarize. If it is for query evaluation, then you're probably better off writing a new class of query (which is non-trivial).

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--
This message may contain confidential information, and is intended only for the use of the individual(s) to whom it is addressed.
==
Getting exact term positions for each document inside a collect method...
I've defined my own collector (I want the raw score before it is normalized between 1.0 and 0.0). For each document I need to know the matching term positions in the document. I've seen the methods in IndexReader, but how can I access them inside my collect method? Are there other methods I am missing?

    Term a = new Term("field", "a");
    Term b = new Term("field", "b");
    Term c = new Term("field", "c");

    class MyCollector extends HitCollector {
        public final void collect(int doc, float score) {
            // Need to know all matching term positions for 'doc'.
            // Build a bit vector marking the position of each matched term.
        }
    }

    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(a), false, false);
    bq.add(new TermQuery(b), false, false);
    bq.add(new TermQuery(c), false, false);

    HitCollector col = new MyCollector();
    searcher.search(bq, col);

    Hits hits = searcher.search(bq);
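Since the positions are not available inside the collector, Doug's re-tokenizing suggestion can be sketched outside Lucene: re-generate the n-grams of a matched sentence and mark the positions where a query term occurs. This is a plain-Java illustration (the class and helper names are made up for the sketch, not Lucene API):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NgramPositions {

    /** Overlapping character n-grams of a string; a gram's position
     *  is its start offset in the string. */
    public static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++)
            out.add(text.substring(i, i + n));
        return out;
    }

    /** Bit vector marking which n-gram positions in the sentence
     *  match some query term -- the structure MyCollector wanted. */
    public static BitSet matchedPositions(String sentence,
                                          Set<String> queryGrams, int n) {
        List<String> grams = ngrams(sentence, n);
        BitSet hits = new BitSet(grams.size());
        for (int pos = 0; pos < grams.size(); pos++)
            if (queryGrams.contains(grams.get(pos))) hits.set(pos);
        return hits;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(ngrams("cat", 2)); // {"ca", "at"}
        BitSet hits = matchedPositions("concat", query, 2);
        System.out.println(hits); // marks positions 3 ("ca") and 4 ("at")
    }
}
```

For a handful of matched sentences per query this re-tokenization is cheap, which is why it is the usual answer for summary-style needs; only query-time scoring on positions requires a custom Query class.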
Re: String similarity search vs. typical IR application...
Probably shouldn't have added that last bit - our app isn't a DNA searcher, but DASG+Lev does look interesting.

Our app is a linguistic application. We want to search for sentences which have many n-grams in common and rank them based on the score below. It is similar to the TELLTALE system (do a Google search for TELLTALE + ngrams) - but we are not interested in IR per se - we want to compute a score based on pure string similarity. Sentences are docs, n-grams are terms.

Jim

>>> [EMAIL PROTECTED] 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look DNA strings up effectively. You would use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

> Our application is a string similarity searcher where the query is an
> input string and we want to find all "fuzzy" variants of the input
> string in the DB. The score is basically Dice's coefficient: 2C/(Q+D),
> where C is the number of terms (n-grams) in common, Q is the number
> of unique query terms and D is the number of unique document terms.
> Our documents will be sentences.
>
> I know Lucene has a fuzzy search capability - but I assume this would
> be very slow since it must search through the entire term list to
> find candidates.
>
> In order to do the calculation I will need to have 'C' - the number
> of terms in common between query and document. Is there an API that I
> can call to get this info? Any hints on what it will take to modify
> Lucene to handle these kinds of queries?
String similarity search vs. typical IR application...
Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB. The score is basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.

I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.

In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries?

BTW: Ever consider using Lucene for DNA searching? This technique could also be used to search large DNA databases.

Thanks!

Jim Hargrave
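As a concrete check of the formula, Dice's coefficient over character n-grams can be computed directly. The sketch below is self-contained plain Java (not a Lucene API); it treats Q and D as the unique n-gram sets of query and document, per the definition above:

```java
import java.util.HashSet;
import java.util.Set;

public class DiceScore {

    /** Unique overlapping character n-grams of a string. */
    public static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++)
            grams.add(s.substring(i, i + n));
        return grams;
    }

    /** Dice's coefficient 2C/(Q+D): C = shared unique n-grams,
     *  Q and D = unique n-gram counts of query and document. */
    public static double dice(String query, String doc, int n) {
        Set<String> q = ngrams(query, n);
        Set<String> d = ngrams(doc, n);
        Set<String> common = new HashSet<>(q);
        common.retainAll(d);                       // common holds the C shared grams
        return 2.0 * common.size() / (q.size() + d.size());
    }

    public static void main(String[] args) {
        // "night" vs "nacht": bigram sets {ni,ig,gh,ht} and {na,ac,ch,ht}
        // share only "ht", so the score is 2*1/(4+4) = 0.25.
        System.out.println(dice("night", "nacht", 2)); // 0.25
    }
}
```

Inside Lucene, C for a big OR query is essentially the document's coordination count (the number of query terms that matched), which is why hooking the scorer's coord information is the usual starting point for this kind of modification.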