Near performance question

2004-03-30 Thread Joe Paulsen
Based on the nature of our documents, we sometimes 
experience extremely long response times when executing
NEAR operations against a document (sometimes well over 
minutes - even though the operation is restricted
to a single document).

Our analysis of the code indicates (we think):

1. It looks up each of the terms in the word.dbx file.

2. It intersects the occurrence lists. (So far so good!)

3. It takes each gid found in the occurrence list and:
   a. finds its parent, and so on right up to the root of the
      document (in dom.dbx);
   b. traverses the tree depth-first until it finds the node text
      of interest;
   c. does the expected scan to find out if the term distance
      requirement is satisfied.

We did some timings on our document (Rusticus).
It started off taking 1 second per occurrence and grew to 25 seconds.

When we changed the dom.dbx buffers, we got a significant
improvement, but it was still relatively slow (343 occurrences).

QUESTION:
It seems to us that the occurrences are ordered by gid
(and we don't do any updating).  Is there
a simple way to reuse the tree-level positioning
information from the prior occurrence when
processing the current occurrence, so that
we don't have to start again from the
document root?
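
To make the question concrete, here is a minimal Java sketch of the kind of reuse we have in mind. It is not taken from the actual code: the class name PathReuse, the parentOf map (standing in for the parent lookup against dom.dbx), and the lookup counter are all hypothetical. It keeps the root-to-node path of the previous occurrence and, for the next gid, climbs only until it reaches a node already on that path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: reuse the previous occurrence's root-to-node path
// instead of climbing to the document root for every gid.
final class PathReuse {
    // Stand-in for the parent lookup in dom.dbx (assumption: the root has no entry).
    private final Map<Integer, Integer> parentOf;
    private List<Integer> prevPath = new ArrayList<>();
    private int parentLookups = 0;

    PathReuse(Map<Integer, Integer> parentOf) {
        this.parentOf = parentOf;
    }

    // Returns the path root..gid, climbing only until a node on the previous path is met.
    List<Integer> pathTo(int gid) {
        List<Integer> climbed = new ArrayList<>();
        int current = gid;
        int joinIndex = prevPath.indexOf(current);
        while (joinIndex < 0) {
            climbed.add(current);
            Integer parent = parentOf.get(current);
            parentLookups++;              // each lookup models one store access
            if (parent == null) {
                break;                    // reached the root without meeting the old path
            }
            current = parent;
            joinIndex = prevPath.indexOf(current);
        }
        // Shared prefix of the old path, followed by the newly climbed nodes in root-first order.
        List<Integer> path = new ArrayList<>(prevPath.subList(0, joinIndex + 1));
        Collections.reverse(climbed);
        path.addAll(climbed);
        prevPath = path;
        return path;
    }

    int parentLookups() {
        return parentLookups;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> parentOf = new HashMap<>();
        parentOf.put(2, 1);
        parentOf.put(3, 1);
        parentOf.put(4, 2);
        parentOf.put(6, 4);
        parentOf.put(7, 4);
        PathReuse reuse = new PathReuse(parentOf);
        System.out.println(reuse.pathTo(6) + " after " + reuse.parentLookups() + " lookups");
        System.out.println(reuse.pathTo(7) + " after " + reuse.parentLookups() + " lookups");
    }
}
```

Whether this helps in practice depends on how close successive gids sit in the tree; with occurrences ordered by gid we would expect neighbouring occurrences to share most of their ancestors.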

Thanks,

Joe Paulsen



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Contributing to Lucene (was RE: inter-term correlation [was R e: Vector Space Model in Lucene?])

2003-11-17 Thread Joe Paulsen
Hope this isn't out of context - but Dan makes a very valid point.
Besides the potential performance slowdown if NLP were always applied to
a user's query, there are times when an exact term match is desired
without the query expansion that an NLP process normally requires.

Joe

- Original Message - 
From: Chong, Herb [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:00 AM
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


show an example document.

Herb

-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was
R e: Vector Space Model in Lucene?])


My only concern with this being integrated into lucene is that it be done
in a way that doesn't make its use mandatory.  Lucene is powerful enough
that it can be used for a lot of cases where NLP doesn't make any sense.
For example, I think that sentence boundaries would severely screw up the
project I recently did using lucene because there are no sentences, but
there is punctuation.








Browsing the list of searchable terms

2003-09-16 Thread Joe Paulsen
We would like to be able to allow our users to browse the index of
searchable terms by entering a term (or term stem) to which we will
respond with a list of words surrounding the entered word/stem (say 10
terms before and 10 terms after) along with the number of occurrences of
the terms in the documents in the database/index.

We're not quite sure how to do that yet, but our initial reading of the
documentation indicates that maybe we should use IndexReader
TermEnum(Term t).  If we are reading this correctly, the method seems to
be uni-directional only (i.e., there is no way to get the immediately
preceding terms to the user entered word/stem).  Is there a way to
retrieve a bi-directional list of terms surrounding a given word/stem?
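
To illustrate the behaviour we are after, here is a minimal Java sketch that does not use Lucene at all. It assumes the terms and their occurrence counts have already been copied out of the index into a sorted in-memory map (for instance by walking a forward-only enumeration once); the TermBrowser class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: bi-directional browsing over a sorted copy of the term list.
final class TermBrowser {
    // term -> number of occurrences, kept in sorted order
    private final NavigableMap<String, Integer> termCounts = new TreeMap<>();

    void add(String term, int occurrences) {
        termCounts.merge(term, occurrences, Integer::sum);
    }

    // Up to 'before' terms sorting below the entry, then up to 'after' terms at or above it.
    List<String> window(String entry, int before, int after) {
        List<String> preceding = new ArrayList<>();
        for (Map.Entry<String, Integer> e
                : termCounts.headMap(entry, false).descendingMap().entrySet()) {
            if (preceding.size() == before) {
                break;
            }
            preceding.add(e.getKey() + " (" + e.getValue() + ")");
        }
        Collections.reverse(preceding);   // nearest-first back to ascending order

        List<String> result = new ArrayList<>(preceding);
        int taken = 0;
        for (Map.Entry<String, Integer> e : termCounts.tailMap(entry, true).entrySet()) {
            if (taken == after) {
                break;
            }
            result.add(e.getKey() + " (" + e.getValue() + ")");
            taken++;
        }
        return result;
    }

    public static void main(String[] args) {
        TermBrowser browser = new TermBrowser();
        browser.add("apple", 3);
        browser.add("banana", 1);
        browser.add("cherry", 2);
        browser.add("date", 5);
        System.out.println(browser.window("c", 1, 2));
    }
}
```

The obvious cost is holding the whole term list in memory, which is why we would prefer a way to step backwards in the index itself.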

Thanks,

Joe Paulsen






One direction phrase searches

2003-09-02 Thread Joe Paulsen
It seems that when I do a search such as "covered wagon"~5 or the like,
the system disregards the order of my terms.  I.e., it will find covered
within 5 of wagon and it will also find wagon within 5 of covered.

Is there any way to make the system respond only to the order of the
terms as entered in the query?
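
To be precise about the behaviour I want, here is a minimal Java sketch of an order-sensitive proximity test over a list of tokens. It is independent of Lucene: the class and method names are hypothetical, and "distance" here simply means the difference between token positions, which is an assumption and not necessarily how the system defines the number after the tilde.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: match only when 'first' occurs before 'second'
// and the two are at most maxDistance token positions apart.
final class OrderedProximity {
    static boolean orderedWithin(List<String> tokens, String first, String second,
                                 int maxDistance) {
        int lastFirst = -1;               // position of the most recent 'first' seen so far
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            // Test 'second' before recording 'first' so a token never pairs with itself.
            if (token.equals(second) && lastFirst >= 0 && i - lastFirst <= maxDistance) {
                return true;
            }
            if (token.equals(first)) {
                lastFirst = i;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> inOrder = Arrays.asList("the", "covered", "wagon", "rolled", "on");
        List<String> reversed = Arrays.asList("the", "wagon", "was", "covered");
        System.out.println(orderedWithin(inOrder, "covered", "wagon", 5));
        System.out.println(orderedWithin(reversed, "covered", "wagon", 5));
    }
}
```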

Joe Paulsen

Re: Keyword search with space and wildcard

2003-08-28 Thread Joe Paulsen
Brian,

This seems akin to the phrase searching problem that I encountered (haven't
heard anything back from my posting yet), which goes as follows:  I try to
do the phrase search "center* form" but the system seems to simply ignore
the wildcard (throws it away) when processing the search, so I get only
results for "center form".  My guess is the parser is processing your search
simply as if it were "hello w".

I've got no solution - was hoping to hear something from the list.

Joe

- Original Message - 
From: Brian Campbell [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 28, 2003 4:45 PM
Subject: Keyword search with space and wildcard


 I've created an index that has a Keyword field in it.  I'm trying to do a
 search on that field where my term has a space and the wildcard character
 in it.  For example, I'll issue the following search:
 project_name:"Hello w*".  I have an entry in the project_name field of
 "Hello world".  I would expect to get a hit on this but I don't.  Is this
 not the way Lucene behaves? Am I doing something wrong?  Thanks.

 -Brian









Problem using WILDCARDS when doing proximity searches

2003-08-26 Thread Joe Paulsen
I am attempting to submit proximity searches which contain wildcards but the software 
just seems to ignore my wildcard when performing the query.

Ex:  center* form ~
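
To spell out what I expect such a query to mean, here is a minimal Java sketch, independent of Lucene and of how its query parser treats the wildcard. It first expands the prefix against a sorted set of known terms and then runs an order-insensitive proximity test with the expanded terms. All names are hypothetical, and "distance" is taken to be the difference between token positions, which is my assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: expand a trailing wildcard, then test proximity.
final class PrefixProximity {
    // All known terms starting with 'prefix' (ignores the edge case of a term
    // whose next character is U+FFFF).
    static NavigableSet<String> expandPrefix(NavigableSet<String> terms, String prefix) {
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    // True if any of firstTerms and 'second' occur within maxDistance positions, in either order.
    static boolean near(List<String> tokens, Set<String> firstTerms, String second,
                        int maxDistance) {
        int lastFirst = -1;
        int lastSecond = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            boolean isFirst = firstTerms.contains(token);
            boolean isSecond = token.equals(second);
            // Compare against earlier positions before recording this one.
            if (isFirst && lastSecond >= 0 && i - lastSecond <= maxDistance) {
                return true;
            }
            if (isSecond && lastFirst >= 0 && i - lastFirst <= maxDistance) {
                return true;
            }
            if (isFirst) {
                lastFirst = i;
            }
            if (isSecond) {
                lastSecond = i;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("center", "centered", "centers", "central", "form"));
        NavigableSet<String> expanded = expandPrefix(terms, "center");
        List<String> tokens = Arrays.asList("submit", "the", "centered", "form", "today");
        System.out.println(expanded + " -> " + near(tokens, expanded, "form", 1));
    }
}
```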

Thanks for any advice.

Joe Paulsen