NEAR performance question
Based on the nature of our documents, we sometimes experience extremely long response times when executing NEAR operations against a document (sometimes taking minutes, even though the operation is restricted to a single document).

Our analysis of the code indicates (we think) the following:

1. It looks up each of the terms in the word.dbx file.
2. It intersects the occurrence lists. (So far so good!)
3. It takes each gid found in the occurrence list and:
   a. finds its parent, right up to the root of the document (in dom.dbx);
   b. traverses the tree depth-first until it finds the node text of interest;
   c. does the expected scan to find out whether the term distance requirement is satisfied.

We did some timings on our document (Rusticus). It started off taking 1 second per occurrence and grew to 25 seconds. When we changed the dom.dbx buffers we got a significant improvement, but it was still relatively slow (343 occurrences).

QUESTION: It seems to us that the occurrences are ordered by gid (and we don't do any updating). Is there a simple way to reuse the tree-level positioning information from the prior occurrence when processing the current occurrence, so that we don't have to start again from the document root?

Thanks,
Joe Paulsen

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
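The reuse idea in the question can be sketched independently of the database. The following is only an illustration, not the actual code under discussion: `ancestor_paths` and `parent_of` are hypothetical names, `parent_of` stands in for an expensive dom.dbx read, and the sketch assumes that consecutive occurrences tend to share ancestors. Each upward walk stops at the first node that already lies on the previous occurrence's path.

```python
def ancestor_paths(gids, parent_of):
    """Yield (gid, path), where path runs from the root down to gid.

    parent_of(gid) returns the parent gid, or None for the root. It is a
    hypothetical stand-in for an expensive store lookup. Each upward walk
    stops as soon as it reaches a node on the previous path.
    """
    prev_path = []
    prev_depth = {}  # gid -> index of that gid on the previous path
    for gid in gids:
        tail = []
        node = gid
        while node is not None and node not in prev_depth:
            tail.append(node)
            node = parent_of(node)
        # Reuse the shared prefix of the previous path, if there is one.
        shared = prev_path[: prev_depth[node] + 1] if node is not None else []
        path = shared + tail[::-1]
        prev_path = path
        prev_depth = {g: depth for depth, g in enumerate(path)}
        yield gid, path


# Toy tree standing in for the stored document: child gid -> parent gid.
PARENTS = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
lookups = []


def parent_of(gid):
    lookups.append(gid)  # count the simulated store reads
    return PARENTS[gid]


for gid, path in ancestor_paths([4, 5, 6], parent_of):
    print(gid, path)
print("parent lookups:", len(lookups))
```

On this toy tree, three separate root-to-node walks would cost nine parent lookups, while reusing the previous path needs six. Whether the saving carries over to a real document depends on how the store orders gids and how deep the tree is.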
Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])
I hope this isn't out of context, but Dan makes a very valid point. Besides the potential performance slowdown if NLP were always applied to a user's query, there are times when an exact term match is desired, without the query expansion that an NLP process normally requires.

Joe

----- Original Message -----
From: Chong, Herb [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 17, 2003 10:00 AM
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

show an example document.

Herb

-----Original Message-----
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 9:48 AM
To: 'Lucene Users List'
Subject: RE: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])

My only concern with this being integrated into lucene is that it be done in a way that doesn't make its use mandatory. Lucene is powerful enough that it can be used for a lot of cases where NLP doesn't make any sense. For example, I think that sentence boundaries would severely screw up the project I recently did using lucene because there are no sentences, but there is punctuation.
Browsing the list of searchable terms
We would like to let our users browse the index of searchable terms by entering a term (or term stem). We would respond with a list of the terms surrounding the entered word/stem (say, 10 terms before and 10 terms after), along with the number of occurrences of each term in the documents in the database/index.

We're not quite sure how to do that yet, but our initial reading of the documentation suggests that we should use the IndexReader method terms(Term t), which returns a TermEnum. If we are reading this correctly, the enumeration is uni-directional only (i.e., there is no way to get the terms immediately preceding the user-entered word/stem).

Is there a way to retrieve a bi-directional list of terms surrounding a given word/stem?

Thanks,
Joe Paulsen
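One way to get a two-sided window out of a forward-only enumeration is to read the sorted term list once and keep it in memory, then binary-search it. The sketch below is plain Python, not a Lucene API: `browse_terms`, `sorted_terms`, and `counts` are hypothetical names, and it assumes the term list is small enough to hold in memory.

```python
from bisect import bisect_left


def browse_terms(sorted_terms, counts, query, before=10, after=10):
    """Return (term, count) pairs around the spot where query sorts.

    sorted_terms is the full term list in ascending order, and counts maps
    each term to its occurrence count. If query is not itself a term (for
    example, a stem), the window is centred on the first term that sorts
    at or after it.
    """
    pos = bisect_left(sorted_terms, query)
    start = max(0, pos - before)
    end = min(len(sorted_terms), pos + after + 1)
    return [(term, counts[term]) for term in sorted_terms[start:end]]


# Hypothetical sample data for illustration only.
terms = ["apple", "banana", "cherry", "date", "fig", "grape"]
counts = {"apple": 4, "banana": 1, "cherry": 7, "date": 2, "fig": 3, "grape": 5}

for term, count in browse_terms(terms, counts, "c", before=1, after=1):
    print(term, count)
```

If the full list is too large to cache, the same idea might work with a sampled list (say, every 100th term) used only to pick a starting point for a forward enumeration.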
One direction phrase searches
It seems that when I do a search such as "covered wagon"~5 or the like, the system disregards the order of my terms. That is, it will find covered within 5 of wagon, and it will also find wagon within 5 of covered.

Is there any way to make the system respect the order of the terms as entered in the query?

Joe Paulsen
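For reference, the ordered behaviour being asked for can be stated precisely over token positions. This is a sketch of the check itself, not of how Lucene evaluates a sloppy phrase: `ordered_within` and `max_gap` are hypothetical names, and the gap is measured here simply as the difference in token positions, which is an assumption and may not match Lucene's slop arithmetic.

```python
def ordered_within(first_positions, second_positions, max_gap):
    """True if the second term follows the first within max_gap positions.

    Both position lists must be sorted in ascending order. Only pairs in
    which the second term comes after the first are considered.
    """
    j = 0
    for p in first_positions:
        # Advance to the nearest occurrence of the second term after p.
        while j < len(second_positions) and second_positions[j] <= p:
            j += 1
        if j < len(second_positions) and second_positions[j] - p <= max_gap:
            return True
    return False


# Token positions in the text "the wagon was covered".
covered, wagon = [3], [1]
print(ordered_within(covered, wagon, 5))  # "covered" then "wagon"
print(ordered_within(wagon, covered, 5))  # "wagon" then "covered"
```

An unordered proximity test would accept both calls; the ordered one accepts only the second, because "covered" never precedes "wagon" in that text.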
Re: Keyword search with space and wildcard
Brian,

This seems akin to the phrase-searching problem that I encountered (I haven't heard anything back on my posting yet), which goes as follows: I try to do the phrase search center* form, but the system seems to simply ignore the wildcard (it throws it away) when processing the search, so I get results only for center form. My guess is that the parser is processing your search as if it were simply hello w.

I've got no solution; I was hoping to hear something from the list.

Joe

----- Original Message -----
From: Brian Campbell [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 28, 2003 4:45 PM
Subject: Keyword search with space and wildcard

I've created an index that has a Keyword field in it. I'm trying to do a search on that field where my term has a space and the wildcard character in it. For example, I'll issue the following search: project_name:Hello w*. I have an entry in the project_name field of Hello world. I would expect to get a hit on this but I don't. Is this not the way Lucene behaves? Am I doing something wrong?

Thanks.
-Brian
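The mismatch guessed at above can be illustrated without Lucene. A keyword-style field holds each value as one untokenized string, so a prefix containing a space can only match if it is compared against the whole value; a parser that splits the query on whitespace never sees that prefix. The helper names below are hypothetical, and the whitespace split is only an assumption about the parser's behaviour.

```python
def keyword_prefix_matches(values, prefix):
    """Match the prefix against each whole, untokenized field value."""
    return [value for value in values if value.startswith(prefix)]


def whitespace_clauses(query):
    """Show the separate clauses a whitespace-splitting parser would see."""
    return query.split()


# Hypothetical field values for illustration only.
project_names = ["Hello world", "Hello there", "Goodbye world"]

print(keyword_prefix_matches(project_names, "Hello w"))
print(whitespace_clauses("Hello w*"))
```

Under that assumption, building the prefix match directly against the field value, rather than passing the text through the query parser, would be the thing to try.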
Problem using WILDCARDS when doing proximity searches
I am attempting to submit proximity searches that contain wildcards, but the software just seems to ignore my wildcard when performing the query.

Example: center* form ~

Thanks for any advice.

Joe Paulsen
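The behaviour being asked for amounts to expanding the wildcard into every matching indexed term first, and only then applying the proximity test. The sketch below shows that two-step logic for a single document in plain Python. It is not how Lucene's query parser works: `expand`, `near`, and `positions_by_term` are hypothetical names, and the gap is taken as the plain difference in token positions.

```python
from fnmatch import fnmatchcase


def expand(pattern, positions_by_term):
    """Merge the positions of every indexed term matching the pattern."""
    merged = []
    for term, positions in positions_by_term.items():
        if fnmatchcase(term, pattern):
            merged.extend(positions)
    return sorted(merged)


def near(pattern_a, pattern_b, positions_by_term, max_gap):
    """True if a match of each pattern occurs within max_gap positions."""
    a = expand(pattern_a, positions_by_term)
    b = expand(pattern_b, positions_by_term)
    return any(0 < abs(x - y) <= max_gap for x in a for y in b)


# Hypothetical term -> token positions for one document.
positions = {"center": [0], "centered": [10], "form": [12], "format": [30]}

print(near("center*", "form", positions, 3))
print(near("center", "form", positions, 3))
```

In this sample data the wildcard version matches through "centered" at position 10, while the exact term "center" is too far from "form" to match.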