Re: Custom filters & document numbers
[EMAIL PROTECTED] wrote:
> Does this happen frequently? Like Stanislav has been asking... what sort
> of operations on the index cause the document number to change for any
> given document?

Documents are only renumbered after there have been deletions. Once there have been deletions, renumbering may be triggered by any document addition or index optimization. Once an index is optimized, no renumbering will be performed until more deletions are made.

> If the document numbers change frequently, is there a straightforward way
> to modify Lucene to keep the document numbers the same for the life of the
> document? I'd like to have mappings in my SQL database that point to the
> document numbers that Lucene search returns in its Hits objects.

If you require a persistent document id that survives deletions, then add it as a field to your documents.

Doug

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
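Doug's suggestion of a persistent id field can be pictured with a small plain-Java model. This is only a sketch and does not call Lucene: a snapshot is assumed to be a list of stored id-field values indexed by document number, and the class name, method name, and "sql-NN" ids are all made up for illustration. The external database would store the persistent id, and the current document number would be looked up per snapshot.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Lucene): element n of a snapshot is the id field stored in document number n. */
final class PersistentIdLookup {

    /** Builds a map from persistent id to the document number it has in this snapshot. */
    static Map<String, Integer> byPersistentId(List<String> idFieldByDocNumber) {
        Map<String, Integer> docNumberById = new HashMap<>();
        for (int docNumber = 0; docNumber < idFieldByDocNumber.size(); docNumber++) {
            docNumberById.put(idFieldByDocNumber.get(docNumber), docNumber);
        }
        return docNumberById;
    }

    public static void main(String[] args) {
        // Hypothetical ids; "sql-42" is deleted and its hole is later squeezed out.
        List<String> before = Arrays.asList("sql-17", "sql-42", "sql-99");
        List<String> after = Arrays.asList("sql-17", "sql-99");

        System.out.println(byPersistentId(before).get("sql-99")); // prints 2
        System.out.println(byPersistentId(after).get("sql-99"));  // prints 1
    }
}
```

The id "sql-99" stays valid across the renumbering even though its document number drops from 2 to 1, which is why the database should key on the field value rather than on the number.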
Re: Custom filters & document numbers
I'm also interested in knowing what can change the doc numbers. Does this happen frequently? Like Stanislav has been asking... what sort of operations on the index cause the document number to change for any given document?

If the document numbers change frequently, is there a straightforward way to modify Lucene to keep the document numbers the same for the life of the document? I'd like to have mappings in my SQL database that point to the document numbers that Lucene search returns in its Hits objects.

Thanks,
-Tom-

--- Stanislav Jordanov <[EMAIL PROTECTED]> wrote:
> The first statement is clear to me:
> I know that an IndexReader sees a 'snapshot' of the document set that was
> taken at the moment of the Reader's creation.
>
> What I don't know is whether this 'snapshot' also has its doc numbers fixed
> or whether they may change asynchronously.
> Another thing I don't know is which index operations may
> cause the (doc -> doc number) mapping to change.
> Is it only after a delete, or are there other occasions, or had I better
> not count on this at all?
>
> StJ
Re: Custom filters & document numbers
The first statement is clear to me:
I know that an IndexReader sees a 'snapshot' of the document set that was taken at the moment of the Reader's creation.

What I don't know is whether this 'snapshot' also has its doc numbers fixed or whether they may change asynchronously.
Another thing I don't know is which index operations may cause the (doc -> doc number) mapping to change.
Is it only after a delete, or are there other occasions, or had I better not count on this at all?

StJ

- Original Message -
From: "Vanlerberghe, Luc" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Thursday, February 24, 2005 4:07 PM
Subject: RE: Custom filters & document numbers

> An IndexReader will always see the same set of documents.
> Even if another process deletes some documents, adds new ones or
> optimizes the complete index, your IndexReader instance will not see
> those changes.
>
> If you detect that the Lucene index changed (e.g. by calling
> IndexReader.getCurrentVersion(...) once in a while), you should close
> and reopen your 'current' IndexReader and recalculate any data that
> relies on the Lucene document numbers.
>
> Regards, Luc.
RE: Custom filters & document numbers
An IndexReader will always see the same set of documents.
Even if another process deletes some documents, adds new ones or optimizes the complete index, your IndexReader instance will not see those changes.

If you detect that the Lucene index changed (e.g. by calling IndexReader.getCurrentVersion(...) once in a while), you should close and reopen your 'current' IndexReader and recalculate any data that relies on the Lucene document numbers.

Regards, Luc.

-Original Message-
From: Stanislav Jordanov [mailto:[EMAIL PROTECTED]
Sent: Thursday, 24 February 2005 14:18
To: Lucene Users List
Subject: Custom filters & document numbers

> Given an IndexReader, a custom filter is supposed to create a bit set
> that maps each document number to {'visible', 'invisible'}. On the other
> hand, it is stated that Lucene is allowed to change document numbers.
> Is it guaranteed that this BitSet's view of document numbers won't
> change while the BitSet is still in use (or perhaps while the
> corresponding IndexReader is still open)?
>
> And another (more low-level) question.
> When may Lucene change document numbers?
> Is it only when the index is optimized after there has been a delete
> operation?
>
> Regards: StJ
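The "check the version, then recalculate" advice above can be sketched in plain Java. This is an assumption-laden illustration rather than Lucene code: the `LongSupplier` stands in for something like the `IndexReader.getCurrentVersion(...)` call mentioned in the message, the `Supplier` stands in for "reopen the reader and rebuild whatever depends on document numbers" (a filter's BitSet, for example), and the class name `VersionedCache` is hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Caches data derived from document numbers and rebuilds it when the index version changes. */
final class VersionedCache<T> {
    private final LongSupplier currentVersion; // stand-in for an index version check
    private final Supplier<T> recompute;       // stand-in for "reopen reader and rebuild"
    private long cachedVersion;
    private T cached;
    private boolean loaded;

    VersionedCache(LongSupplier currentVersion, Supplier<T> recompute) {
        this.currentVersion = currentVersion;
        this.recompute = recompute;
    }

    /** Returns the cached value, rebuilding it first if the version has moved on. */
    T get() {
        long version = currentVersion.getAsLong();
        if (!loaded || version != cachedVersion) {
            cached = recompute.get();
            cachedVersion = version;
            loaded = true;
        }
        return cached;
    }

    public static void main(String[] args) {
        long[] version = {1L};
        int[] rebuilds = {0};
        VersionedCache<Integer> cache =
                new VersionedCache<>(() -> version[0], () -> ++rebuilds[0]);

        cache.get();
        cache.get();          // same version: no rebuild
        version[0] = 2L;      // the index changed
        cache.get();          // rebuilt once more
        System.out.println(rebuilds[0]); // prints 2
    }
}
```

The sketch is not thread-safe, and it reads the version before rebuilding; a real implementation would derive both the version and the rebuilt data from the same freshly opened reader.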
Custom filters & document numbers
Given an IndexReader, a custom filter is supposed to create a bit set that maps each document number to {'visible', 'invisible'}. On the other hand, it is stated that Lucene is allowed to change document numbers.
Is it guaranteed that this BitSet's view of document numbers won't change while the BitSet is still in use (or perhaps while the corresponding IndexReader is still open)?

And another (more low-level) question.
When may Lucene change document numbers?
Is it only when the index is optimized after there has been a delete operation?

Regards: StJ
Re: Document numbers and ids
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
> : > care about their content. I only want to know a particular numeric
> : > field from
> : > document (id of document's category).
> : > I also need to know how many docs in category were found, so I can't
> : > index
> :
> : You should explore the use of IndexReader. Index your documents with
> : category id field, and use the methods on IndexReader to find all
> : unique categories (TermEnum).
>
> to expand on erik's suggestion: once you know the complete list of
> categories you iterate over them and execute your search once per
> category, filtering each time on the category Id (to determine the number
> of results from that category).

Nah, I did a slightly trickier thing that promises to be faster (I have 12K categories now and there will be more). I index the docs' category ids as zero-padded keywords. Then I search for documents, sorting them by category id. Then I iterate over Hits following this scheme:

1. I have a cache that holds the ids of the documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that document and reload the cache with its category's data.

So if I found docs in N categories (N usually is not big), I need to read exactly N docs from disk; the rest of the iteration through Hits is just checking the cache (because I sort by category).

It's a pity Lucene doesn't have IndexSearcher.search( Query, Sort, HitCollector ), but if I understood Hits properly, it gives me an O( log2 ( doc_num ) ) performance impact per result set, which is perfectly acceptable.
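The two-step scheme above can be sketched in plain Java, without Lucene. The sketch assumes the hits arrive as an array of document ids already sorted by category, and it replaces "read the document and look up its category" with a caller-supplied function; `CategoryCounter`, `Category`, and `loadCategoryOf` are illustrative names, not anything from the original post.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;

/** Counts hits per category, loading category data only when the category changes. */
final class CategoryCounter {

    /** Hypothetical category record: its (zero-padded) id and the doc ids that belong to it. */
    static final class Category {
        final String id;
        final Set<Integer> docIds;

        Category(String id, Set<Integer> docIds) {
            this.id = id;
            this.docIds = docIds;
        }
    }

    /** How many times the expensive loader ran during the last count() call. */
    int loads;

    Map<String, Integer> count(int[] hitsSortedByCategory, IntFunction<Category> loadCategoryOf) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Category current = null; // step 1: the cache for the current category
        loads = 0;
        for (int docId : hitsSortedByCategory) {
            // Step 2: a doc id outside the cached category forces one (expensive) load.
            if (current == null || !current.docIds.contains(docId)) {
                current = loadCategoryOf.apply(docId);
                loads++;
            }
            counts.merge(current.id, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the hits are sorted by category, each category is loaded at most once, so hits spread over N categories cost N loads, which matches the claim in the message. The sort is what makes a single-entry cache sufficient.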
Re: Document numbers and ids
: > care about their content. I only want to know a particular numeric
: > field from
: > document (id of document's category).
: > I also need to know how many docs in category were found, so I can't
: > index

: You should explore the use of IndexReader. Index your documents with
: category id field, and use the methods on IndexReader to find all
: unique categories (TermEnum).

to expand on erik's suggestion: once you know the complete list of categories you iterate over them and execute your search once per category, filtering each time on the category Id (to determine the number of results from that category).

-Hoss
Re: Document numbers and ids
On Feb 4, 2005, at 12:24 PM, Simeon Koptelov wrote:
>> By "renumbered", it means it squeezes out holes left by deletes. The
>> actual order does not change and thus does not affect a Sort.INDEXORDER
>> sort.
>>
>> Documents are stored in the index in the order that they were indexed -
>> nothing changes this order. Document id's are not permanent if deletes
>> occur followed by an optimize.
>
> Thanks for clarification, Erik. Could you answer one more question: can I
> control the assignment of document numbers during indexing?

No, you cannot control Lucene's document id scheme - it is basically "for internal use".

> Maybe I should explain why I'm asking. I'm searching for documents, but
> for most (almost all) of them I don't really care about their content. I
> only want to know a particular numeric field from the document (the id of
> the document's category). I also need to know how many docs in each
> category were found, so I can't index categories instead of docs. The
> result set can be pretty big (30K) and all of it must be handled in an
> inner loop. So I want to use HitCollector and assign intervals of ids to
> categories of documents. This way, there's no need to actually retrieve
> the document in the inner loop. Am I on the right track?

You should explore the use of IndexReader. Index your documents with category id field, and use the methods on IndexReader to find all unique categories (TermEnum).

Erik
Re: Re: Document numbers and ids
> By "renumbered", it means it squeezes out holes left by deletes. The
> actual order does not change and thus does not affect a Sort.INDEXORDER
> sort.
>
> Documents are stored in the index in the order that they were indexed -
> nothing changes this order. Document id's are not permanent if deletes
> occur followed by an optimize.

Thanks for clarification, Erik. Could you answer one more question: can I control the assignment of document numbers during indexing? It would be very handy for me to have categories of documents aligned on some boundaries, e.g. category N numbers start on N*1. Obviously, there will be some holes in the numbering with this scheme.

Maybe I should explain why I'm asking. I'm searching for documents, but for most (almost all) of them I don't really care about their content. I only want to know a particular numeric field from the document (the id of the document's category). I also need to know how many docs in each category were found, so I can't index categories instead of docs. The result set can be pretty big (30K) and all of it must be handled in an inner loop. So I want to use HitCollector and assign intervals of ids to categories of documents. This way, there's no need to actually retrieve the document in the inner loop. Am I on the right track?

Mood: wondering why SQL GROUP BY works so fast.
Re: Document numbers and ids
On Feb 4, 2005, at 9:49 AM, Simeon Koptelov wrote:
> The LiA says that I can use Sort.INDEXORDER when indexing order is
> relevant and gives an example where documents' ids (got from Hits.id() )
> are increasing from top to bottom of resultset. Are that ids the same
> thing as document numbers?

Yes, id is the same as document number.

> If they are the same, how can it be that they are preserved during
> indexing process? LiA says that documents are renumbered when merging
> segments.

By "renumbered", it means it squeezes out holes left by deletes. The actual order does not change and thus does not affect a Sort.INDEXORDER sort.

Documents are stored in the index in the order that they were indexed - nothing changes this order. Document id's are not permanent if deletes occur followed by an optimize.

Erik
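A toy model may make "squeezing out holes" concrete. The sketch below is plain Java and only mimics the behaviour described in this message, it is not how Lucene stores segments: documents are assumed to be a list whose positions are the document numbers, and `HoleSqueezing.compact` is a made-up helper that drops deleted entries while keeping the survivors in their original order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Toy model (not Lucene): a document's number is its position in the list. */
final class HoleSqueezing {

    /** Drops deleted documents; each survivor's new number is its position in the result. */
    static <T> List<T> compact(List<T> docs, Set<Integer> deletedDocNumbers) {
        List<T> survivors = new ArrayList<>();
        for (int docNumber = 0; docNumber < docs.size(); docNumber++) {
            if (!deletedDocNumbers.contains(docNumber)) {
                survivors.add(docs.get(docNumber));
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c", "d");
        List<String> compacted = compact(docs, Collections.singleton(1)); // delete "b"

        System.out.println(docs.indexOf("c"));      // prints 2
        System.out.println(compacted.indexOf("c")); // prints 1
        System.out.println(compacted);              // prints [a, c, d]
    }
}
```

Numbers after the hole shift down, but "a" still precedes "c", which still precedes "d". That is why an index-order sort is unaffected while a stored document number is not safe to keep.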
Document numbers and ids
The LiA says that I can use Sort.INDEXORDER when indexing order is relevant and gives an example where the documents' ids (obtained from Hits.id() ) increase from the top to the bottom of the result set. Are those ids the same thing as document numbers?

If they are the same, how can they be preserved during the indexing process? LiA says that documents are renumbered when merging segments.
Re: document numbers
Hi Jonathan,

> Yet another burning question :-). Can someone explain how the document
> numbers in Lucene documents work? For example, the TermDocs.doc()
> method returns "the current doc number." How can I get this doc number
> if I just have a Document?

I don't think you can. A document does not even have to be indexed yet. So either you're dealing with a document found in the index, in which case you should have the document number already, or you have a document independent of the index, in which case you have to analyze the document's content and count yourself. Note that term vector support might be useful if you're interested in more than one term (but that requires the document number again).

Morus
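For the "count yourself" case, a rough plain-Java sketch is given below. It assumes a very simple analysis step (lower-casing and splitting on anything that is not a letter or digit), which will generally differ from whatever Lucene analyzer built the index, so the counts only match the index if the tokenization matches too. `TermCounter.frequency` is a hypothetical helper, not a Lucene call.

```java
import java.util.Locale;

/** Counts occurrences of a term in raw text using a deliberately naive tokenizer. */
final class TermCounter {

    static int frequency(String content, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        // Split on runs of characters that are neither letters nor digits.
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(frequency("The cat saw the other cat.", "cat")); // prints 2
    }
}
```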
document numbers
Yet another burning question :-). Can someone explain how the document numbers in Lucene documents work? For example, the TermDocs.doc() method returns "the current doc number." How can I get this doc number if I just have a Document?

Here's the context. I'm working on implementing Justin Zobel's similarity functions (from his paper "Exploring the Similarity Space," mentioned previously by Ian) in a retrieval system based on Lucene. I have a Lucene document and a term, for which I can get a TermDocs object from an IndexReader. I then want to get the number of occurrences of that term in the specific Lucene document mentioned above. I would like to call TermDocs.skip(<"doc # of Lucene doc">) and then freq() to get this frequency. Is this the way to get the frequency of a term in a specific document?

Jonathan
Re: Internal Document Numbers
Joe Rayguy wrote:
> So, assuming that sort as implemented in 1.4 doesn't work for me, my
> original question still stands. Do I have to worry about merges that
> occur as documents are added, or do I only have to rebuild my array
> after optimizations? Or, alternatively, how did everyone sort before 1.4?

If you've made deletions since the last optimize, then document numbers can decrease as you do additions, when the deletions are dropped.

Doug
RE: Internal Document Numbers
Tim,

Thanks for your reply. Believe me, the sorting in 1.4 is greatly welcomed, but I don't think it fits my particular needs. My criteria for the sort can change without notice on the fly, and I'd rather not have to recreate the index to accommodate this.

Using 1.3 I played with including the sort criteria (in integer form) in the index, and had every query include a range in conjunction with a custom similarity that essentially made the "sort" field the heaviest criterion. This worked, although I had concerns about the speed (including a range in every query that spanned all documents), and I still had the issue of how to change scores easily. I went as far as to hack up a tool that could modify the index itself and change the termfreq for my sorting field, but this is not the road I want to go down! :)

So, assuming that sort as implemented in 1.4 doesn't work for me, my original question still stands. Do I have to worry about merges that occur as documents are added, or do I only have to rebuild my array after optimizations? Or, alternatively, how did everyone sort before 1.4?

Joe

--- Tim Jones <[EMAIL PROTECTED]> wrote:
> the 1.4 release contains sorting code that sorts
> similarly to your description. You can get the
> latest 1.4 release here:
>
> http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc2/
>
> look at org.apache.lucene.search.Sort
RE: Internal Document Numbers
the 1.4 release contains sorting code that sorts similarly to your description. You can get the latest 1.4 release here:

http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc2/

look at org.apache.lucene.search.Sort

> -Original Message-
> From: Joe Rayguy [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 01, 2004 11:58 AM
> To: [EMAIL PROTECTED]
> Subject: Internal Document Numbers
>
> Hi,
>
> I apologize if this has been answered before, but is
> it safe to design an application that sorts hits using
> an external array based on each hit's internal
> document ID? It seems simple enough to rebuild the
> array after an optimization, but what about merges
> that occur in the course of adding documents? If I plan on
> adding documents every minute or so recreating this
> array with each addition doesn't seem feasible.
>
> Is there a recommended way to handle such an array for
> sorting results?
>
> Thanks.
Internal Document Numbers
Hi,

I apologize if this has been answered before, but is it safe to design an application that sorts hits using an external array based on each hit's internal document ID? It seems simple enough to rebuild the array after an optimization, but what about merges that occur in the course of adding documents? If I plan on adding documents every minute or so, recreating this array with each addition doesn't seem feasible.

Is there a recommended way to handle such an array for sorting results?

Thanks.
Re: document numbers
Moving this to lucene-user.

I believe so, yes. However, doc IDs are not persistent. They get reused. So running this same code sometime in the future may give you a different Document for doc ID 45.

Otis

--- Maurice Coyle <[EMAIL PROTECTED]> wrote:
> hi,
>
> i wonder could someone tell me if document numbers in a lucene index
> are consistent?
> that is, if i have an IndexReader read and i have the following code:
>
>     Document doc = read.document(45);
>
> and later on, i have the following:
>
>     Term term = new Term(...);
>
>     TermDocs termdocs = read.termDocs(term);
>     while (termdocs.next())
>     {
>         int docnum = termdocs.doc();
>
>         if (docnum == 45) System.out.println("found it!");
>     }
>
> when the above code prints out that it has found it, is it referring
> to the same document as in the first code snippet above?
>
> sorry if this is unclear, if so just say the word and i'll elaborate.
>
> thanks,
> maurice