Re: Custom filters & document numbers

2005-03-01 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
> Does this happen frequently?  Like Stanislav has been asking... what sort of
> operations on the index cause the document number to change for any given
> document?

Documents are only renumbered after there have been deletions.  Once
there have been deletions, renumbering may be triggered by any document
addition or index optimization.  Once an index is optimized, no
renumbering will be performed until more deletions are made.

> If the document numbers change frequently, is there a
> straightforward way to modify Lucene to keep the document numbers the same for
> the life of the document?  I'd like to have mappings in my sql database that
> point to the document numbers that Lucene search returns in its Hits objects.

If you require a persistent document id that survives deletions, then
add it as a field to your documents.
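
A minimal sketch of that advice, in plain Java with no Lucene on the classpath. It assumes each document was indexed with a stored field holding its own persistent id; the list below stands in for reading that field at each doc number of one reader snapshot. The class and method names are hypothetical, not Lucene API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: element i is the persistent id stored in the document at doc number i. */
final class PersistentIdSketch {

    /** Builds persistent id -> current doc number for one snapshot of the index. */
    static Map<String, Integer> docNumbersById(List<String> storedIds) {
        Map<String, Integer> byId = new HashMap<>();
        for (int docNum = 0; docNum < storedIds.size(); docNum++) {
            byId.put(storedIds.get(docNum), docNum);
        }
        return byId;
    }
}
```

The SQL side would then store the persistent id, never the doc number, and this map would be rebuilt for each new reader snapshot.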

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Custom filters & document numbers

2005-03-01 Thread tomsdepot-lucene
I'm also interested in knowing what can change the doc numbers.

Does this happen frequently?  Like Stanislav has been asking... what sort of
operations on the index cause the document number to change for any given
document?  If the document numbers change frequently, is there a
straightforward way to modify Lucene to keep the document numbers the same for
the life of the document?  I'd like to have mappings in my sql database that
point to the document numbers that Lucene search returns in its Hits objects.

Thanks,

-Tom-




Re: Custom filters & document numbers

2005-02-24 Thread Stanislav Jordanov
The first statement is clear to me:
I know that an IndexReader sees a 'snapshot' of the document set that was
taken at the moment of the Reader's creation.

What I don't know is whether this 'snapshot' also has its doc numbers fixed,
or whether they may change asynchronously.
Another thing I don't know is which index operations may
cause the (doc -> doc number) mapping to change.
Is it only after a delete, or are there other occasions, or had I better not
count on this at all?

StJ




RE: Custom filters & document numbers

2005-02-24 Thread Vanlerberghe, Luc
An IndexReader will always see the same set of documents.
Even if another process deletes some documents, adds new ones or
optimizes the complete index, your IndexReader instance will not see
those changes.

If you detect that the Lucene index changed (e.g. by calling
IndexReader.getCurrentVersion(...) once in a while), you should close
and reopen your 'current' IndexReader and recalculate any data that
relies on the Lucene document numbers.
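
A rough sketch of that check-and-recalculate pattern, in plain Java with no Lucene dependency. The LongSupplier stands in for polling the index version (for example via IndexReader.getCurrentVersion(...)), and the Supplier stands in for reopening the reader and rebuilding whatever depends on doc numbers. The class itself is hypothetical.

```java
import java.util.BitSet;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Rebuilds doc-number-dependent data whenever the observed index version changes. */
final class VersionedBitsCache {
    private final LongSupplier currentVersion; // stand-in for polling the index version
    private final Supplier<BitSet> recompute;  // stand-in for reopen + rebuild of the bits
    private long cachedVersion;
    private BitSet cachedBits;                 // null until the first call to bits()

    VersionedBitsCache(LongSupplier currentVersion, Supplier<BitSet> recompute) {
        this.currentVersion = currentVersion;
        this.recompute = recompute;
    }

    /** Returns the cached bits, recomputing them first if the version moved. */
    synchronized BitSet bits() {
        long version = currentVersion.getAsLong();
        if (cachedBits == null || version != cachedVersion) {
            cachedBits = recompute.get();
            cachedVersion = version;
        }
        return cachedBits;
    }
}
```

A BitSet handed out earlier stays valid only for the reader snapshot it was built from, so callers should not mix bits from one snapshot with a reader opened on another.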

Regards, Luc.




Custom filters & document numbers

2005-02-24 Thread Stanislav Jordanov
Given an IndexReader, a custom filter is supposed to create a bit set that
maps each document number to {'visible', 'invisible'}.
On the other hand, it is stated that Lucene is allowed to change document
numbers.
Is it guaranteed that this BitSet's view of document numbers won't change
while the BitSet is still in use (or perhaps while the corresponding
IndexReader is still open)?

And another (more low-level) question.
When may Lucene change document numbers?
Is it only when the index is optimized after there has been a delete
operation?

Regards: StJ





Re: Document numbers and ids

2005-02-06 Thread Simeon Koptelov
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
> : > care about their content. I only want to know a particular numeric
> : > field from
> : > document (id of document's category).
> : > I also need to know how many docs in category were found, so I can't
> : > index
> :
> : You should explore the use of IndexReader.  Index your documents with
> : category id field, and use the methods on IndexReader to find all
> : unique categories (TermEnum).
>
> to expand on erik's suggestion: once you know the complete list of
> categories you iterate over them and execute your search once per
> category, filtering each time on the category Id (to determine the number
> of results from that category).

Nah, I did something a little trickier that promises to be faster (I have 12K
categories now and there will be more).
I index the docs' category ids as zero-padded keywords. Then I search for
documents, sorting them by category id. Then I iterate over Hits following
this scheme:
1. I keep a cache that holds the ids of the documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that
document and reload the cache with its category data.

So if I find docs in N categories (N is usually not big), I need to
read exactly N docs from disk; the rest of the iteration through Hits is just
checking the cache (because I sort by category).

It's a pity Lucene doesn't have IndexSearcher.search( Query, Sort,
HitCollector ), but if I understood Hits properly, it gives me an O( log2
( doc_num ) ) performance impact per result set, which is perfectly
acceptable.
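
The idea can be sketched in plain Java, without Lucene. This simplified version works directly on the category keys of the hits, assumed to be already sorted, rather than on doc ids with a per-category cache. The width of 8 digits and all names here are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CategoryRunsSketch {

    /** Zero-pads a non-negative id so that string order matches numeric order. */
    static String pad(int categoryId) {
        return String.format(Locale.ROOT, "%08d", categoryId);
    }

    /**
     * Counts hits per category. The keys must already be sorted, so each
     * category forms one contiguous run and is "loaded" only once.
     */
    static Map<String, Integer> countRuns(List<String> sortedKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String current = null;
        int count = 0;
        for (String key : sortedKeys) {
            if (!key.equals(current)) {        // a new run starts here
                if (current != null) {
                    counts.put(current, count);
                }
                current = key;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            counts.put(current, count);        // flush the last run
        }
        return counts;
    }
}
```

Without the padding, "10" would sort before "9" as a string, which would break the one-run-per-category assumption.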




Re: Document numbers and ids

2005-02-06 Thread Chris Hostetter
: > care about their content. I only want to know a particular numeric
: > field from
: > document (id of document's category).
: > I also need to know how many docs in category were found, so I can't
: > index

: You should explore the use of IndexReader.  Index your documents with
: category id field, and use the methods on IndexReader to find all
: unique categories (TermEnum).

to expand on erik's suggestion: once you know the complete list of
categories you iterate over them and execute your search once per
category, filtering each time on the category Id (to determine the number
of results from that category).



-Hoss





Re: Document numbers and ids

2005-02-04 Thread Erik Hatcher
On Feb 4, 2005, at 12:24 PM, Simeon Koptelov wrote:
> > By "renumbered", it means it squeezes out holes left by deletes.  The
> > actual order does not change and thus does not affect a
> > Sort.INDEXORDER sort.
> >
> > Documents are stored in the index in the order that they were indexed -
> > nothing changes this order.  Document id's are not permanent if
> > deletes occur followed by an optimize.
>
> Thanks for clarification, Erik. Could you answer one more question: can I
> control the assignment of document numbers during indexing?

No, you cannot control Lucene's document id scheme - it is basically
"for internal use".

> Maybe I should explain, why I'm asking.
> I'm searching for documents, but for most (almost all) of them I don't
> really care about their content. I only want to know a particular numeric
> field from document (id of document's category).
> I also need to know how many docs in category were found, so I can't
> index categories instead of docs.
> The result set can be pretty big (30K) and all must be handled in
> inner loop.
> So I wanna use HitCollector and assign intervals of ids to categories
> of documents. Following this way, there's no need to actually retrieve
> document in inner loop.
>
> Am I on the right way?

You should explore the use of IndexReader.  Index your documents with
a category id field, and use the methods on IndexReader to find all
unique categories (TermEnum).

Erik


Re: Re: Document numbers and ids

2005-02-04 Thread Simeon Koptelov
> By "renumbered", it means it squeezes out holes left by deletes.  The 
> actual order does not change and thus does not affect a Sort.INDEXORDER 
> sort.
> 
> Documents are stored in the index in the order that they were indexed - 
> nothing changes this order.  Document id's are not permanent if deletes 
> occur followed by an optimize.

Thanks for the clarification, Erik. Could you answer one more question: can I
control the assignment of document numbers during indexing? It would be very
handy for me to have categories of documents aligned on some boundaries, e.g.
category N numbers start on  N*1. Obviously, there will be some holes in
the numbering with this scheme.

Maybe I should explain why I'm asking.
I'm searching for documents, but for most (almost all) of them I don't really
care about their content. I only want to know a particular numeric field from
the document (the id of the document's category).
I also need to know how many docs in each category were found, so I can't index
categories instead of docs.
The result set can be pretty big (30K) and it must all be handled in an inner
loop.
So I want to use a HitCollector and assign intervals of ids to categories of
documents. That way, there's no need to actually retrieve the document
in the inner loop.

Am I on the right track?

Mood: wondering, why SQL GROUP BY works so fast.




Re: Document numbers and ids

2005-02-04 Thread Erik Hatcher
On Feb 4, 2005, at 9:49 AM, Simeon Koptelov wrote:
> The LiA says that I can use Sort.INDEXORDER when indexing order is relevant
> and gives an example where documents' ids (got from Hits.id() ) are
> increasing from top to bottom of resultset. Are that ids the same thing as
> document numbers?

Yes, id is the same as document number.

> If they are the same, how can it be that they are preserved during indexing
> process? LiA says that documents are renumbered when merging segments.

By "renumbered", it means it squeezes out holes left by deletes.  The
actual order does not change and thus does not affect a Sort.INDEXORDER
sort.

Documents are stored in the index in the order that they were indexed -
nothing changes this order.  Document id's are not permanent if deletes
occur followed by an optimize.
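
A toy model of "squeezing out holes", in plain Java. This is only an illustration of the behavior described above, not how Lucene implements merging; the names are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class RenumberSketch {

    /**
     * Drops deleted documents and returns the survivors in their original
     * order. The new doc number of each survivor is its index in the result.
     */
    static List<String> squeeze(List<String> docs, boolean[] deleted) {
        List<String> kept = new ArrayList<>();
        for (int docNum = 0; docNum < docs.size(); docNum++) {
            if (!deleted[docNum]) {
                kept.add(docs.get(docNum));
            }
        }
        return kept;
    }
}
```

With documents A, B, C, D at numbers 0 to 3 and B deleted, the result is A, C, D at numbers 0 to 2: C and D get smaller numbers, but their relative order is unchanged, which is why an index-order sort is unaffected.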

Erik


Document numbers and ids

2005-02-04 Thread Simeon Koptelov
The LiA says that I can use Sort.INDEXORDER when indexing order is relevant
and gives an example where the documents' ids (obtained from Hits.id()) are
increasing from top to bottom of the result set. Are those ids the same thing as
document numbers?

If they are the same, how can it be that they are preserved during the indexing
process? LiA says that documents are renumbered when merging segments.




Re: document numbers

2005-01-31 Thread Morus Walter
Hi Jonathan,

> Yet another burning question :-).  Can someone explain how the document 
> numbers in Lucene documents work?  For example, the TermDocs.doc() 
> method returns "the current doc number."  How can I get this doc number 
> if I just have a Document?
> 
I don't think you can.
A document does not even have to be indexed yet.

So either you're dealing with a document found in the index, in which case you
should already have the document number, or you have a document independent
of the index, in which case you have to analyze the document's content and
count yourself.

Note that term vector support might be useful if you're interested in more
than one term (but that requires the document number again).
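
For the "count yourself" case, a minimal sketch in plain Java. The tokenization here (lowercase, split on anything that is not a letter or digit) is only a stand-in; the counts will match what the index reports only if it mirrors the analyzer used at indexing time. The names are hypothetical.

```java
import java.util.Locale;

final class TermCountSketch {

    /** Counts occurrences of term in text using a simple stand-in tokenizer. */
    static int frequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }
}
```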

Morus




document numbers

2005-01-28 Thread Jonathan Lasko
Yet another burning question :-).  Can someone explain how the document 
numbers in Lucene documents work?  For example, the TermDocs.doc() 
method returns "the current doc number."  How can I get this doc number 
if I just have a Document?

Here's the context.  I'm working on implementing Justin Zobel's 
similarity functions (from his paper "Exploring the Similarity Space," 
mentioned previously by Ian) in a retrieval system based on Lucene.  I 
have a Lucene document and a term, for which I can get a TermDocs object 
from an IndexReader.  I then want to get the number of occurrences of 
that term in the specific Lucene document mentioned above.  I would like 
to call TermDocs.skipTo(<"doc # of Lucene doc">) and then freq() to get 
this frequency.  Is this the way to get the frequency of a term in a 
specific document?

Jonathan


Re: Internal Document Numbers

2004-04-01 Thread Doug Cutting
Joe Rayguy wrote:
> So, assuming that sort as implemented in 1.4 doesn't
> work for me, my original question still stands.  Do I
> have to worry about merges that occur as documents are
> added, or do I only have to rebuild my array after
> optimizations?  Or, alternatively, how did everyone
> sort before 1.4?

If you've made deletions since the last optimize, then document numbers
can decrease as you do additions, when the deletions are dropped.

Doug



RE: Internal Document Numbers

2004-04-01 Thread Joe Rayguy
Tim,

Thanks for your reply.  Believe me, the sorting in 1.4
is greatly welcomed, but I don't think it fits my
particular needs.  My criteria for the sort can change
without notice on the fly, and I'd rather not have to
recreate the index to accommodate this.

Using 1.3 I played with including the sort criteria
(in integer form) in the index, and had every query
include a range in conjunction with a custom
similarity that essentially made the "sort" field the
heaviest criteria.  This worked, although I had
concerns about the speed (including a range in every
query that spanned all documents), and I still had the
issue of how to change scores easily.  I went as far
as to hack up a tool that could modify the index
itself and change the termfreq for my sorting field,
but this is not the road I want to go down! :)

So, assuming that sort as implemented in 1.4 doesn't
work for me, my original question still stands.  Do I
have to worry about merges that occur as documents are
added, or do I only have to rebuild my array after
optimizations?  Or, alternatively, how did everyone
sort before 1.4?

Joe




RE: Internal Document Numbers

2004-04-01 Thread Tim Jones
The 1.4 release contains sorting code that sorts
similarly to your description.  You can get the
latest 1.4 release candidate here:

http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc2/

Look at org.apache.lucene.search.Sort





Internal Document Numbers

2004-04-01 Thread Joe Rayguy
Hi,

I apologize if this has been answered before, but is
it safe to design an application that sorts hits using
an external array based on each hit's internal
document ID?  It seems simple enough to rebuild the
array after an optimization, but what about merges
that occur in the course of adding documents?  If I
plan on adding documents every minute or so, recreating
this array with each addition doesn't seem feasible.

Is there a recommended way to handle such an array for
sorting results?

Thanks.




Re: document numbers

2003-06-11 Thread Otis Gospodnetic
Moving this to lucene-user.

I believe so, yes.
However, doc IDs are not persistent.  They get reused.  So running this
same code sometime in the future may give you a different Document for
doc ID 45.

Otis

--- Maurice Coyle <[EMAIL PROTECTED]> wrote:
> hi,
> 
> i wonder could someone tell me if document numbers in a lucene index are
> consistent?
> that is, if i have an IndexReader read and i have the following code:
>
> Document doc = read.document(45);
>
> and later on, i have the following:
>
> Term term = new Term(...);
>
> TermDocs termdocs = read.termDocs(term);
> while (termdocs.next())
> {
>     int docnum = termdocs.doc();
>
>     if (docnum == 45) System.out.println("found it!");
> }
>
> when the above code prints out that it has found it, is it referring to the
> same document as in the first code snippet above?
> 
> sorry if this is unclear, if so just say the word and i'll elaborate.
> 
> thanks,
> maurice
> 

