Fwd: lucene indexing and merge process

2007-10-18 Thread Erik Hatcher
Forwarding this to java-dev per request. Seems like the best place to discuss this topic. Erik Begin forwarded message: From: John Wang [EMAIL PROTECTED] Date: October 17, 2007 5:43:29 PM EDT To: [EMAIL PROTECTED] Subject: lucene indexing and merge process Hi Erik: We are

Re: lucene indexing and merge process

2007-10-18 Thread Ning Li
Make all documents have a term, say ID:UID, and for each document, store its UID in the term's payload. You can read off this posting list to create your array. Will this work for you, John? Cheers, Ning On 10/18/07, Erik Hatcher [EMAIL PROTECTED] wrote: Forwarding this to java-dev per

Re: Fwd: lucene indexing and merge process

2007-10-18 Thread Doug Cutting
Erik Hatcher wrote: 2) Load/Warmup the FieldCache (for large corpus, loading up the indexreader can be slow) With the new IndexReader#reopen(), the cost of opening a new IndexReader is much reduced. However, loading a FieldCache is not that much faster, so that may or may not be enough to

Re: Fwd: lucene indexing and merge process

2007-10-18 Thread Mark Miller
Hoss has worked on a new FieldCache implementation that should address this if finished and used with the new reopen. I have been meaning to look at it in greater detail myself, but haven't gotten to it. It sounds as if he has been a bit too busy to be able to work on it himself. It would

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-10-18 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535970 ] Yonik Seeley commented on LUCENE-743: {quote} A reader which is being used for deletes or setting norms is

Re: lucene indexing and merge process

2007-10-18 Thread robert engels
This is what I do with general search caches. It works very well. I think the same approach would work great with the field cache. I do think though that we might want direct support for this - using a fixed length field file (per segment). E.g. so that you would configure keys with n

Re: lucene indexing and merge process

2007-10-18 Thread John Wang
Hi Robert: Say I have a hit set of 1000 docids, would I need to go to the disk for each docid to get the key? Thanks -John On 10/18/07, robert engels [EMAIL PROTECTED] wrote: This is what I do with general search caches. It works very well. I think the same approach would work great

Re: lucene indexing and merge process

2007-10-18 Thread John Wang
Hi Ning: That is essentially what field cache does. Doing this for each docid in the result set will be slow if the result set is large. But loading it in memory when opening index can also be slow if the index is large and updates often. Thanks -John On 10/18/07, Ning Li [EMAIL PROTECTED]

Re: lucene indexing and merge process

2007-10-18 Thread Doug Cutting
robert engels wrote: seek (segment doc no * keylength), read (byte[keylength]) This would be very efficient when using external document storage. A seek per document in hits is to be avoided. This is similar to the way field data is stored, which is, as mentioned in the first message very
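The lookup Robert describes, seek (segment doc no * keylength) followed by read (byte[keylength]), can be sketched in plain Java. This is only an illustration under stated assumptions: `FixedLengthKeyFile`, `writeKeys` and `readKey` are hypothetical names, not Lucene APIs, and the file layout (one zero-padded ASCII key per document, in document order) is assumed rather than taken from the thread.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a per-segment fixed-length key file; not part of Lucene.
public class FixedLengthKeyFile {

    // Writes one zero-padded slot of keyLength bytes per document, in document order.
    static void writeKeys(Path file, String[] keys, int keyLength) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(file.toFile(), "rw")) {
            out.setLength(0); // start from an empty file
            for (String key : keys) {
                byte[] raw = key.getBytes(StandardCharsets.US_ASCII);
                if (raw.length > keyLength) {
                    throw new IllegalArgumentException("key longer than slot: " + key);
                }
                byte[] slot = new byte[keyLength]; // Java zero-fills new arrays
                System.arraycopy(raw, 0, slot, 0, raw.length);
                out.write(slot);
            }
        }
    }

    // seek(docNo * keyLength), then read exactly keyLength bytes.
    static byte[] readKey(RandomAccessFile in, int docNo, int keyLength) throws IOException {
        byte[] key = new byte[keyLength];
        in.seek((long) docNo * keyLength); // widen to long before multiplying
        in.readFully(key);
        return key;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("keys", ".bin");
        try {
            writeKeys(file, new String[] {"uid-0001", "uid-0002", "uid-0003"}, 8);
            try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
                byte[] key = readKey(in, 2, 8);
                System.out.println(new String(key, StandardCharsets.US_ASCII));
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Each call to `readKey` costs one seek, which is exactly the per-hit cost Doug cautions against for large result sets; the sketch shows the access pattern, not a recommendation.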

Re: lucene indexing and merge process

2007-10-18 Thread robert engels
True, but what is the other option except loading all of them in memory? On Oct 18, 2007, at 11:57 AM, Doug Cutting wrote: robert engels wrote: seek (segment doc no * keylength), read (byte[keylength]) This would be very efficient when using external document storage. A seek per document in

Re: lucene indexing and merge process

2007-10-18 Thread robert engels
As a follow-up, it seemed that in the past much of Lucene relied on the OS disk cache for performance. The FieldCache seems to go against this, probably because of the parsing involved. The 'fixed-length' key file would not need extensive parsing, and thus seems more suitable for OS level

Re: lucene indexing and merge process

2007-10-18 Thread Doug Cutting
robert engels wrote: True, but what is the other option except loading all of them in memory? Loading them into memory is the FieldCache approach. It is effective in many cases. If there's not enough memory, then Ning's proposal might provide a middle ground: efficient sequential access

Re: lucene indexing and merge process

2007-10-18 Thread Ning Li
I see what you mean by 2) now. What Mark said should work for you when it's done. Cheers, Ning On 10/18/07, John Wang [EMAIL PROTECTED] wrote: Hi Ning: That is essentially what field cache does. Doing this for each docid in the result set will be slow if the result set is large. But

Re: [jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2007-10-18 Thread Sean Timm
Roy, Thanks for the review and comments. My comments inline below. Roy Ward wrote: (1) You only added timeouts to: public TopDocs search(Weight weight, Filter filter, final int nDocs) It's confusing if timeout functionality is not also added to: public TopFieldDocs search(Weight

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-10-18 Thread Michael Busch (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536028 ] Michael Busch commented on LUCENE-743: Hmm one other thing: how should IndexReader.close() work? If we re-open

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-10-18 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536033 ] Michael McCandless commented on LUCENE-743: I think reference counting would solve this issue quite

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-10-18 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536041 ] Yonik Seeley commented on LUCENE-743: When it is closed, it decrefs the RC and marks itself closed (to make
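The reference-counting idea discussed in this issue (a close decrements the count and the reader marks itself closed) might look roughly like the following. This is a minimal sketch, not the LUCENE-743 patch: `SharedSegment` and `Reader` are hypothetical stand-ins for whatever resources readers actually share, and treating a repeated `close()` as a no-op is an assumption about what the closed flag is for.

```java
// Hypothetical sketch of reference-counted sharing between readers; not Lucene code.
public class RefCountSketch {

    // Stand-in for a resource shared by several readers (assumed, for illustration).
    static class SharedSegment {
        private int refCount = 0;
        private boolean released = false;

        synchronized void incRef() {
            if (released) {
                throw new IllegalStateException("segment already released");
            }
            refCount++;
        }

        synchronized void decRef() {
            if (refCount <= 0) {
                throw new IllegalStateException("decRef without matching incRef");
            }
            refCount--;
            if (refCount == 0) {
                released = true; // last user gone: underlying files could be closed here
            }
        }

        synchronized boolean isReleased() {
            return released;
        }
    }

    static class Reader {
        private final SharedSegment segment;
        private boolean closed = false;

        Reader(SharedSegment segment) {
            this.segment = segment;
            segment.incRef(); // opening a reader takes one reference
        }

        // A reopened reader shares the same segment and takes its own reference.
        synchronized Reader reopen() {
            if (closed) {
                throw new IllegalStateException("reader is closed");
            }
            return new Reader(segment);
        }

        // Decrement once and mark closed, so a second close() does not decrement again.
        synchronized void close() {
            if (closed) {
                return;
            }
            closed = true;
            segment.decRef();
        }
    }

    public static void main(String[] args) {
        SharedSegment segment = new SharedSegment();
        Reader first = new Reader(segment);
        Reader second = first.reopen();
        first.close();
        first.close(); // no-op because of the closed flag
        System.out.println("released after first close: " + segment.isReleased());
        second.close();
        System.out.println("released after second close: " + segment.isReleased());
    }
}
```

The per-reader closed flag is what distinguishes a duplicate close on one reader from genuine closes on two readers that share a segment.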

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-10-18 Thread Michael Busch (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536044 ] Michael Busch commented on LUCENE-743: {quote} The implementation seems simple. When a reader is opened, it

[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-10-18 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536063 ] Michael McCandless commented on LUCENE-743: But if a reader is shared, how do you tell two real closes

Exposing API for scoring

2007-10-18 Thread Shailesh Kochhar
Hi, I'm experimenting with a few different scoring implementations and I was wondering what the easiest way would be to incorporate a new scorer into a searcher implementation. From reading the docs on Scoring at:

Re: [jira] Commented: (LUCENE-997) Add search timeout support to Lucene

2007-10-18 Thread Roy Ward
Sean Timm wrote: (2) Estimating the number of results snip Is there a test case that shows this breakage, or can you point me to the code in Hits.java that my patch causes problems with? Sorry, I'm not seeing it. In the case of no hits at all getting returned, the following code: