The 2GB segment size limit

2008-06-25 Thread Nadav Har'El
Hi, recently an index I've been building passed the 2 GB mark, and after I optimize()ed it into one segment over 2 GB, it stopped working. Apparently this is a known problem (on 32-bit JVMs), and it is mentioned in the FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ question Is there a way to limit

Re: Fwd: changing index format

2008-06-25 Thread Paul Elschot
On Wednesday 25 June 2008 07:03:59 John Wang wrote: Hi guys: Perhaps I should have posted this to this list in the first place. I am trying to work on a patch to expose, for each term, minDoc and maxDoc. These values can be retrieved while constructing the TermInfo. Knowing

Re: The 2GB segment size limit

2008-06-25 Thread Michael McCandless
Nadav Har'El wrote: "Recently an index I've been building passed the 2 GB mark, and after I optimize()ed it into one segment over 2 GB, it stopped working." Nadav, which platform did you hit this on? I think I've created 2 GB index on 32-bit WinXP just fine. How many platforms are really

Re: ReaderCommit

2008-06-25 Thread Michael McCandless
Jason Rutherglen wrote: For Ocean I created a workaround where the IndexCommits from IndexDeletionPolicy are saved in a map in order to achieve deleting based on the IndexReader. It would be more straightforward to delete from the IndexCommit in IndexReader. It seems like we are mixing

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Michael McCandless
Jason Rutherglen wrote: One of the bottlenecks I have noticed testing Ocean realtime search is the delete process which involves writing several files for each possibly single delete of a document in SegmentReader. The best way to handle the deletes is to simply keep them in memory

Re: changing index format

2008-06-25 Thread Michael McCandless
John Wang wrote: The problem I am having is stated below, I don't know how to add the minDoc and maxDoc values to the index while keeping backward compatibility. Unfortunately, TermInfo file format just isn't extensible at the moment, so I think for now you'll have to break

[jira] Commented: (LUCENE-1314) IndexReader.reopen(boolean force)

2008-06-25 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607954#action_12607954 ] Michael McCandless commented on LUCENE-1314: bq. In my SegmentReader subclass

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Jason Rutherglen
I understand what you are saying. I am not sure it is worth what is clearly quite a bit more work, given how easy it is to simply have more control over the IndexReader deletedDocs BitVector, which seems like a feature that should be in there anyway, perhaps even allowing SortedVIntList to be

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless wrote: We've also discussed at one point creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds, on first glance, because

[jira] Commented: (LUCENE-1314) IndexReader.reopen(boolean force)

2008-06-25 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608039#action_12608039 ] Jason Rutherglen commented on LUCENE-1314: -- Here is the code of the SegmentReader

Re: per-field similarity

2008-06-25 Thread Karl Wettin
+1 On 24 Jun 2008, at 22:28, Yonik Seeley wrote: Something to consider for Lucene 3 is to have something to retrieve Similarity per-field rather than passing the field name into some functions... benefits: - Would allow customizing most Similarity functions per-field - Performance: Similarity for
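
A minimal sketch of the per-field lookup idea proposed in this thread, under stated assumptions: `FieldSimilarity`, `PerFieldSimilarities`, `register` and the single `lengthNorm` hook are hypothetical names invented for illustration and are not Lucene's API. It only shows the shape of "retrieve the Similarity for a field once, then call it without passing the field name".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Lucene's Similarity; a single scoring hook is shown.
interface FieldSimilarity {
    float lengthNorm(int numTerms);
}

// Hypothetical provider that resolves a similarity by field name.
final class PerFieldSimilarities {
    private final Map<String, FieldSimilarity> byField = new HashMap<>();
    private final FieldSimilarity fallback;

    PerFieldSimilarities(FieldSimilarity fallback) {
        this.fallback = fallback;
    }

    // Associates a field with its own similarity; returns this for chaining.
    PerFieldSimilarities register(String fieldName, FieldSimilarity similarity) {
        byField.put(fieldName, similarity);
        return this;
    }

    // The field name is resolved once here instead of being passed to every scoring call.
    FieldSimilarity getSimilarity(String fieldName) {
        return byField.getOrDefault(fieldName, fallback);
    }
}

final class PerFieldSimilarityDemo {
    public static void main(String[] args) {
        PerFieldSimilarities sims = new PerFieldSimilarities(
                numTerms -> (float) (1.0 / Math.sqrt(numTerms))) // length-normalized fallback
                .register("title", numTerms -> 1.0f);            // ignore length for titles
        System.out.println(sims.getSimilarity("title").lengthNorm(4));
        System.out.println(sims.getSimilarity("body").lengthNorm(4));
    }
}
```

Because the caller holds a field-specific object, the per-call field-name comparison disappears, which is presumably the performance benefit the proposal alludes to.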

Is there a reason MemoryIndex does not implement Serializable?

2008-06-25 Thread Jason Rutherglen
It seems like it could, it even has serialVersionUID defined.

Re: Fwd: changing index format

2008-06-25 Thread John Wang
Thanks Paul and Mike for the feedback. Paul, for us, sparsity of the docIds determines which data structure to use. While cardinality gives some of that, min/max docId would also help. For example: say maxdoc=100, cardinality = 7, docids: {0,1,...6} or {3,4...9}, using arrayDocIdSet
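
To illustrate why min/max docId adds information beyond cardinality, here is a hedged sketch of one possible selection heuristic. `DocIdSetChooser`, its `Kind` values and the bit-count comparison are assumptions made up for this example; they are not the poster's implementation and not part of Lucene.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: picks a representation for a term's
// doc id set from its cardinality and its smallest and largest doc id.
final class DocIdSetChooser {
    enum Kind { INT_ARRAY, RANGE_BITSET }

    // Compares the estimated memory footprint of the two representations.
    static Kind choose(int cardinality, int minDoc, int maxDoc) {
        long arrayBits = 32L * cardinality;            // one int per doc id
        long bitsetBits = (long) maxDoc - minDoc + 1;  // one bit per id in [minDoc, maxDoc]
        return bitsetBits <= arrayBits ? Kind.RANGE_BITSET : Kind.INT_ARRAY;
    }

    // Builds a bitset offset by the smallest id, so bit i stands for doc id minDoc + i.
    static BitSet toRangeBitSet(int[] sortedDocIds) {
        int minDoc = sortedDocIds[0];
        int maxDoc = sortedDocIds[sortedDocIds.length - 1];
        BitSet bits = new BitSet(maxDoc - minDoc + 1);
        for (int doc : sortedDocIds) {
            bits.set(doc - minDoc);
        }
        return bits;
    }
}

final class DocIdSetChooserDemo {
    public static void main(String[] args) {
        // Seven ids packed into [3, 9]: a 7-bit range beats seven ints.
        System.out.println(DocIdSetChooser.choose(7, 3, 9));
        // Seven ids spread over a million documents: the int array is smaller.
        System.out.println(DocIdSetChooser.choose(7, 0, 1_000_000));
    }
}
```

With cardinality alone, both calls look identical (7 documents); only the min/max range distinguishes the dense case from the sparse one.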

Re: Is there a reason MemoryIndex does not implement Serializable?

2008-06-25 Thread Erik Hatcher
No reason, done! Erik On Jun 25, 2008, at 11:05 AM, Jason Rutherglen wrote: It seems like it could, it even has serialVersionUID defined.

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Jason Rutherglen
I read other parts of the email but glanced over this part. Would terms be automatically sorted as they came in? If implemented it would be nice to be able to get an encoded representation (probably byte array) of the document and postings which could be written to a log, and then reentered in

[jira] Created: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Todd Feak (JIRA)
Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer Key: LUCENE-1316 URL: https://issues.apache.org/jira/browse/LUCENE-1316 Project: Lucene - Java

[jira] Updated: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Todd Feak (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Feak updated LUCENE-1316: -- Further investigation indicates that the ValueSourceQuery$ValueSourceScorer may suffer from the same

Re: Fwd: changing index format

2008-06-25 Thread John Wang
Hi Paul: Regarding your comment on adding required/prohibited to BooleanQuery: Based on the new API on DocIdSet and DocIdSetIterator abstractions, we also developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, and furthermore a DocIdSetQuery class that honors the Query API

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 11:30 AM, Jason Rutherglen wrote: I read other parts of the email but glanced over this part. Would terms be automatically sorted as they came in? If implemented it would be nice to be able to get an encoded representation (probably byte array) of the

BooleanQuery and DocIdSet; Was: Fwd: changing index format

2008-06-25 Thread Paul Elschot
On Wednesday 25 June 2008 18:45:16 John Wang wrote: Hi Paul: Regarding your comment on adding required/prohibited to BooleanQuery: Based on the new API on DocIdSet and DocIdSetIterator abstractions, we also developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet,

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608128#action_12608128 ] Yonik Seeley commented on LUCENE-1316: -- Although this doesn't solve the general

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608129#action_12608129 ] Hoss Man commented on LUCENE-1316: -- rather than attempting localized optimizations of

[jira] Updated: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Todd Feak (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Feak updated LUCENE-1316: -- I like Hoss' suggestion better. I'll try that fix locally and if it provides the same improvement, I

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608134#action_12608134 ] Yonik Seeley commented on LUCENE-1316: -- a more generalized improvement would

Re: per-field similarity

2008-06-25 Thread Chris Hostetter
: Might also consider passing in more optional context when retrieving : the similarity for a field (such as a Query, if searching). : Something like Similarity.getSimilarity(String field, Query q). i assume you mean Searcher.getSimilarity(String fieldName, Query q) to replace the current

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608137#action_12608137 ] Hoss Man commented on LUCENE-1316: -- bq. Code that depended on deletes being instantly

Re: per-field similarity

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 2:19 PM, Chris Hostetter wrote: : Might also consider passing in more optional context when retrieving : the similarity for a field (such as a Query, if searching). : Something like Similarity.getSimilarity(String field, Query q). i assume you mean

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread robert engels (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608146#action_12608146 ] robert engels commented on LUCENE-1316: --- According to the java memory model,

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608147#action_12608147 ] Yonik Seeley commented on LUCENE-1316: -- bq. why would deletes stop being instantly

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread robert engels (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608149#action_12608149 ] robert engels commented on LUCENE-1316: --- The Pattern#5 referenced (cheap read-write

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608160#action_12608160 ] Yonik Seeley commented on LUCENE-1316: -- bq. declaring the deletedDocs volatile should
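
For readers following the volatile discussion, a rough sketch of the general idea, assuming the Java 5+ memory model: readers take one volatile read of an immutable snapshot instead of acquiring a lock per document. `SnapshotReader` is a hypothetical class, not Lucene's SegmentReader, and the copy-on-write delete used here is a technique chosen for this illustration; it is not claimed to be the fix discussed or adopted in LUCENE-1316.

```java
import java.util.BitSet;

// Hypothetical reader, not Lucene's SegmentReader.
final class SnapshotReader {
    // Replaced wholesale on every delete; a published BitSet is never mutated again.
    private volatile BitSet deletedDocs = new BitSet();
    private final int maxDoc;

    SnapshotReader(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    // Copy-on-write delete: writers serialize among themselves, readers never lock.
    synchronized void delete(int doc) {
        if (doc < 0 || doc >= maxDoc) {
            throw new IllegalArgumentException("doc out of range: " + doc);
        }
        BitSet copy = (BitSet) deletedDocs.clone();
        copy.set(doc);
        deletedDocs = copy; // the volatile write publishes the new snapshot
    }

    // A match-all style scan: one volatile read up front, then lock-free bit tests.
    int countLiveDocs() {
        BitSet snapshot = deletedDocs;
        int live = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!snapshot.get(doc)) {
                live++;
            }
        }
        return live;
    }
}

final class SnapshotReaderDemo {
    public static void main(String[] args) {
        SnapshotReader reader = new SnapshotReader(10);
        reader.delete(3);
        reader.delete(7);
        System.out.println(reader.countLiveDocs());
    }
}
```

The trade-off is that each delete copies the whole bitset, so this shape only suits read-heavy workloads; a scan that has already taken its snapshot also will not see deletes made afterwards, which is the visibility question the comments in this issue go back and forth on.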

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608162#action_12608162 ] Mark Miller commented on LUCENE-1316: - If I remember correctly, volatile does not work

[jira] Issue Comment Edited: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608162#action_12608162 ] Mark Miller edited comment on LUCENE-1316 at 6/25/08 12:40 PM:

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Hoss Man (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608183#action_12608183 ] Hoss Man commented on LUCENE-1316: -- bq. if thread A deleted a document, and then thread B

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread robert engels (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608187#action_12608187 ] robert engels commented on LUCENE-1316: --- Hoss, that is indeed the case, another

Re: BooleanQuery and DocIdSet; Was: Fwd: changing index format

2008-06-25 Thread John Wang
I am not sure. BooleanQuery takes something that can score, e.g. a Clause or a Query; the contract requires some sort of scoring functionality. We use DocIdSetQuery for some of the scoring capabilities such as constant score (with boosting), age decay, and using the new scoring API in 2.3.

[jira] Commented: (LUCENE-1316) Avoidable synchronization bottleneck in MatchAlldocsQuery$MatchAllScorer

2008-06-25 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608189#action_12608189 ] Yonik Seeley commented on LUCENE-1316: -- bq. is your point that without

Re: How to do a query using less than or greater than

2008-06-25 Thread Chris Hostetter
: and how to use them? For a concrete example I'm looking to do a query : on a date field to find documents earlier than a specified date or : later than a specified date. Ex: date:(< 20070101) or date: : (> 20070101). I looked at the range query feature but it didn't appear : to cover this

Re: per-field similarity

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 5:06 PM, Chris Hostetter wrote: Hmmm... that seems like it would be confusing: particularly since in the IndexWriter case the Query param would never make sense. changing IndexWriter.getSimilarity to take a String fieldName and changing

Re: per-field similarity

2008-06-25 Thread Mike Klaas
On 24-Jun-08, at 1:28 PM, Yonik Seeley wrote: Something to consider for Lucene 3 is to have something to retrieve Similarity per-field rather than passing the field name into some functions... +1 I've felt that this was the proper (and more useful) way to do things for a long time

Re: How to do a query using less than or greater than

2008-06-25 Thread Kyle Miller
Chris, That's exactly what I was looking for. Thanks for the info and the clarification on where to post my questions. Regards, Kyle On Wed, Jun 25, 2008 at 5:12 PM, Chris Hostetter wrote: : and how to use them? For a concrete example I'm looking to do a query : on a