[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929629#action_12929629
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I'm running test-core multiple times and am seeing some lurking test
failures (thanks to the recently added randomized tests). I'm guessing
they're related to the sync blocks on IW and DW not being coordinated
some of the time.

I will clean up the patch so that others may properly review it and
hopefully we can figure out what's going on. 

 Improve how IndexWriter flushes deletes against existing segments
 -

 Key: LUCENE-2680
 URL: https://issues.apache.org/jira/browse/LUCENE-2680
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
 LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, 
 LUCENE-2680.patch


 IndexWriter buffers up all deletes (by Term and Query) and only
 applies them if 1) commit or NRT getReader() is called, or 2) a merge
 is about to kickoff.
 We do this because, for a large index, it's very costly to open a
 SegmentReader for every segment in the index.  So we defer as long as
 we can.  We do it just before merge so that the merge can eliminate
 the deleted docs.
 But, most merges are small, yet in a big index we apply deletes to all
 of the segments, which is really very wasteful.
 Instead, we should only apply the buffered deletes to the segments
 that are about to be merged, and keep the buffer around for the
 remaining segments.
 I think it's not so hard to do; we'd have to have generations of
 pending deletions, because the newly merged segment doesn't need the
 same buffered deletions applied again.  So every time a merge kicks
 off, we pinch off the current set of buffered deletions, open a new
 set (the next generation), and record which segment was created as of
 which generation.
 This should be a very sizable gain for large indices that mix
 deletes, though, less so in flex since opening the terms index is much
 faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Here's a cleaned-up patch; please take a look. I ran 'ant test-core' 5 times
with no failures; however, running the commands below several times does
eventually produce a failure.

ant test-core -Dtestcase=TestThreadedOptimize -Dtestmethod=testThreadedOptimize -Dtests.seed=1547315783637080859:5267275843141383546

ant test-core -Dtestcase=TestIndexWriterMergePolicy -Dtestmethod=testMaxBufferedDocsChange -Dtests.seed=7382971652679988823:-6672235304390823521




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929810#action_12929810
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

The problem could be that IW deleteDocument is not synced on IW; when I
tried adding the sync, there was a deadlock, perhaps from DW waitReady.
We could be adding pending deletes to segments that are not quite
current, because we're not adding them inside an IW sync block.
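
To make the failure mode above concrete, here is a minimal sketch (not the actual patch; names such as bufferDelete and pendingTermsPerSegment are hypothetical) of buffering a delete against the current segments while holding a single monitor, so the segment list cannot change between reading it and attaching the delete:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: all mutation happens under one lock, so the snapshot of
// segment names and the per-segment delete buffers stay consistent.
class PendingDeletesSketch {
  private final Object writerLock = new Object();            // stands in for the IW monitor
  private final List<String> segmentNames = new ArrayList<String>();
  private final Map<String, Set<String>> pendingTermsPerSegment =
      new HashMap<String, Set<String>>();

  void bufferDelete(String termText) {
    synchronized (writerLock) {                               // same lock that guards segmentNames
      for (String segment : segmentNames) {
        Set<String> terms = pendingTermsPerSegment.get(segment);
        if (terms == null) {
          terms = new HashSet<String>();
          pendingTermsPerSegment.put(segment, terms);
        }
        terms.add(termText);
      }
    }
  }
}
{code}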




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929927#action_12929927
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

OK, TestThreadedOptimize works when the DW-synchronized pushSegmentInfos method
isn't called anymore (no extra per-segment deleting is going on), and stops
working when pushSegmentInfos is turned back on. Something about the sync
on DW is causing a problem. Hmm... we need another way to pass segment
infos around consistently.
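
One possible alternative to a DW-synchronized pushSegmentInfos, sketched below with hypothetical names (publish/snapshot), is to have the writer publish an immutable snapshot of the segment list under its own lock and let DW read it without ever taking the writer's monitor:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: the writer publishes an immutable copy; readers never block on it.
class SegmentSnapshotSketch {
  private volatile List<String> currentSegments = Collections.emptyList();

  // called by the writer, under its own lock, whenever the segment list changes
  void publish(List<String> segmentNames) {
    currentSegments = Collections.unmodifiableList(new ArrayList<String>(segmentNames));
  }

  // called by the doc writer; no cross-lock needed
  List<String> snapshot() {
    return currentSegments;
  }
}
{code}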




[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

I placed (for now) the segment deletes directly into the segment info object.
There are applied term/query sets which are checked against when applyDeletesAll
is called. All tests pass except for TestTransactions and
TestPersistentSnapshotDeletionPolicy, and those fail only because of an assertion
I added checking that the last segment info is in fact in the newly pushed segment
infos. I think in both cases the segment infos are being altered in IW in a place
where they aren't being pushed yet. I wanted to checkpoint this anyway, as it's
working fairly well at this point, including the last segment info/index, which
can be turned on or off via a static variable.
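
A rough sketch of the structure described above, with hypothetical names (the patch hangs this off the segment info object): each segment carries its pending deletes plus an "applied" set that applyDeletesAll can consult to skip work a merge already did.

{code}
import java.util.HashSet;
import java.util.Set;

// Sketch only: per-segment pending vs. already-applied terms.
class SegmentDeletesSketch {
  final Set<String> pendingTerms = new HashSet<String>();
  final Set<String> appliedTerms = new HashSet<String>();

  // applyDeletesAll would ask this before deleting by term in this segment
  boolean shouldApply(String term) {
    return !appliedTerms.contains(term);
  }

  void markApplied(String term) {
    pendingTerms.remove(term);
    appliedTerms.add(term);
  }
}
{code}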




[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Everything passes except for tests that involve IW rollback. We need to be
able to roll back the last segment info/index in DW; however, I'm not sure how
we want to do that quite yet.




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929424#action_12929424
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

In DW abort (called by IW rollbackInternal) we should be able to simply clear
all per-segment pending deletes. However, I'm not sure we can do that: if we
have already applied deletes for a merge and then we roll back, we can't undo
those deletes, thereby breaking our current rollback model?




[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Here's an uncleaned-up cut with all tests passing. I nulled out the
lastSegmentInfo on abort, which fixes my own assertion that was causing
the rollback tests to fail. I don't know yet whether this is cheating
just to get the tests to pass.




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929229#action_12929229
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Pushing the segment infos seems to have cleared up some of the failing tests;
however, the failure below still shows up intermittently (about 1/4 of the time).

I'm going to re-add lastSegmentInfo/Index and assert that, when we're not using
it, the deletes obtained from the segmentinfo -> deletes map are the same.

{code}
[junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
[junit] Testcase: testRandom(org.apache.lucene.index.TestStressIndexing2):  
FAILED
[junit] expected:12 but was:11
[junit] junit.framework.AssertionFailedError: expected:12 but was:11
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:278)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:271)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.testRandom(TestStressIndexing2.java:89)
{code}




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929247#action_12929247
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I wasn't coalescing the merged segments' deletes; with that implemented,
TestStressIndexing2 ran successfully 49 of 50 times. Below is the remaining failure:

{code}
[junit] Testsuite: org.apache.lucene.index.TestStressIndexing2
[junit] Testcase: 
testMultiConfig(org.apache.lucene.index.TestStressIndexing2): FAILED
[junit] expected:5 but was:4
[junit] junit.framework.AssertionFailedError: expected:5 but was:4
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:278)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:271)
[junit] at 
org.apache.lucene.index.TestStressIndexing2.testMultiConfig(TestStressIndexing2.java:115)
{code}
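
For reference, the coalescing step mentioned above could look roughly like the sketch below (hypothetical names, using per-segment term sets indexed in parallel with the segment list): the deletes of the segments being merged away are folded into the segment just before the merge, so they are not lost for the remaining segments.

{code}
import java.util.List;
import java.util.Set;

// Sketch only: fold deletes of merged-away segments [start, end) into the
// preceding segment's deletes, then drop their slots to mirror SegmentInfos.
class CoalesceSketch {
  static void coalesce(List<Set<String>> deletesPerSegment, int start, int end) {
    if (start > 0) {
      Set<String> target = deletesPerSegment.get(start - 1);
      for (int i = start; i < end; i++) {
        target.addAll(deletesPerSegment.get(i));
      }
    }
    // if start == 0 there is no preceding segment; the real patch would have
    // to park these deletes elsewhere (e.g. with the newly merged segment)
    deletesPerSegment.subList(start, end).clear();
  }
}
{code}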




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929254#action_12929254
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Putting a synchronized-on-DW block around the bulk of the segment alterations in IW
commitMerge seems to have quelled the TestStressIndexing2 test failures.  Nice.
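
Illustratively (hypothetical names, not the patch itself), the idea is that the segment-list rewrite in commitMerge and the per-segment delete bookkeeping happen under the same monitor, so a concurrent delete can never observe one without the other:

{code}
// Sketch only: both updates are made atomically with respect to delete calls
// that synchronize on the same lock.
class CommitMergeSketch {
  private final Object docWriterLock = new Object();   // stands in for the DW monitor

  void commitMergeChanges(Runnable rewriteSegmentInfos, Runnable remapSegmentDeletes) {
    synchronized (docWriterLock) {
      rewriteSegmentInfos.run();   // replace the merged segments with the new one
      remapSegmentDeletes.run();   // coalesce/remove their pending deletes
    }
  }
}
{code}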




[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-06 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Here's a checkpoint patch before I re-add lastSegmentInfo/Index. All tests
pass except for what's below. I'm guessing segments with all docs deleted are
being dropped before the test expects.

{code}
[junit] Testcase: 
testCommitThreadSafety(org.apache.lucene.index.TestIndexWriter):  FAILED
[junit] 
[junit] junit.framework.AssertionFailedError: 
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety(TestIndexWriter.java:4699)
[junit] 
[junit] 
[junit] Testcase: 
testCommitThreadSafety(org.apache.lucene.index.TestIndexWriter):  FAILED
[junit] Some threads threw uncaught exceptions!
[junit] junit.framework.AssertionFailedError: Some threads threw uncaught 
exceptions!
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844)
[junit] at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437)
[junit] 
[junit] 
[junit] Tests run: 116, Failures: 2, Errors: 0, Time elapsed: 159.577 sec
[junit] 
[junit] - Standard Output ---
[junit] NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter 
-Dtestmethod=testCommitThreadSafety 
-Dtests.seed=1826133140332330367:810264330792545
[junit] NOTE: test params are: codec=MockFixedIntBlock(blockSize=564), 
locale=es_CR, timezone=Asia/Urumqi
[junit] -  ---
[junit] - Standard Error -
[junit] The following exceptions were thrown by threads:
[junit] *** Thread: Thread-1106 ***
[junit] java.lang.RuntimeException: java.lang.AssertionError: term=f:0_8; 
r=DirectoryReader(_0:c1  _1:c1  _2:c1  _3:c1  _4:c1  _5:c1  _6:c1  _7:c2  _8:c4 
) expected:1 but was:0
[junit] at 
org.apache.lucene.index.TestIndexWriter$9.run(TestIndexWriter.java:4690)
[junit] Caused by: java.lang.AssertionError: term=f:0_8; 
r=DirectoryReader(_0:c1  _1:c1  _2:c1  _3:c1  _4:c1  _5:c1  _6:c1  _7:c2  _8:c4 
) expected:1 but was:0
[junit] at org.junit.Assert.fail(Assert.java:91)
[junit] at org.junit.Assert.failNotEquals(Assert.java:645)
[junit] at org.junit.Assert.assertEquals(Assert.java:126)
[junit] at org.junit.Assert.assertEquals(Assert.java:470)
[junit] at 
org.apache.lucene.index.TestIndexWriter$9.run(TestIndexWriter.java:4684)
[junit] NOTE: all tests run in this JVM:
[junit] [TestMockAnalyzer, TestByteSlices, TestFilterIndexReader, 
TestIndexFileDeleter, TestIndexReaderClone, TestIndexReaderReopen, 
TestIndexWriter]
{code}




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928933#action_12928933
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Sorry, spoke too soon: I made a small change to avoid redundant deletes in
applyDeletesAll, and TestStressIndexing2 is breaking. I think we need to
push segment infos changes to DW as they happen. I'm guessing that segment
infos are being shuffled around, so the infos passed into DW in the IW deleteDoc
methods may be out of date by the time deletes are attached to segments.
Hopefully there aren't any lurking deadlock issues with this.
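
One way to make the staleness visible, sketched below with hypothetical names (infosGen), is to version the segment infos: the writer bumps a generation on every change, and a delete buffered against an older generation gets re-routed through the writer's current view rather than the stale one.

{code}
// Sketch only: a generation counter that lets a delete call detect that the
// segment infos it captured are no longer current.
class InfosGenerationSketch {
  private long infosGen;                                  // guarded by this

  synchronized long currentGen()      { return infosGen; }
  synchronized void onInfosChanged()  { infosGen++; }     // called whenever SegmentInfos changes

  synchronized boolean isStale(long genSeenByDeleteCall) {
    return genSeenByDeleteCall != infosGen;
  }
}
{code}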




[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-04 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Here's a nice little checkpoint with more tests passing.  

* A last known segment is recorded, which is the last segment seen when
adding a delete term/query per segment. This is for an applyDeletesAll
check to ensure a given term/query has not already been applied to a
segment. If a term/query exists in the per-segment deletes and is in
deletesFlushed, we delete; once we're beyond the last known segment, we
simply delete (adhering, of course, to the docid-upto). A sketch of this
check follows this list.

* In the interest of accuracy I nixed lastSegmentIndex in favor of
lastSegmentInfo which is easier for debugging and implementation when
segments are shuffled around and/or removed/added. There's not too much of
a penalty in terms of performance. 

* org.apache.lucene.index tests pass

* I still need to address applying deletes only to readers within the
docid-upto per term/query; perhaps that's best left to a different JIRA
issue.

* Still not committable as it needs cleaning up, complete unit tests, who
knows what else.
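
The check from the first bullet could reduce to something like the following sketch (hypothetical names; the real logic also has to honor the docid-upto when it does delete):

{code}
// Sketch only: decide whether a buffered term should be applied to a segment.
class ApplyAllCheckSketch {
  static boolean shouldDelete(int segmentIndex,
                              int lastKnownSegmentIndex,
                              boolean termPendingForSegment,
                              boolean termInDeletesFlushed) {
    if (segmentIndex > lastKnownSegmentIndex) {
      return true;   // past the last known segment: just delete (docid-upto still applies)
    }
    return termPendingForSegment && termInDeletesFlushed;
  }
}
{code}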




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927943#action_12927943
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

I'm redoing things a bit to take into account the concurrency of merges. For
example, if a merge fails, we must not have already removed the deletes that
were still to be applied to those segments. Probably the trickiest part is that
lastSegmentIndex could have changed since the merge started, which means we
need to be careful about how and which deletes we coalesce.




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927979#action_12927979
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

Another use case that can be tricky is when commit is called and a merge
finishes just before or after it. In that case all (point-in-time) deletes will
have been applied by commit; however, do we want to clear all per-segment
deletes at the end of commit? That would blank out deletes being applied by the
merge, most of which should be cleared out; but if new deletes arrived during
the commit (is this possible?), then we want those to be attached to segments
and not lost. I guess we want a DW-synchronized clearing of deletes in the
applyDeletesAll method. ADA will apply those deletes; any incoming ones will
queue up and be shuffled around.
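
A DW-synchronized clear could work roughly like the swap below (hypothetical names): applyDeletesAll atomically takes the current pending buffer and installs a fresh one, so deletes that arrive during the commit are queued rather than wiped out.

{code}
import java.util.ArrayList;
import java.util.List;

// Sketch only: take-and-replace under one lock; new arrivals land in the fresh buffer.
class ApplyAndClearSketch {
  private List<String> pending = new ArrayList<String>();

  synchronized void bufferDelete(String term) {
    pending.add(term);
  }

  synchronized List<String> takePendingForApply() {
    List<String> toApply = pending;
    pending = new ArrayList<String>();
    return toApply;
  }
}
{code}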




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928075#action_12928075
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

There's an issue in that we're redundantly applying deletes in the
applyDeletesAll case, because the deletes may already have been applied to a
segment when a merge happened, i.e., by applyDeletesToSegments. In the ADA case
we need to use applyDeletesToSegments up to the segment point from which the
buffered deletes can be used.




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928078#action_12928078
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

This brings up another issue: we're blindly iterating over docs in a segment
reader to delete, even when we can know ahead of time (from the reader's max
doc) that the reader's docs are going to exceed the term/query's docid-upto. In
applyDeletes we're opening a term docs iterator, though I think we break at the
first doc and move on if the docid-upto is exceeded. Opening that term docs
iterator can be skipped.
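
The short-circuit could be as simple as the check below (a sketch under the assumption that each segment's starting doc id relative to the buffered-deletes stream is known): if the segment's first doc id is already at or past the delete's docid-upto, no doc in it can match, so the term docs iterator is never opened.

{code}
// Sketch only: skip a segment entirely when the delete's docid-upto cannot
// reach any of its documents.
class DocIdUptoSkipSketch {
  static boolean canSkipSegment(int segmentBaseDocId, int deleteDocIdUpto) {
    return segmentBaseDocId >= deleteDocIdUpto;
  }
}
{code}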




[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-02 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927488#action_12927488
 ] 

Jason Rutherglen commented on LUCENE-2680:
--

bq. The most recent one wins, and we should do only one delete (per segment) 
for that term.

How should we define this recency and why does it matter?  Should it be per 
term/query or for the entire BD?

I think there's an issue with keeping lastSegmentIndex in DW: while it's easy
to maintain, Mike had mentioned keeping lastSegmentIndex per BufferedDeletes
object. Coalescing the BDs after a successful merge should be easier to
maintain than keeping a separate BD for them. We'll see.

I'll put together another patch with these changes.




[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-02 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

Here's a new patch with properly working last segment index.  

The trunk version of apply deletes has become applyDeletesAll and is 
functionally unchanged.

There's a new method, DW applyDeletesToSegments, called by _mergeInit for
segments that are about to be merged. The deleted terms and queries for these
segments are kept in hash sets because docid-uptos are not needed.

Like the last patch, DW maintains the last segment index. There's no need to
maintain the last segment index per BD; instead I think it's only useful per DW,
and on trunk we only have one DW in use at a time.

On successful merge, the last segment index is set to the segment index
previous to the start segment of the merge. The merged segments' deletes are
coalesced into the deletes of the segment at startIndex-1.
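
A compressed sketch of the two bookkeeping pieces described above (hypothetical names): merge-time deletes are plain sets because no docid-upto is needed for already-flushed segments, and a successful merge moves the last segment index back to just before the merge's start segment.

{code}
import java.util.HashSet;
import java.util.Set;

// Sketch only: merge-time delete bookkeeping.
class MergeDeletesSketch {
  final Set<String> termsForMergingSegments = new HashSet<String>();  // no docid-upto needed
  int lastSegmentIndex;

  void onMergeInit(Iterable<String> bufferedTerms) {
    for (String term : bufferedTerms) {
      termsForMergingSegments.add(term);       // set semantics: duplicates collapse
    }
  }

  void onMergeSuccess(int mergeStartIndex) {
    lastSegmentIndex = mergeStartIndex - 1;    // deletes coalesce into this segment's buffer
  }
}
{code}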





[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-11-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927031#action_12927031
 ] 

Jason Rutherglen commented on LUCENE-2729:
--

Using Solr 1.4.2, on disk-full, .del files were being written with a file length
of zero; however, that is supposed to be fixed by
https://issues.apache.org/jira/browse/LUCENE-2593. This doesn't appear to be
the same issue, because more than just the .del files are of zero length.

 Index corruption after 'read past EOF' under heavy update load and snapshot 
 export
 --

 Key: LUCENE-2729
 URL: https://issues.apache.org/jira/browse/LUCENE-2729
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1, 3.0.2
 Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
 Integrated with zoie (using a zoie snapshot from 2010-08-06: 
 zoie-2.0.0-snapshot-20100806.jar).
Reporter: Nico Krijnen

 We have a system running lucene and zoie. We use lucene as a content store 
 for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
 backups of the index. This works fine for small indexes and when there are 
 not a lot of changes to the index when the backup is made.
 On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
 is being changed a lot (lots of document additions and/or deletions), we 
 almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
 obtain timed out'.
 At that point we get lots of 0 kb files in the index, data gets lost, and the 
 index is unusable.
 When we stop our server, remove the 0kb files and restart our server, the 
 index is operational again, but data has been lost.
 I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
 Hopefully someone has some ideas where to look to fix this.
 Some more details...
 Stack trace of the read past EOF and following Lock obtain timed out:
 {code}
 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
 ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
 java.io.IOException: read past EOF
 at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
 at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
 at 
 org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
 at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
 at 
 org.apache.lucene.index.IndexFileDeleter.init(IndexFileDeleter.java:166)
 at 
 org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
 at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
 at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
 at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
 at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
 at 
 proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
 at 
 proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
 at 
 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
 at 
 proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
 at 
 proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
 at 
 proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
 ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
 Problem copying segments: Lock obtain timed out: 
 org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
 org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
 org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:84)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:957)
 at 
 proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
 at 
 proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
 at 
 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
 at 
 proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
 at 
 

[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments

2010-11-01 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2680:
-

Attachment: LUCENE-2680.patch

The general approach is to reuse BufferedDeletes, though placing them into a 
segment-info-keyed map for those segments generated after lastSegmentIndex, as 
per what has been discussed here 
https://issues.apache.org/jira/browse/LUCENE-2655?focusedCommentId=12922894page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12922894
 and below.

* lastSegmentIndex is added to IW

* DW segmentDeletes is a map of segment info -> buffered deletes (a rough 
sketch of this is included below).  In the apply-deletes method, buffered 
deletes are pulled for a given segment info if they exist; otherwise they're 
taken from deletesFlushedLastSeg.

* I'm not entirely sure what pushDeletes should do now, probably the same thing 
as currently; only the name should change slightly, in that it's pushing deletes 
only for the RAM-buffer docs.

* There need to be tests to ensure the docid-upto logic is working correctly

* I'm not sure what to do with DW hasDeletes (its usage is commented out)

* Do there need to be separate deletes for the RAM buffer vis-à-vis the (0 - 
lastSegmentIndex) deletes?

* The memory accounting'll now get interesting as we'll need to track the RAM 
usage of terms/queries across multiple maps.  

* In commitMerge, DW verifySegmentDeletes removes the unused info -> deletes entries

* testDeletes deletes a doc in segment 1, then merges segments 1 and 2.  We 
then test to ensure the deletes were in fact applied only to segments 1 and 2.

* testInitLastSegmentIndex ensures that on IW init, the lastSegmentIndex is in 
fact set
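
A very rough sketch of the segment-keyed delete buffering described above (all 
class and method names here are hypothetical stand-ins, not the actual patch; 
plain strings stand in for the buffered Term/Query deletes):

{code}
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Stand-ins for SegmentInfo and the buffered deletes of one generation.
class SegmentInfoStub {
  final String name;
  SegmentInfoStub(String name) { this.name = name; }
}

class BufferedDeletesStub {
  final Set<String> terms = new HashSet<String>();
}

class SegmentDeletesSketch {
  // segment info -> deletes buffered after that segment was written
  private final Map<SegmentInfoStub, BufferedDeletesStub> segmentDeletes =
      new HashMap<SegmentInfoStub, BufferedDeletesStub>();
  // deletes for segments at or below lastSegmentIndex
  private final BufferedDeletesStub deletesFlushedLastSeg = new BufferedDeletesStub();

  void bufferDelete(SegmentInfoStub info, String term) {
    BufferedDeletesStub deletes = segmentDeletes.get(info);
    if (deletes == null) {
      deletes = new BufferedDeletesStub();
      segmentDeletes.put(info, deletes);
    }
    deletes.terms.add(term);
  }

  // apply deletes: use the per-segment buffer if present, else the last-seg buffer
  Set<String> deletesFor(SegmentInfoStub info) {
    BufferedDeletesStub deletes = segmentDeletes.get(info);
    return deletes != null ? deletes.terms : deletesFlushedLastSeg.terms;
  }

  // commitMerge: drop entries whose segments no longer exist (verifySegmentDeletes)
  void verifySegmentDeletes(Collection<SegmentInfoStub> liveSegments) {
    segmentDeletes.keySet().retainAll(liveSegments);
  }
}
{code}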


 Improve how IndexWriter flushes deletes against existing segments
 -

 Key: LUCENE-2680
 URL: https://issues.apache.org/jira/browse/LUCENE-2680
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2680.patch


 IndexWriter buffers up all deletes (by Term and Query) and only
 applies them if 1) commit or NRT getReader() is called, or 2) a merge
 is about to kickoff.
 We do this because, for a large index, it's very costly to open a
 SegmentReader for every segment in the index.  So we defer as long as
 we can.  We do it just before merge so that the merge can eliminate
 the deleted docs.
 But, most merges are small, yet in a big index we apply deletes to all
 of the segments, which is really very wasteful.
 Instead, we should only apply the buffered deletes to the segments
 that are about to be merged, and keep the buffer around for the
 remaining segments.
 I think it's not so hard to do; we'd have to have generations of
 pending deletions, because the newly merged segment doesn't need the
 same buffered deletions applied again.  So every time a merge kicks
 off, we pinch off the current set of buffered deletions, open a new
 set (the next generation), and record which segment was created as of
 which generation.
 This should be a very sizable gain for large indices that mix
 deletes, though, less so in flex since opening the terms index is much
 faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export

2010-10-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12926360#action_12926360
 ] 

Jason Rutherglen commented on LUCENE-2729:
--

Post a listing of the index files with their lengths, ie, ls -la.  

 Index corruption after 'read past EOF' under heavy update load and snapshot 
 export
 --

 Key: LUCENE-2729
 URL: https://issues.apache.org/jira/browse/LUCENE-2729
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.1, 3.0.2
 Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
 Integrated with zoie (using a zoie snapshot from 2010-08-06: 
 zoie-2.0.0-snapshot-20100806.jar).
Reporter: Nico Krijnen

 We have a system running lucene and zoie. We use lucene as a content store 
 for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
 backups of the index. This works fine for small indexes and when there are 
 not a lot of changes to the index when the backup is made.
 On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
 is being changed a lot (lots of document additions and/or deletions), we 
 almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
 obtain timed out'.
 At that point we get lots of 0 kb files in the index, data gets lost, and the 
 index is unusable.
 When we stop our server, remove the 0kb files and restart our server, the 
 index is operational again, but data has been lost.
 I'm not sure if this is a zoie or a lucene issue, so i'm posting it to both. 
 Hopefully someone has some ideas where to look to fix this.
 Some more details...
 Stack trace of the read past EOF and following Lock obtain timed out:
 {code}
 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
 ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
 java.io.IOException: read past EOF
 at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
 at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
 at 
 org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
 at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
 at 
 org.apache.lucene.index.IndexFileDeleter.init(IndexFileDeleter.java:166)
 at 
 org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
 at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
 at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
 at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
 at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
 at 
 proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
 at 
 proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
 at 
 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
 at 
 proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
 at 
 proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
 at 
 proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
 ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem 
 copying segments: Lock obtain timed out: 
 org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
 org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
 org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:84)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:957)
 at 
 proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
 at 
 proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
 at 
 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
 at 
 proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
 at 
 proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
 at 
 proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
 {code}
 We get exactly the same behaviour on both OS X and on 

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-10-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12922597#action_12922597
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

FreqProxTermsWriterPerField writes the prox posting data as terms are seen; 
however, for the freq data we wait until we're on to the next doc (to accurately 
record the doc freq) and have seen a previously analyzed term before writing to 
the freq stream.  Because the last-doc-code array in the posting array should 
not be copied per reader, when a document is finished we need to flush the 
freq info out for each term seen in that doc.  This way, on a reader-instigated 
flush, the reader can always read all necessary posting data from the byte 
slices, and not rely partially on the posting array.  I don't think this will 
affect indexing performance.

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312.patch


 In order to offer user's near realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Todays Lucene based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-10-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921784#action_12921784
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

The issue with the model given is that the posting-upto is
handed to the byte slice reader as the end index. However, newly
written bytes may not actually make it to a reader thread as per
the JMM; a reader thread may reach partially written bytes.
There doesn't seem to be a way to tell the reader it's reached
the end of the written bytes, so we probably need to add 2 paged
ints arrays, for the freq and prox uptos respectively. This
would be unfortunate because the paged ints will need to be
updated either during the getReader call or during indexing.
Both could be detrimental to performance, though the net is
still faster than the current NRT solution. The alternative is
to simply copy-on-write the byte blocks, though that'd need to
include the int blocks as well. I think we'd want to update the
paged ints during indexing; otherwise discount it as a solution,
because it'd require full array iterations in the getReader call
to compare and update. The advantage of copy-on-write of the
blocks is that the indexing speed will not be affected, nor the
read speed; the main potential performance drag could be the
garbage generated by the byte and int arrays thrown away on
reader close. It would depend on how many blocks were updated
in between getReader calls. 

We probably need to implement both solutions, try them out and
measure the performance difference. 

There's Michael B.'s multiple slice levels linked together
by atomic int arrays illustrated here:
http://www.box.net/shared/hivdg1hge9 

After reading this, the main idea I think we can use is,
instead of using paged ints, to simply maintain 2 upto arrays:
one that's being written to, and a 2nd that's guaranteed to be
in sync with the byte blocks. This would save on garbage and
lookups into paged ints. The cost would be the array copy in the
getReader lock. Given the array already exists, the copy should
be fast?  Perhaps this is the go-ahead solution?
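
A minimal sketch of the two-upto-arrays idea (hypothetical names, not the
branch code): indexing advances a writable array, and getReader publishes a
point-in-time snapshot with one array copy under a lock.

{code}
class UptoArraysSketch {
  private final int[] writeUptos;  // advanced by the indexing thread
  private int[] readUptos;         // guaranteed consistent with the byte blocks
  private final Object readerLock = new Object();

  UptoArraysSketch(int numTerms) {
    writeUptos = new int[numTerms];
    readUptos = new int[numTerms];
  }

  // indexing thread: record how far this term's postings have been written
  void advance(int termID, int newUpto) {
    writeUptos[termID] = newUpto;
  }

  // getReader: publish a point-in-time snapshot with a single array copy
  int[] snapshotUptos() {
    synchronized (readerLock) {
      int[] snapshot = new int[writeUptos.length];
      System.arraycopy(writeUptos, 0, snapshot, 0, writeUptos.length);
      readUptos = snapshot;
      return snapshot;
    }
  }

  int[] currentReadUptos() {
    return readUptos;
  }
}
{code}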

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-10-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921786#action_12921786
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

We don't need to create a new spine array for the byte and int blocks per 
reader; these are append-only spine arrays, so we'll only copy a reference to 
them per reader.
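
As a small sketch of that reference-copy idea (hypothetical names): because
the block spine is append-only, a reader only needs a reference to the current
spine, not a copy of the blocks.

{code}
class BlockSpineSketch {
  // append-only spine: existing byte[] entries are never replaced, only appended
  private volatile byte[][] byteBlocks = new byte[0][];

  // indexing thread: grow the spine by appending a new block
  synchronized void addBlock(byte[] block) {
    byte[][] newSpine = new byte[byteBlocks.length + 1][];
    System.arraycopy(byteBlocks, 0, newSpine, 0, byteBlocks.length);
    newSpine[byteBlocks.length] = block;
    byteBlocks = newSpine;
  }

  // getReader: each reader just keeps a reference to the spine it saw
  byte[][] spineForReader() {
    return byteBlocks;
  }
}
{code}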

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312.patch


 In order to offer user's near realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Todays Lucene based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-10-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920408#action_12920408
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

An interesting question for this issue is how norms will be handled.  It's an 
issue because norms require the total number of terms, which we can compute 
per reader; however, as we add readers, we can easily generate too much garbage 
(ie, a new norms byte[] per field per reader).  Perhaps we can relax the 
accuracy of the calculation for RT?



 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch


 In order to offer user's near realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Todays Lucene based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-10-10 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

As per discussion, this patch removes byte block pool forwarding address 
rewrites by always allocating 4 bytes at the end of each slice.  newSlice has 
been replaced with newSliceByLevel because we were always calling this with the 
first level size.  TestByteSlices passes. 

With this working, we will not need to implement byte block copy-on-write.  
Instead, a posting-upto per reader will be used.
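
A simplified illustration of what 'always allocating the forwarding bytes at
the end of each slice' means (slice sizes and names are illustrative, not the
patch): the tail of every slice is reserved up front for the address of the
next slice, so bytes that were already written never have to be rewritten.

{code}
class SliceSketch {
  static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
  static final int FORWARD_BYTES = 4;  // reserved at the end of every slice

  final byte[] buffer = new byte[32 * 1024];
  int used = 0;

  // allocate a slice of the given level; postings may only use
  // size - FORWARD_BYTES bytes, the reserved tail later receives the
  // address of the next slice
  int newSliceByLevel(int level) {
    int size = LEVEL_SIZE[level];
    if (used + size > buffer.length) {
      throw new IllegalStateException("block full; a real pool would add a new block");
    }
    int start = used;
    used += size;
    // writable range for postings: [start, start + size - FORWARD_BYTES)
    return start;
  }
}
{code}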

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2693) Add delete term and query need to more precisely record the bytes used

2010-10-09 Thread Jason Rutherglen (JIRA)
Add delete term and query need to more precisely record the bytes used
--

 Key: LUCENE-2693
 URL: https://issues.apache.org/jira/browse/LUCENE-2693
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Jason Rutherglen
 Fix For: 4.0


DocumentsWriter's add delete query and add delete term add to the number of 
bytes used regardless of whether the query or term already exists in the 
respective map.
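
A toy sketch of the accounting fix this issue asks for (hypothetical names;
plain strings stand in for the Term/Query keys): bytesUsed is only increased
when the key was not already buffered.

{code}
import java.util.HashMap;
import java.util.Map;

class DeleteAccountingSketch {
  private final Map<String, Integer> deleteTerms = new HashMap<String, Integer>();
  private long bytesUsed;

  void addDeleteTerm(String term, int docIDUpto, int bytesPerEntry) {
    Integer previous = deleteTerms.put(term, docIDUpto);
    if (previous == null) {
      // only a brand-new entry consumes additional buffer space
      bytesUsed += bytesPerEntry;
    }
  }

  long bytesUsed() {
    return bytesUsed;
  }
}
{code}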

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2693) Add delete term and query need to more precisely record the bytes used

2010-10-09 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2693:
-

Attachment: LUCENE-2693.patch

Patch.  The query value was changed to BufferedDeletes.Num instead of Integer 
to save a little on object pointer allocation.

As a side note, there are a number of unrelated generics warnings when 
compiling.
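
Roughly in the spirit of that change (a sketch with hypothetical names, not
the patch itself): a small mutable holder lets the docid-upto be updated in
place rather than boxing a new Integer on every update.

{code}
import java.util.HashMap;
import java.util.Map;

class NumHolderSketch {
  // mutable int holder; updating in place avoids a new Integer per update
  static final class Num {
    private int num;
    Num(int num) { this.num = num; }
    int getNum() { return num; }
    void setNum(int num) { this.num = num; }
  }

  private final Map<String, Num> deleteQueries = new HashMap<String, Num>();

  void addDeleteQuery(String query, int docIDUpto) {
    Num holder = deleteQueries.get(query);
    if (holder == null) {
      deleteQueries.put(query, new Num(docIDUpto));
    } else {
      holder.setNum(docIDUpto);  // reuse the existing holder
    }
  }
}
{code}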

 Add delete term and query need to more precisely record the bytes used
 --

 Key: LUCENE-2693
 URL: https://issues.apache.org/jira/browse/LUCENE-2693
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.9.4, 3.1, 4.0
Reporter: Jason Rutherglen
 Fix For: 2.9.4, 3.1, 4.0

 Attachments: LUCENE-2693.patch


 DocumentsWriter's add delete query and add delete term add to the number of 
 bytes used regardless of whether the query or term already exists in the 
 respective map.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-10-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917354#action_12917354
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

Simon, I'm going to get deletes working, tests passing using maps in the RT 
branch, then we can integrate.  This'll probably be best.

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch, 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, 
 LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2662) BytesHash

2010-10-03 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2662:
-

Affects Version/s: (was: Realtime Branch)
Fix Version/s: (was: Realtime Branch)

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, 
 LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-10-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917416#action_12917416
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

Let's commit this to trunk.  We need to merge all of trunk into the RT branch, 
or vice versa, at some point anyways.  This patch could be a part of that bulk 
merge-in, or we can simply do it now.

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, 
 LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-10-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916914#action_12916914
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

bq. I fear a copy-on-write check per-term is going to be a sizable perf hit.

For indexing?  The byte[] buffers are also using a page-based system.  I think 
we'll need to measure the performance difference.  We can always shift the cost 
to getReader by copying from a writable (indexing-based) tf array into a 
per-reader tf of paged-ints.  While this'd be a complete iteration over the 
length of the terms array, the CPU cache could make it extremely fast (because 
each page would be cached, and we'd be iterating sequentially over an array, 
methinks).

The other cost is the lookup of the upto when iterating the postings, however 
that'd be one time per term-docs instantiation, ie, negligible.
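
A bare-bones sketch of the 'copy the writable tf array into per-reader paged
ints at getReader' idea (hypothetical names; the fixed page size is chosen
only for illustration):

{code}
class PagedIntsSketch {
  static final int PAGE_SIZE = 1024;
  private final int[][] pages;

  // built at getReader time from the writable term-freq array
  PagedIntsSketch(int[] writableTF, int numTerms) {
    int numPages = (numTerms + PAGE_SIZE - 1) / PAGE_SIZE;
    pages = new int[numPages][];
    for (int p = 0; p < numPages; p++) {
      int start = p * PAGE_SIZE;
      int len = Math.min(PAGE_SIZE, numTerms - start);
      pages[p] = new int[len];
      System.arraycopy(writableTF, start, pages[p], 0, len);
    }
  }

  int get(int termID) {
    return pages[termID / PAGE_SIZE][termID % PAGE_SIZE];
  }
}
{code}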



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-10-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916933#action_12916933
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

Are you saying we should simply make deletes work (as is, no BytesRefHash 
conversion), then clean up the RT branch as a merge to trunk of the DWPT 
changes?  I was thinking along those lines.  I can spend time making the rest 
of the unit tests work on the existing RT revision, though this should probably 
happen in conjunction with a merge from trunk.

Or simply make the tests pass, and merge RT -> trunk afterwards?

Also, from what I've seen, deletes seem to work; I'm not sure what exactly 
Michael is referring to.  I'll run the full 'suite' of unit tests, and just 
make each one work?

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-10-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916979#action_12916979
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

{quote}when IW.deleteDocs(Term/Query) is called, we must go to each DWPT, grab 
its current docID, and enroll Term/Query -> docID into that DWPT's pending 
deletes map.{quote}

Ok, that's the change you're referring to.  In the current RT revision, the 
deletes are held in one map in DW; I guess we need to change that.  However, if 
we do, why do we need to keep the seq id or docid as the value in the map?  When 
the delete arrives at the DWPT, we know that any buffered docs with that 
term/query need to be deleted on flush?  (ie, let's *not* worry about the RT 
search use case, yet).  ie2, we can simply add the terms/queries to a set, and 
apply them on flush, ala LUCENE-2679?
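
A sketch of that 'just a set per DWPT, applied on flush' idea (hypothetical
names, strings standing in for Term/Query, and ignoring the docid-upto
bookkeeping as proposed above):

{code}
import java.util.HashSet;
import java.util.Set;

class DWPTDeletesSketch {
  private final Set<String> pendingDeleteTerms = new HashSet<String>();

  // IW.deleteDocuments(Term) fans out to each DWPT and just records the term
  synchronized void bufferDeleteTerm(String term) {
    pendingDeleteTerms.add(term);
  }

  // on flush of this DWPT's segment, apply every buffered term to the
  // just-flushed segment, then clear the buffer
  synchronized Set<String> takePendingDeletes() {
    Set<String> toApply = new HashSet<String>(pendingDeleteTerms);
    pendingDeleteTerms.clear();
    return toApply;
  }
}
{code}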

bq. NRT improvements

We're referring to LUCENE-1516 as NRT and LUCENE-2312 as 'RT'. I'm guessing you 
mean RT? 

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-10-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917046#action_12917046
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

We've implied an additional change to the way deletes are flushed: today 
they're flushed in applyDeletes when segments are merged, however with 
flush-by-DWPT we're applying deletes after flushing the DWPT segment.

Also, we'll have a globalesque buffered deletes, presumably located in IW, that 
buffers deletes for the existing segments, and these should [as today] be 
applied only when segments are merged or getReader is called?

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-10-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917076#action_12917076
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

Ok, I have been stuck/excited about not having to use/understand the 
remap-docids method, because it's hard to debug.  However I see what you're 
saying, and why remap-docids exists.  I'll push the DWP buffered deletes to the 
flushed deletes.  

bq. we'll pay huge cost opening that massive grandaddy segment 

This large cost is from loading the terms index and deleted docs?  When those 
large segments are merged, though, the IO cost is so substantial that loading 
tii or del into RAM probably doesn't account for much of the aggregate IO; 
they're probably in the noise?  Or are you referring to the NRT apply-deletes 
flush?  However, that is on a presumably pooled reader.  Or you're just saying 
that today we're applying deletes across the board to all segments prior to a 
merge, regardless of whether or not they're even involved in the merge?  It 
seems like that is changeable?

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916199#action_12916199
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

bq. We'd need to increase the level 0 slice size...

Yes. 

{quote}but the reader needs to read 'beyond' the end of a given
slice, still? Ie say global maxDoc is 42, and a given posting
just read doc 27 (which in fact is its last doc). It would then
try to read the next doc?{quote}

The posting-upto should stop the reader prior to reaching a byte
element whose value is 0, ie, it should never happen.

The main 'issue', which really isn't one, is that each reader
cannot maintain a copy of the byte[][] spine as it'll be
growing. New buffers will be added and the master posting-upto
will also be changing, therefore allowing 'older' readers to
possibly continue past their original point-in-time byte[][].
This is solved by adding synchronized around the obtainment of
the byte[] buffer from the BBP, thereby preventing out of bounds
exceptions.
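
A minimal sketch of 'synchronized around the obtainment of the byte[] buffer'
(hypothetical names, not the branch code): only the spine growth and the
buffer lookup are synchronized, so a reader can never observe a half-grown
spine.

{code}
class BufferObtainSketch {
  private byte[][] buffers = new byte[4][];
  private int bufferUpto = -1;

  // writer thread: adding a new buffer may grow the spine
  synchronized byte[] newBuffer(int size) {
    bufferUpto++;
    if (bufferUpto == buffers.length) {
      byte[][] grown = new byte[buffers.length * 2][];
      System.arraycopy(buffers, 0, grown, 0, buffers.length);
      buffers = grown;
    }
    buffers[bufferUpto] = new byte[size];
    return buffers[bufferUpto];
  }

  // reader thread: synchronize only the lookup of the buffer
  synchronized byte[] getBuffer(int index) {
    return buffers[index];
  }
}
{code}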

{quote}We don't store tf now do we? Adding 4 bytes per unique
term isn't innocuous!{quote}

What I meant is: if we're merely maintaining the term freq array
during normal, non-RT indexing, then we're not constantly
creating new arrays, so we're in innocuous land, though there is no
use for the array in this case; eg, it shouldn't be created
unless RT has been flipped on, modally. 

{quote}Hmm the full copy of the tf parallal array is going to
put a highish cost on reopen? So some some of transactional
(incremental copy-on-write) data structure is needed (eg
PagedInts)...{quote}

Right, this to me is the remaining 'problem', or rather
something that needs a reasonable go-ahead solution. For now we
can assume PagedInts is the answer.

In addition, to summarize the skip list: it needs to store the
doc, the address into the BBP, and the length to the end of the
slice from the given address. This allows us to point to a
document anywhere in the postings BBP and still continue with
slice iteration. In the test code I've written, the slice level
is stored as well; I'm not sure why/if that's required. I think
it's a hint to the BBP reader as to the level of the next slice.
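
The skip entry being described would carry something like the following
(a sketch; the field names are made up):

{code}
// one skip entry pointing into the postings byte block pool (BBP)
class SkipEntrySketch {
  final int doc;              // document the entry points at
  final int bbpAddress;       // absolute address into the BBP
  final int bytesToSliceEnd;  // remaining bytes in the current slice from that address
  final int sliceLevel;       // hint for the reader about the next slice's level

  SkipEntrySketch(int doc, int bbpAddress, int bytesToSliceEnd, int sliceLevel) {
    this.doc = doc;
    this.bbpAddress = bbpAddress;
    this.bytesToSliceEnd = bytesToSliceEnd;
    this.sliceLevel = sliceLevel;
  }
}
{code}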



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916355#action_12916355
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

{quote}we could factor out a super class from ParallelPostingArray which only 
has the textStart int array, the grow and copy method and let 
ParallelPostingArray subclass it. {quote}

This option makes the most sense.  ParallelByteStartsArray?





 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch, 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch, 
 LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-29 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2655:
-

Attachment: LUCENE-2655.patch

Here's a basic patch with a test case showing delete-by-term not working.  It's 
simply not finding the docs in the reader in applyDeletes; I'm guessing it's 
something basic that's wrong.

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916366#action_12916366
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

Maybe I was using the new flex API wrongly; when I pass in the deleted docs to 
MultiFields.getTermDocsEnum, the test case passes.

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916367#action_12916367
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

I'm only seeing this test from TestIndexWriterDelete fail:

[junit] junit.framework.AssertionFailedError: expected:<3> but was:<0>
[junit] at 
org.apache.lucene.index.TestIndexWriterDelete.testMaxBufferedDeletes(TestIndexWriterDelete.java:118)
[junit] at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:328)


 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-29 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916370#action_12916370
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

There's this one as well, which I'll focus on, though it's an error in 
IR.isCurrent; it doesn't immediately appear to be related to deletes.
{code}
[junit] Testsuite: org.apache.lucene.index.TestIndexWriterReader
[junit] Testcase: 
testUpdateDocument(org.apache.lucene.index.TestIndexWriterReader):FAILED
[junit] null
[junit] junit.framework.AssertionFailedError: null
[junit] at 
org.apache.lucene.index.TestIndexWriterReader.testUpdateDocument(TestIndexWriterReader.java:104)
[junit] at 
org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:328)
{code}

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2655.patch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915797#action_12915797
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

{quote}Hmm so we also copy-on-write a given byte[] block? Is
this because JMM can't make the guarantees we need about other
threads reading the bytes written?{quote}

Correct. The example of where everything could go wrong is the
rewriting of a byte slice forwarding address while a reader is
traversing the same slice. The forwarding address could be
half-written, and suddenly we're bowling in lane 6 when we
should be in lane 9. By making a [read-only] ref copy of the
byte[]s we're ensuring that the byte[]s are in a consistent
state while being read.

So I'm using a boolean[] to tell the writer whether it needs to
make a copy of the byte[]. The boolean[] also tells the writer
if it's already made a copy. Whereas in IndexReader.clone we're
keeping ref counts of the norms byte[], and decrementing each
time we make a copy until finally it's 0, and then we give it to
the GC (here we'd do the same or give it back to the allocator). 
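
A rough sketch of the boolean[]-driven copy-on-write described above
(hypothetical names, not the branch code):

{code}
import java.util.Arrays;

class CopyOnWriteBlocksSketch {
  private final byte[][] blocks;
  private final boolean[] mustCopy;  // true while an open reader may still see blocks[i]

  CopyOnWriteBlocksSketch(byte[][] blocks) {
    this.blocks = blocks;
    this.mustCopy = new boolean[blocks.length];
  }

  // getReader marks every live block as shared with that reader
  synchronized void markShared() {
    Arrays.fill(mustCopy, true);
  }

  // writer: before mutating a block, copy it once if a reader may still see it
  synchronized byte[] writableBlock(int i) {
    if (mustCopy[i]) {
      byte[] copy = new byte[blocks[i].length];
      System.arraycopy(blocks[i], 0, copy, 0, blocks[i].length);
      blocks[i] = copy;
      mustCopy[i] = false;  // copied once; further writes go straight in
    }
    return blocks[i];
  }
}
{code}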

{quote}But even if we do reuse, we will cause tons of garbage,
until the still-open readers are closed? Ie we cannot re-use the
byte[] being held open by any NRT reader that's still
referencing the in-RAM segment after that segment had been
flushed to disk.{quote}

If we do pool, it won't be very difficult to implement, we have
a single point of check-in/out of the byte[]s in the allocator
class.

In terms of the first implementation, by all means we should
minimize tricky areas of the code by not implementing skip
lists and byte[] pooling.

{quote}It's not like 3.x's situation with FieldCache or terms
dict index, for example{quote}

What's the GC issue with FieldCache and terms dict?

{quote}BTW I'm assuming IW will now be modal? Ie caller must
tell IW up front if NRT readers will be used? Because non-NRT
users shouldn't have to pay all this added RAM cost?{quote}

At present it's still all on demand. Skip lists will require
going modal because we need to build those upfront (well we
could go back and build them on demand, that'd be fun). There's
the term-freq parallel array, however if getReader is never
called, it's a single additional array that's essentially
innocuous, if useful.

{quote}Hmm your'e right that each reader needs a private copy,
to remain truly point in time. This (4 bytes per unique term X
number of readers reading that term) is a non-trivial addition
of RAM.{quote}

PagedInt time? However, even that's not going to help much: if,
in between getReader calls, 10,000s of terms were seen, we could
have updated 1000s of pages. AtomicIntArray does not help,
because concurrency isn't the issue; it's point-in-timeness
that's required. Still, I guess PagedInt won't hurt, and in the
case of minimal term freq changes we'd still be potentially
saving RAM. Is there some other data structure we could pull out
of a hat and use?



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916000#action_12916000
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

OK, I think there's a solution that avoids copying the actual
byte[]s: we'd need to alter the behavior of the BBPs. It would require always
allocating 3 empty bytes at the end of a slice for the
forwarding address, rather than what we do today, which is write
the postings up to the end of the slice, then when allocating a
new slice, copying the last 3 bytes forward to the new slice
location. We would also need to pass a unique parallel posting
upto array to each reader. This is required so that the reader
never ventures beyond the end of a slice, as the slice was
written when the reader was instantiated.

This would yield significant savings because we would not be
generating garbage from the byte[]s, which are 32 KB each. They
add up if the indexing is touching many different byte[]s for
example. With this solution, there would essentially not be any
garbage generated from incremental indexing, only after a DWPT's
segment is flushed (and all readers were also GCed). 

The only downside is we'd be leaving those 3 bytes per term
unallocated at all times; that's not a very high price. Perhaps
more impactful is the posting-upto array per reader, which'd be
4 bytes per term, the same cost as the term freq array. It's a
pick-your-poison problem.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916001#action_12916001
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

I guess another possible solution is to do away with interleaved slices 
altogether and simply allocate byte[]s per term and chain them together.  Then 
we would not need to worry about concurrency with slicing.  This would 
certainly make debugging easier; however, it'd add 8 bytes (for the object 
pointer) per term, somewhat negating the parallel array cutover.  Perhaps it's 
just a price we'd want to pay.  That, and we'd probably still need a unique 
posting-upto array per reader.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916005#action_12916005
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

The last comment shows the brain is tired, ie, ignore it because
there would be too many pointers for the byte[]s. 

The comment prior to that, however, will probably work, and I think
there's a solution to the excessive posting-upto int[] per reader
generation. If, when getReader is called, we copy the writable
posting-upto array into a single master posting-upto parallel
array, then we will not need to create a unique int[] per
reader. The reason this works is that past readers iterating
their term docs concurrently with changes to the posting-upto
array will stop at their maxDoc anyway. This'll
be fun to implement.
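
To make that concrete, here's a minimal sketch of the idea (hypothetical
names, not the patch's code): one master posting-upto array that getReader
refreshes in place, rather than a private int[] per reader.

{code}
// Hypothetical sketch of the single master posting-upto array.
final class PostingUptoSnapshot {
  int[] writablePostingUpto = new int[8];  // updated per term by the indexing thread
  int[] masterPostingUpto = new int[8];    // the one copy shared by all readers

  // called under the DWPT lock while getReader() runs
  void publish(int numTerms) {
    if (masterPostingUpto.length < numTerms) {
      masterPostingUpto = java.util.Arrays.copyOf(masterPostingUpto, numTerms);
    }
    // Overwrite in place: readers opened earlier stop at their maxDoc, so
    // values advanced past their snapshot are never read by them.
    System.arraycopy(writablePostingUpto, 0, masterPostingUpto, 0, numTerms);
  }
}
{code}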

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915137#action_12915137
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

There's a little error in the thinking of the last comment.  Also, the best 
solution is probably to store the length of the posting slice in the skip 
list byte pool.  This'll mean a slight modification to ByteSliceReader, 
however I think it'll work.
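
As a rough sketch of what storing the slice length buys us (hypothetical
field names): since a vint can straddle a slice boundary, a skip entry that
only records an address can drop the reader into the middle of a vint, or
leave it guessing where the slice ends; recording the length removes the
guess.

{code}
// Hypothetical fields written per skip entry into the skip list byte pool:
int doc;          // last doc written when the entry was taken
int sliceAddr;    // address of the posting slice being pointed into
int sliceLength;  // usable bytes from sliceAddr before the forwarding address
// Writer: emit entries only on a vint boundary, so sliceAddr never splits a vint.
// Reader: consume at most sliceLength bytes, then follow the next-slice address.
{code}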

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-26 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915079#action_12915079
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

Simon, the patch looks like it's ready for the next stage, ie, factoring it 
out of TermsHashPerField.

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch, 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-26 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915130#action_12915130
 ] 

Jason Rutherglen edited comment on LUCENE-2575 at 9/27/10 1:44 AM:
---

Here are the new parallel arrays.  It seems like something went wrong and there 
are too many, however I think each is required.

{code}
final int[] skipStarts;   // address where the term's skip list starts (for reading)
final int[] skipAddrs;    // where writing left off
final int[] sliceAddrs;   // the start addr of the last posting slice
final byte[] sliceLevels; // posting slice levels
final int[] skipLastDoc;  // last skip doc written
final int[] skipLastAddr; // last skip addr written
{code}

In regards to writing into the skip list the start address of
the first level 9 posting slice: Because we're writing vints
into the posting slices, and vints may span more than 1 byte, we
may (and this has happened in testing) write a vint that spans
slices, so if we record the last slice address and read a vint
from that point, we'll get an incorrect vint. If we start 1+
bytes into a slice, we will not know where the slice ends
(because we are assuming they're 200 bytes in length). Perhaps
in the slice address parallel array we can somehow encode the
first slice's length, or add yet another parallel array for the
length of the first slice.  Something to think about.

  was (Author: jasonrutherglen):
Here are the new parallel arrays.  It seems like something went wrong and 
there are too many, however I think each is required.

{code}
final int[] skipStarts;   // address where the term's skip list starts (for reading)
final int[] skipAddrs;    // where writing left off
final int[] sliceAddrs;   // the start addr of the last posting slice
final byte[] sliceLevels; // posting slice levels
final int[] skipLastDoc;  // last skip doc written
final int[] skipLastAddr; // last skip addr written
{code}

In regards to writing into the skip list the start address of
the first level 9 posting slice: Because we're writing vints
into the posting slices, and vints may span more than 1 byte, we
may (and this has happened in testing) write a vint that spans
slices, so if we record the last slice address and read a vint
from that point, we'll get an incorrect vint. If we start 1+
bytes into a slice, we will not know where the slice ends
(because we are assuming they're 200 bytes in length). Perhaps
in the slice address parallel array we can somehow encode the
first slice's length, or add yet another parallel array for the
length of the first slice.  Something to think about.

We can't simply read
ahead 200 bytes (ie, level 9), nor can
  
 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914838#action_12914838
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

{quote}Can you explain what's the copy on write ByteBlockPool?
Exactly when do we make a copy? {quote}

A copy of the byte[][] refs is made when getReader is called.
Each DWPT is locked (ie, writes stop) and a copy of the byte[][]
(just the refs) is made for that reader. I think the issue at the
moment is that I'm using a boolean[] to signify whether a byte[]
needs to be copied before being written to. As with BV (BitVector)
and norms cloning, read-only references are carried forward, which
would imply making copies of the boolean[] as well. In other words,
as with BV and norms, I think we need ref counts on the individual
byte[]s so that read-only references to byte[]s are carried
forward properly. However, this implies creating a BytesRefCount
object, because a parallel array cannot point back to the same
underlying byte[] if the byte[] in the byte[][] can be replaced
when a copy is made.
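
For reference, the boolean[] copy-on-write being described looks roughly like
this (a hypothetical sketch, not the actual DWPT code):

{code}
// Hypothetical sketch of copy-on-write over the byte[][] refs.
final class CopyOnWriteBlocks {
  byte[][] buffers;  // written by the indexing thread
  boolean[] shared;  // shared[i] == true if a reader snapshot still sees buffers[i]

  CopyOnWriteBlocks(int numBlocks, int blockSize) {
    buffers = new byte[numBlocks][blockSize];
    shared = new boolean[numBlocks];
  }

  // getReader(): under the DWPT lock, copy only the refs, then mark them shared
  byte[][] snapshot(int numUsed) {
    byte[][] copy = new byte[numUsed][];
    System.arraycopy(buffers, 0, copy, 0, numUsed);
    java.util.Arrays.fill(shared, 0, numUsed, true);
    return copy;
  }

  // indexing thread: before writing into block i, leave the readers the old copy
  void beforeWrite(int i) {
    if (shared[i]) {
      buffers[i] = buffers[i].clone();
      shared[i] = false;
    }
  }
}
{code}

The limitation described above sits exactly here: once several readers hold
snapshots, a single boolean per block can't tell when the last of them has
gone away, which is what pushes toward per-block ref counts.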

{quote}Do we have a design thought out for this? The challenge
is because every doc state now has its own private docID
stream{quote}

It sounded easy when I first heard it, however, I needed to
write it down to fully understand and work through what's going
on. That process is located in LUCENE-2558. 

{quote}Well, I was thinking only implement the single-level skip
case (since it ought to be a lot simpler than the
MLSLW/R){quote}

I started on this, eg, implementing a single-level skip list
that reads and writes from the BBP. It's a good lesson in how to
use the BBP.

{quote}Actually, conjunction (AND) queries, and also
PhraseQuery{quote}

Both are very common types of queries, so we probably do need some
type of skipping, which we will have; it'll just be single-level.

{quote}Probably we should stop reusing the byte[] with this
change? So when all readers using a given byte[] are finally
GCd, is when that byte[] is reclaimed.{quote}

I have a suspicion we'll change our minds about pooling byte[]s.
We may end up implementing ref counting anyways (as described
above), and the sudden garbage generated *could* be a massive
change for users? Of course ref counting was difficult to
implement the first time around in LUCENE-1314, though perhaps
it'll be easier the 2nd time.

As a side note, there is still an issue in my mind around the
term frequencies parallel array (introduced in these patches),
in that we'd need to make a copy of it for each reader (because
if it changes, the scoring model becomes inaccurate?). However,
we could in fact use a 2-dimensional PagedBytes (in this case, a
PagedInts) for this purpose. Or is the garbage of an int[] the
size of the number of docs OK per reader? There is also the
lookup cost to consider.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914860#action_12914860
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

Further thoughts on ref counting the byte[]s.  If we add a BytesRefCount (or 
some other similarly named class; I want to call it BytesRef, though I can't 
because that's taken), is adding 4 bytes for the int count variable plus 8 
bytes for the byte[] pointer, ie, 12 bytes total on a 32k (ie, 32768 len) 
byte[], really too much?  I don't think so.
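
As a back-of-the-envelope check (BytesRefCount is just the placeholder name
from above):

{code}
// Hypothetical holder pairing each block with a reference count.
final class BytesRefCount {
  final byte[] bytes = new byte[32768];
  int refCount;  // incremented per reader snapshot, decremented when the reader is closed
}
// Overhead: ~4 bytes for the int + ~8 bytes for the extra object pointer
// = ~12 bytes per 32768-byte block, well under 0.05% extra memory.
{code}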

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914863#action_12914863
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

In regards to the performance effects on writes of obtaining the reader from 
each DWPT, there should not be any, because it is the thread calling getReader 
that will wait for the lock on the DWPT in between doc adds.  The copy-on-write, 
in its most primitive form, is a copy of object references, eg, the cost is 
extremely low.  And so I do not think indexing performance will be affected 
whatsoever by the copy-on-write approach.  Of course we'll need to benchmark to 
verify.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914888#action_12914888
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

A useful API change to the BBP: instead of passing the size in bytes to 
newSlice, pass in the level and/or the size.  In fact, throughout the 
codebase, a level, specifically the first, is all that is ever passed to the 
newSlice method.  The utility of this change is that I'm recording the level 
of the last slice for the skip list in LUCENE-2312.
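
In other words, something along these lines (a sketch only; the size table
and end-marker details stand in for whatever the pool actually uses):

{code}
// Sketch: callers ask for a slice by level instead of by size.
final class SlicePoolSketch {
  static final int[] levelSizeArray = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200}; // assumed
  int upto;                          // next free position in the current buffer
  byte[] buffer = new byte[32768];   // buffer growth/rollover omitted

  int newSlice(int level) {
    int size = levelSizeArray[level];
    int start = upto;
    upto += size;
    buffer[upto - 1] = (byte) (16 | level);  // end marker carrying the level
    return start;
  }
}
{code}

The caller then has the level in hand to record for the skip list, instead of
reverse-engineering it from the size.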

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch, 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914902#action_12914902
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

The RAM buffer single-level skip list writer probably requires two additional 
parallel arrays: one for the beginning address into the skip list BBP, and a 
second for the address upto, ie, where the last skip list entry written 
left off.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914478#action_12914478
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

 BytesRefHash is now final and does not create Entry objects anymore

That's good.

 move ByteBlockPool to o.a.l.utils

Sure why not.

 factoring it out of TermsHashPerField, the next question is are we gonna do 
 that in a different issue and get this committed first?

We need to factor it out of THPF; otherwise this patch isn't really useful for 
committing.  Also, it'll get tested by the entire unit test suite, ie, 
it'll get put through the laundry.

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch, 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914521#action_12914521
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

bq. make sure JIT doesn't play nasty tricks with us again.

What would we do if this happens?

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch, 4.0
Reporter: Jason Rutherglen
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch, 4.0

 Attachments: LUCENE-2662.patch, LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914607#action_12914607
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

The current MultiLevelSkipList* system relies on writing out
fixed length skip list buffers before they are readable. This
obviously will not work for RT so I'm working on modifying MLSL
into new class(es) that writes and reads from the concurrent-ish
BBP. 

In trunk, each level is a RAMOutputStream; that'll need to
change, and each level will likely be a stream keyed into
the BBP. A question is whether we will statically assign the
number of levels prior to the creation of the MLSL, or will we
need to somehow make the number of levels dynamic, in which case
using streams becomes slightly more complicated.
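
If the level count does have to be fixed up front, it can be derived from the
most docs a DWPT's RAM buffer could hold, roughly the way the on-disk writer
sizes its levels (a sketch; parameter names are illustrative):

{code}
// Sketch: number of skip levels for a buffer that can hold at most maxBufferedDocs.
static int numSkipLevels(int maxBufferedDocs, int skipInterval, int maxLevels) {
  int levels = 1;
  int docs = maxBufferedDocs / skipInterval;
  while (docs >= skipInterval && levels < maxLevels) {
    docs /= skipInterval;   // each additional level skips skipInterval times farther
    levels++;
  }
  return levels;
}
{code}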



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914607#action_12914607
 ] 

Jason Rutherglen edited comment on LUCENE-2575 at 9/24/10 3:28 PM:
---

The current MultiLevelSkipList* system relies on writing out
fixed length skip list buffers before they are readable. This
obviously will not work for RT so I'm working on modifying MLSL
into new class(es) that writes and reads from the concurrent-ish
BBP. 

In trunk, each level is a RAMOutputStream, that'll need to
change, and each level will likely be a stream keyed into
the BBP. A question is whether we will statically assign the
number of levels prior to the creation of the MLSL, or will we
need to somehow make the number of levels dynamic, in which case
using streams becomes slightly more complicated.



  was (Author: jasonrutherglen):
The current MultiLevelSkipList* system relies on writing out
fixed length skip list buffers before they are readable. This
obviously will not work for RT so I'm working on modifying MLSL
into new class(es) that writes and reads from the concurrent-ish
BBP. 

In trunk, each level is a RAMOutputStream, that'll nee to
changechange, and each level will likely be a stream keyed into
the BBP. A question is whether we will statically assign the
number of levels prior to the creation of the MLSL, or will we
need to somehow make the number of levels dynamic, in which case
using streams becomes slightly more complicated.


  
 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-24 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914684#action_12914684
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

{quote}Maybe we can not skip until we've hit the max slice? This
way skipping would always know it's on the max slice. This works
out to 429 bytes into the stream... likely this is fine. {quote}

Me like-y. I'll implement the skip list to point to the largest
level slices.
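
For reference, that 429 falls out of the slice level sizes (assuming the
usual 5..200 progression):

{code}
int[] levelSizeArray = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200}; // assumed pool sizes
// bytes written before the first max-level (200 byte) slice begins:
// 5 + 14 + 20 + 30 + 40 + 40 + 80 + 80 + 120 = 429
{code}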

{quote}Can we just have IW allocate a new byte[][] after flush?
So then any open readers can keep using the one they have?{quote}

This means the prior byte[]s will still be recycled after all
active previous flush readers are closed? If there are multiple
readers from the previous flush, we'd probably still need
reference counting (a la BitVector and norms)? Unfortunately a
reference count parallel array will not quite work because we're
copy-on-writing the byte[]s, eg, there's nothing consistent for
the array index to point to. A hash map of byte[]s would
likely be too heavyweight? We may need to implement a ByteArray
object composed of a byte[] and a refcount. This is somewhat
counter to our parallel array memory savings strategy, though it
is directly analogous to the way norms are implemented in
SegmentReader.

{quote}it's possible single level skipping, with a larger skip
interval, is fine for even large RAM buffers.{quote}

True, I'll implement a default of one level, and a default
large-ish skip interval.

{quote}Maybe we can get an initial version of this working,
without the skipping? Ie skipping is implemented as scanning.
{quote}

How many scorers, or how often is skipping used? It's mostly for
disjunction queries? If we limit the skip level to one, and do not
implement the BBP level byte at the beginning of the slice, the
MLSL will be a lot easier (ie, faster) to implement and test.

I'd like to see BytesHash get out of THPF (eg, LUCENE-2662), get
deletes working in the RT branch, and merge the flush by DWPT to
trunk. Concurrently I'll work on the search on the RAM buffer
which is most of the way completed. I'd prefer to test a more
complete version of LUCENE-2312 with skip lists (which can
easily be turned off), so that when we do take it through the
laundromat of testing, we won't need to retrofit anything back
in, re-test, and possibly re-design. 

On a side note related to testing: One naive way I've tested is
to do the copy-on-write of the BBP when the segment needs to be
flushed to disk, and write the segment from the read-only copy
of the BBP. If the segment is correct, then at least we know the
copy worked properly and nothing's missing.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks

2010-09-23 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914296#action_12914296
 ] 

Jason Rutherglen commented on LUCENE-2573:
--

I was hoping something clever would come to me about how to unit test this; 
nothing has.  We can slow down writes to the file(s) via a Thread.sleep, 
however that will only emulate a real file system in RAM, and what then?  I 
thought about testing the flush percentages, however will they be exact?  We 
could test that each flushed segment falls within a percentage range.  I guess 
I just need to run all of the unit tests, however some of those will fail 
because deletes aren't working properly yet.

 Tiered flushing of DWPTs by RAM with low/high water marks
 -

 Key: LUCENE-2573
 URL: https://issues.apache.org/jira/browse/LUCENE-2573
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch


 Now that we have DocumentsWriterPerThreads we need to track total consumed 
 RAM across all DWPTs.
 A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
 tiered approach:  
 - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
 - Flush all DWPTs at a high water mark (e.g. at 110%)
 - Use linear steps in between high and low watermark:  E.g. when 5 DWPTs are 
 used, flush at 90%, 95%, 100%, 105% and 110%.
 Should we allow the user to configure the low and high water mark values 
 explicitly using total values (e.g. low water mark at 120MB, high water mark 
 at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
 config method and use something like 90% and 110% for the water marks?
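
A sketch of the linear steps described above (illustrative only; names and
the percentage choices come from the example):

{code}
// DWPT i (0-based) of n flushes once total RAM crosses its threshold;
// with ramBufferSizeMB=100 and n=5 this yields 90, 95, 100, 105, 110 MB.
static double[] flushThresholdsMB(double ramBufferSizeMB, int numDWPTs) {
  double low = 0.90 * ramBufferSizeMB;   // low water mark
  double high = 1.10 * ramBufferSizeMB;  // high water mark
  double[] thresholds = new double[numDWPTs];
  for (int i = 0; i < numDWPTs; i++) {
    thresholds[i] = numDWPTs == 1 ? high : low + i * (high - low) / (numDWPTs - 1);
  }
  return thresholds;
}
{code}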

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2662) BytesHash

2010-09-22 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2662:
-

Attachment: LUCENE-2662.patch

We need unit tests and a base implementation as BytesHash is abstract...

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913589#action_12913589
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

The current hash implementation needs to be separated out of TermsHashPerField. 
 

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913638#action_12913638
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

Simon, when do you think you'll be posting?

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2662) BytesHash

2010-09-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913651#action_12913651
 ] 

Jason Rutherglen commented on LUCENE-2662:
--

It'd be nice to get deletes working, ie, LUCENE-2655 and move forward in a way 
that's useful long term.  What changes have you made?

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2662.patch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913383#action_12913383
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

This issue is blocked because the change made to ByteBlockPool, to add the 
level of the slice at the beginning of the slice, moves all of the positions 
forward by one.  This has caused TestByteSlices to fail an assertion.  I'm not 
sure if the test needs to be changed, or there's a bug in the new BBP 
implementation.  Either way it's a bit of a challenge to debug.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-09-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913389#action_12913389
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

The patches I've been submitting to LUCENE-2575 probably should go here.  Once 
the new byte block pool that records the slice level at the beginning of the 
slice is finished, the skip list can be completed, and then the basic 
functionality for searching on the RAM buffer will be done.  At that point 
concurrency and memory efficiency can be focused on and tested.  In addition, 
deletes must be implemented.

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch


 In order to offer users near realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2567) RT Terms Dictionary

2010-09-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913385#action_12913385
 ] 

Jason Rutherglen edited comment on LUCENE-2567 at 9/22/10 12:34 AM:


The RT terms dict has been introduced in the LUCENE-2575 patches.  I may end up 
closing this issue, or if needed moving the terms dict code from LUCENE-2575.

  was (Author: jasonrutherglen):
The RT terms dict has been introduced in the LUCENE-2575 patches.  I may 
end up closing this issue, or if needed moving the terms dict code from 
LUCENE-2575 if needed.
  
 RT Terms Dictionary
 ---

 Key: LUCENE-2567
 URL: https://issues.apache.org/jira/browse/LUCENE-2567
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


 Implement an in RAM terms dictionary for realtime search.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-09-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2312:
-

Affects Version/s: Realtime Branch
   (was: 3.0.1)

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch


 In order to offer users near realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2558) Use sequence ids for deleted docs

2010-09-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913400#action_12913400
 ] 

Jason Rutherglen commented on LUCENE-2558:
--

For the deleted docs sequence id array, perhaps I'm a little bit
confused, but how will we signify in the sequence id array that a
document is deleted? I believe we need a secondary sequence id
array for deleted docs that is init'd to -1. When a document is
deleted, the sequence id is set for that doc in the
del-docs-seq-arr. When the deleted docs Bits is accessed for a
given doc, we'll compare the IR's seq-id-upto with the
del-docs-seq-id, and if the IR's seq-id is greater than or equal
to it, the Bits.get method will return true, meaning the document
is deleted.

I am forgetting how concurrency will work in this case, ie,
ensuring multi-threaded visibility under the JMM. Actually,
because we're pausing the writes/deletes when getReader is
called on the DWPT, JMM concurrency should be OK.
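
A minimal sketch of that comparison (a hypothetical class; the real thing
would sit behind the reader's deleted-docs Bits):

{code}
// Hypothetical sketch: deleted docs as a sequence id comparison.
final class SeqIdDeletedDocs {
  private final int[] delDocsSeqId;   // -1 == never deleted
  private final int readerSeqIdUpto;  // highest sequence id visible to this reader

  SeqIdDeletedDocs(int[] delDocsSeqId, int readerSeqIdUpto) {
    this.delDocsSeqId = delDocsSeqId;
    this.readerSeqIdUpto = readerSeqIdUpto;
  }

  // would back Bits.get(doc) for this reader
  boolean isDeleted(int doc) {
    int seq = delDocsSeqId[doc];
    // deleted iff the delete happened at or before this reader's snapshot
    return seq != -1 && seq <= readerSeqIdUpto;
  }
}
{code}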

 Use sequence ids for deleted docs
 -

 Key: LUCENE-2558
 URL: https://issues.apache.org/jira/browse/LUCENE-2558
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Realtime Branch


 Utilizing the sequence ids created via the update document
 methods, we will enable IndexReader deleted docs over a sequence
 id array. 
 One of the decisions is what primitive type to use. We can start
 off with an int[], then possibly move to a short[] (for lower
 memory consumption) that wraps around.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2662) BytesHash

2010-09-21 Thread Jason Rutherglen (JIRA)
BytesHash
-

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2662) BytesHash

2010-09-21 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2662:
-

Priority: Minor  (was: Major)

 BytesHash
 -

 Key: LUCENE-2662
 URL: https://issues.apache.org/jira/browse/LUCENE-2662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Realtime Branch


 This issue will have the BytesHash separated out from LUCENE-2186

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913403#action_12913403
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

A further question for this issue, regarding copy-on-write of
the 1st dimension of the byte[][] array: will we want to keep
a count of references to each individual byte array when, let's
say, multiple readers keep references to it (the array with the
bytes data)? Assuming we want to continue pooling the byte[]s, I
think we'll need to use reference counting, or else simply not
pool the byte[]s after flushing, in order to avoid overwriting
arrays that are still referenced.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-09-20 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912497#action_12912497
 ] 

Jason Rutherglen commented on SOLR-1301:


Alexander,

I think we'll need to use Hadoop's Mini Cluster in order to have a proper unit 
test.  Adding Jetty as a dependency shouldn't be too much of a problem, as Solr 
already includes a small version of Jetty.  That being said, the unit test 
won't be much fun to write.  I can assist if needed.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: Next

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
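
For orientation, the data path described above has roughly this shape (a
stripped-down sketch; everything other than the Hadoop and SolrJ types is
made up, and the real patch delegates document conversion to its
SolrDocumentConverter):

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: reduce output -> SolrInputDocument -> batched adds to an
// EmbeddedSolrServer, with commit/optimize when the writer is closed.
public class SketchSolrRecordWriter implements RecordWriter<Text, Text> {
  private final SolrServer solr;  // an EmbeddedSolrServer in the patch
  private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
  private static final int BATCH_SIZE = 1000;  // illustrative value

  public SketchSolrRecordWriter(SolrServer solr) {
    this.solr = solr;
  }

  public void write(Text key, Text value) throws IOException {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    doc.addField("text", value.toString());
    batch.add(doc);
    if (batch.size() >= BATCH_SIZE) {
      flushBatch();
    }
  }

  public void close(Reporter reporter) throws IOException {
    flushBatch();
    try {
      solr.commit();
      solr.optimize();
    } catch (Exception e) {
      throw new IOException(e.toString());
    }
  }

  private void flushBatch() throws IOException {
    if (batch.isEmpty()) return;
    try {
      solr.add(batch);
      batch.clear();
    } catch (Exception e) {
      throw new IOException(e.toString());
    }
  }
}
{code}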

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912312#action_12912312
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Simon, on second thought, let's go ahead and factor out BytesHash.  Do you want 
to submit a patch for the realtime branch and post it here, or should I?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-19 Thread Jason Rutherglen (JIRA)
Get deletes working in the realtime branch
--

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


Deletes don't work anymore; a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912313#action_12912313
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

I opened an issue for the deletes LUCENE-2655

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912312#action_12912312
 ] 

Jason Rutherglen edited comment on LUCENE-2324 at 9/19/10 10:52 PM:


Simon, on second thought, lets go ahead and factor out BytesHash, do you want 
to submit a patch for the realtime branch and post it here or should I?

  was (Author: jasonrutherglen):
Simon, on second thought, lets go ahead and factor out BytesHash, do you 
want to submit a patch for the realtime patch and post it here or should I?
  
 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912315#action_12912315
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

Here's a relevant comment from LUCENE-2324:

https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12891256page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12891256

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912319#action_12912319
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

{quote}Maybe we could reuse (factor out) TermsHashPerField's
custom hash here, for the buffered Terms? It efficiently maps a
BytesRef -> int.{quote}

I'm trying to get a feel for what kind of deletes we want
working in the flush-by-DWPT merge vis-à-vis the realtime branch
(ie, the release where we have the new realtime search/indexing
functionality). 

Factoring out BytesHash and storing the terms in a byte block
pool will allow replacing the current hash map of terms and will
likely conserve RAM. Will we need to replace docIDUpto and
instead use sequence ids? 

For now we can additionally implement flushing deletes on
merges, like today, for the flush-by-DWPT merge to trunk, and then
implement live (aka foreground) deletes for the realtime search
branch merge to trunk. 
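
As a rough illustration of the factored-out BytesHash idea above, here is a
minimal sketch (all class and method names are made up, and this is not the
actual TermsHashPerField code): term bytes live in a single append-only byte
pool, the open-addressed table stores only int offsets, and the RAM used is
simply the pool position plus the table size.

{code}
import java.util.Arrays;

// Hedged sketch of a BytesRef -> int style hash backed by a byte pool.
final class BytePoolTermHash {
  private byte[] pool = new byte[1 << 15]; // all term bytes, 2-byte length prefix per term
  private int poolUpto = 0;
  private int[] ids = new int[16];         // hash slot -> term start offset in the pool, -1 = empty
  private int count = 0;

  BytePoolTermHash() { Arrays.fill(ids, -1); }

  /** Returns the term's pool offset (its "id"), adding the term if it is not present. */
  int add(byte[] term, int off, int len) {
    int slot = hash(term, off, len) & (ids.length - 1);
    while (ids[slot] != -1) {
      if (equalsTerm(ids[slot], term, off, len)) return ids[slot];
      slot = (slot + 1) & (ids.length - 1);           // linear probing
    }
    while (poolUpto + len + 2 > pool.length) pool = Arrays.copyOf(pool, pool.length * 2);
    int start = poolUpto;
    pool[poolUpto++] = (byte) (len >> 8);
    pool[poolUpto++] = (byte) len;
    System.arraycopy(term, off, pool, poolUpto, len);
    poolUpto += len;
    ids[slot] = start;
    if (++count * 2 > ids.length) rehash();
    return start;
  }

  /** RAM accounting is trivial: the pool position plus the table size. */
  long ramBytesUsed() { return poolUpto + (long) ids.length * 4; }

  private boolean equalsTerm(int start, byte[] term, int off, int len) {
    int storedLen = ((pool[start] & 0xff) << 8) | (pool[start + 1] & 0xff);
    if (storedLen != len) return false;
    for (int i = 0; i < len; i++) if (pool[start + 2 + i] != term[off + i]) return false;
    return true;
  }

  private int hash(byte[] b, int off, int len) {
    int h = 0;
    for (int i = off; i < off + len; i++) h = 31 * h + b[i];
    return h;
  }

  private void rehash() {
    int[] old = ids;
    ids = new int[old.length * 2];
    Arrays.fill(ids, -1);
    for (int start : old) {
      if (start == -1) continue;
      int len = ((pool[start] & 0xff) << 8) | (pool[start + 1] & 0xff);
      int slot = hash(pool, start + 2, len) & (ids.length - 1);
      while (ids[slot] != -1) slot = (slot + 1) & (ids.length - 1);
      ids[slot] = start;
    }
  }
}
{code}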

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2655) Get deletes working in the realtime branch

2010-09-19 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912339#action_12912339
 ] 

Jason Rutherglen commented on LUCENE-2655:
--

A couple more things... 

The BytesHash, or some other aptly named class, can implement the
quicksort of terms (again from TermsHashPerField), which will
replace the sorted terms map used in the RT branch for deletes.
The RT branch isn't yet calculating the RAM usage of deletes. By
using the byte block pool, calculating the RAM usage will be
trivial (as the BBP automatically records the number of bytes used). 

The RT branch has an implementation of delete using the min/max
sequence ids for a given segment. What else is needed? 
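
A hedged sketch of that term-id quicksort (the TermOrder callback is an
assumption standing in for "compare the term bytes two ids point at in the
pool"); sorting the int ids directly means no Map of terms is needed:

{code}
// Hedged sketch: quicksort int term ids by the term bytes they reference.
final class TermIdSorter {
  interface TermOrder { int compare(int termIdA, int termIdB); }

  static void quicksort(int[] ids, int lo, int hi, TermOrder order) {
    if (lo >= hi) return;
    int pivot = ids[(lo + hi) >>> 1];
    int i = lo, j = hi;
    while (i <= j) {
      while (order.compare(ids[i], pivot) < 0) i++;
      while (order.compare(ids[j], pivot) > 0) j--;
      if (i <= j) { int tmp = ids[i]; ids[i] = ids[j]; ids[j] = tmp; i++; j--; }
    }
    quicksort(ids, lo, j, order);
    quicksort(ids, i, hi, order);
  }
}
{code}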

 Get deletes working in the realtime branch
 --

 Key: LUCENE-2655
 URL: https://issues.apache.org/jira/browse/LUCENE-2655
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


 Deletes don't work anymore, a patch here will fix this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12911144#action_12911144
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Simon, I think factoring out BytesHash is useful, though not a must-have 
for committing the flush-by-DWPT code.

Mike, I need to finish the unit tests for LUCENE-2573.

Michael, what is the issue with deletes?  We don't need deletes to use sequence 
ids yet, do we?  Maybe we should open a separate issue to make deletes work for 
the realtime/DWPT branch?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910201#action_12910201
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

bq. we know what size (level + 1, ceiling'd) to make the next slice.

Thanks.  In the midst of debugging last night I realized this.  The next 
question is whether to remove it.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910268#action_12910268
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

bq. I think the sync'd flush is a big bottleneck

Is this because indexing stops while the DWPT segment is being flushed to disk, 
or are you referring to a different sync?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909771#action_12909771
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

Because of the way byte slices work (eg, they need to know the
size of a slice up front before iterating over it), we can't
simply point to the middle of a slice and read without most
likely running into the forwarding address.

It seems the skip list will need to point to the beginning of a
slice. This will make the interval iteration in the RAM buffer
skip list writer a little more complicated than today, in that
it'll need to store positions that are the start of byte slices.
In other words, the intervals will be slightly uneven at times.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909839#action_12909839
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

Is there a way to know the level of a slice given only the forwarding 
address/position?  It doesn't look like it.  Hmm... This could mean encoding 
the level or the size of the slice into the slice itself, which would lengthen 
slices in general; I suppose, though, that the level index would only add one 
byte, and that would be okay. 

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909852#action_12909852
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

In the following line of ByteBlockPool.allocSlice we're recording the slice 
level; however, it's at the end of the slice rather than the beginning, which is 
where we'll need to write the level in order to implement slice seek.  I'm not 
immediately sure what reads the level at this end position of the byte[].

{code}
buffer[byteUpto-1] = (byte) (16|newLevel);
{code}
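
As a rough sketch of the alternative floated above, if each slice carried its
level in its first byte, a reader could seek to a slice start and know how much
payload follows. This is a hedged sketch, not the existing ByteBlockPool code:
the "low nibble holds the level" encoding and the 4-byte forwarding address at
the end of the slice are assumptions, and the size/next-level tables below just
mirror the levelSizeArray/nextLevelArray progression idea.

{code}
// Hedged sketch of slice seek if the level were written at the slice start.
final class SliceSeekSketch {
  static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
  static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

  /** Payload bytes readable from a slice whose first byte encodes its level. */
  static int payloadBytes(byte[] block, int sliceStart) {
    int level = block[sliceStart] & 0x0f;            // assumption: low nibble = level
    return LEVEL_SIZE[level] - 1 /* level byte */ - 4 /* forwarding address */;
  }

  /** The level the next slice would get if this one fills up. */
  static int nextLevel(byte[] block, int sliceStart) {
    return NEXT_LEVEL[block[sliceStart] & 0x0f];
  }
}
{code}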

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-14 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

Term frequency is recorded and returned.  There are Terms, TermsEnum, and 
DocsEnum implementations.  Still needed: term vectors and doc stores exposed via 
the RAM reader, concurrency unit tests, and a payload unit test.  Still quite rough.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-14 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

Added a unit test for payloads, term vectors, and doc stores.  The reader 
flushes term vectors and doc stores on demand, once per reader.  Also, little 
things are getting cleaned up in the realtime branch.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909580#action_12909580
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

For the postings skip list we need to implement seek on the
ByteSliceReader. However, if we're rewriting a portion of a
slice, we could have a problem: we'd be storing an absolute
position in the skip list, and by the time we go to look up the
value, those byte(s) could have been altered so that they're no
longer delta-encoded doc ids but instead the forwarding address
to the next slice.

Do we need an intelligent mechanism that interacts with the byte
slice writer so it doesn't point at byte array elements (ie, the
end of slices) that could later be converted into forwarding
addresses?

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908849#action_12908849
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

One thing I noticed (correct me if I'm wrong) is that the term doc
frequency (the one stored per term, ie, TermsEnum.docFreq)
doesn't currently seem to be recorded in the RAM buffer code
tree. It will be easy to add, though if we make it accurate per
RAM index reader then we could be allocating a unique array, with
length equal to the number of terms, per reader. I'll implement it
this way to start and we can change it later if necessary.
Actually, to save RAM this could be another use case where a
two-dimensional copy-on-write array is practical.
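
A hedged sketch of that two-dimensional copy-on-write idea (all names are made
up): per-term docFreq values live in fixed-size pages; a reader snapshot shares
every page by reference, and the writer copies a page only the first time it
touches it after a snapshot, so each reader costs far less than a full
numTerms-length array.

{code}
// Hedged sketch: paged copy-on-write int array for per-term counts like docFreq.
final class CopyOnWriteIntPages {
  private static final int PAGE_SIZE = 1024;
  private int[][] pages = new int[0][];
  private boolean[] shared = new boolean[0];   // true if a snapshot still references the page

  synchronized void set(int index, int value) {
    int page = index / PAGE_SIZE, slot = index % PAGE_SIZE;
    ensurePage(page);
    if (shared[page]) {                        // copy-on-write: first write after a snapshot
      pages[page] = pages[page].clone();
      shared[page] = false;
    }
    pages[page][slot] = value;
  }

  synchronized int get(int index) {
    int page = index / PAGE_SIZE;
    return page < pages.length && pages[page] != null ? pages[page][index % PAGE_SIZE] : 0;
  }

  /** Cheap snapshot for a new RAM reader: copies only the page references. */
  synchronized int[][] snapshot() {
    java.util.Arrays.fill(shared, true);
    return pages.clone();
  }

  private void ensurePage(int page) {
    if (page >= pages.length) {
      pages = java.util.Arrays.copyOf(pages, page + 1);
      shared = java.util.Arrays.copyOf(shared, page + 1);
    }
    if (pages[page] == null) {
      pages[page] = new int[PAGE_SIZE];
      shared[page] = false;
    }
  }
}
{code}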

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-11 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

Here's a start at concurrency, the terms dictionary, and
iterating over doc ids. 

* It needs concurrency unit tests

* At an as-yet-undetermined interval, we need to consolidate
the existing terms into a sorted int[] rather than continue to
use the ConcurrentSkipListMap, which consumes far more RAM. The
tradeoff, and the reason for using the CSLM, is the level of
concurrency it provides, at the cost of greater memory
consumption compared with a sorted int[] of term ids.

* An int[]-based term enum needs to be implemented, and in
addition a multi-term enum; maybe there's one we can reuse, but
I'm not familiar enough with the new flex code base.

* Copy-on-write is used to obtain a read-only version of the
ByteBlockPool and IntBlockPool. In the case of the byte blocks,
a boolean[] marks which elements need to be copied before the
DocumentsWriterPerThread writes to them when rewriting a byte
slice forwarding address.

* A write lock on each DWPT guarantees that as reference copies
are made, the arrays being copied will not be altered in flight.
Even though obtaining a complete IndexReader[] means waiting for
each in-flight document to finish, we're not blocking indexing,
only the obtaining of the IRs. I can't see this being an issue
for most use cases.

* Similarly, a reference to the ParallelPostingsArray is copied
(rather than making a full copy) for use by the RAM-buffer-based
IndexReader. It is OK for the PPA to change during future doc
adds, as only the elements greater than the IR's max term id
will be altered; ie, we're not going to run into JMM visibility
issues because the writes and the read-only reference copies
occur inside a reentrant lock.

* Recycling of byte[]s becomes a bit more complex, as RAM IRs
will likely hold references to them. When the RAM IR is closed,
however, the byte[]s can be recycled (see the sketch below). The
user could see unusual RAM usage spikes if IRs are not closed
properly.
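
A hedged sketch of that recycling concern (names are made up): a block is
returned to the free pool only once no open RAM reader still references it,
eg, via a simple reference count.

{code}
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch: byte blocks are recycled only when the last RAM reader releases them.
final class RefCountedBlock {
  final byte[] bytes;
  private final AtomicInteger refs = new AtomicInteger(1);  // 1 = owned by the writer

  RefCountedBlock(int blockSize) { bytes = new byte[blockSize]; }

  /** Called when a RAM IndexReader that references this block is opened. */
  void incRef() { refs.incrementAndGet(); }

  /** Called by the writer on flush and by each RAM IndexReader on close. */
  void decRef(ArrayDeque<byte[]> freeBlocks) {
    if (refs.decrementAndGet() == 0) {
      freeBlocks.addLast(bytes);    // now safe to recycle
    }
  }
}
{code}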



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-11 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2575:
-

Attachment: LUCENE-2575.patch

This includes a basic implementation of the sorted-term-id-based
term enum. We'll want to over-allocate the sorted term id array
so that future merges of new term ids will not require
allocating a new array for growth. I think that overall the RAM
buffer based searching will not require too much more of a RAM
outlay. The merging of new term ids could occur in a background
thread if we think it's expensive; however, for now we can simply
merge them in on demand as new RAM readers are created.

Seek is implemented as a binary search of the sorted term ids.
If this is not efficient enough, we can implement a terms index
like the current system's.

For now the conversion from the CSLM to the sorted term id array
can happen at a percentage of the total number of terms, which
I'll default to 10%. We may want to make this a function (eg, a
percentage) of RAM consumption in the future.
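
A hedged sketch of the seek and merge just described (TermOrder is an
assumption standing in for "compare the term bytes two ids point at in the
pool"):

{code}
// Hedged sketch: sorted term ids with binary-search seek and on-demand merge of new ids.
final class SortedTermIds {
  interface TermOrder { int compare(int termIdA, int termIdB); }

  private int[] sorted = new int[0];

  /** Merge already-sorted new term ids into the existing sorted array. */
  void mergeIn(int[] newSortedIds, TermOrder order) {
    int[] merged = new int[sorted.length + newSortedIds.length];
    int i = 0, j = 0, k = 0;
    while (i < sorted.length && j < newSortedIds.length) {
      merged[k++] = order.compare(sorted[i], newSortedIds[j]) <= 0 ? sorted[i++] : newSortedIds[j++];
    }
    while (i < sorted.length) merged[k++] = sorted[i++];
    while (j < newSortedIds.length) merged[k++] = newSortedIds[j++];
    sorted = merged;   // a real impl would over-allocate so growth rarely copies
  }

  /** Binary-search seek: index of the first term >= the target, or -1 if none. */
  int seekCeil(int targetTermId, TermOrder order) {
    int lo = 0, hi = sorted.length - 1, result = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (order.compare(sorted[mid], targetTermId) >= 0) { result = mid; hi = mid - 1; }
      else { lo = mid + 1; }
    }
    return result;
  }
}
{code}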

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907493#action_12907493
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

I'm finally understanding the slice concept: basically we're
over-allocating space within the ByteBlockPool byte[]s for more
postings for a particular term, hence the levelSizeArray, which
determines the length of each slice of a byte[] the postings
will use. They're probably not always filled in completely?

It's a bit tricky to follow by reading the code, which makes
figuring out how to make the RAM buffer concurrent challenging,
especially in the newSlice method, which rewrites the end of the
last slice with the forwarding index/address of the next slice.
It's very clever; however, maybe we can encapsulate it better
with methods delineating the various operations, which right now
are performed directly on the assortment of arrays. In general we
can possibly get away with using copy-on-write to achieve
performant single-threaded write and multi-threaded reader
concurrency.
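
To make the encapsulation idea concrete, here is a hedged sketch of what named
operations could look like; every name is made up, and this is not the existing
ByteBlockPool API:

{code}
// Hedged sketch: name the raw array manipulations instead of doing them inline.
interface SliceOps {
  /** Allocate the first slice for a term and return its start offset. */
  int newSlice(int level);

  /** Allocate the next, larger slice and return its start offset. */
  int growSlice(int currentLevel);

  /** Overwrite the tail of a full slice with the 4-byte address of the next slice. */
  void writeForwardingAddress(int endOfFullSlice, int nextSliceStart);

  /** Append one byte at the current write position, growing the slice if needed. */
  void writeByte(byte b);
}
{code}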



 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks

2010-09-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906798#action_12906798
 ] 

Jason Rutherglen commented on LUCENE-2573:
--

bq. shouldn't tiered flushing take care of this

Faulty thinking for a few minutes.

{quote}but this won't be most efficient, in general? Ie we could end up 
creating tiny segments depending on luck-of-the-thread-scheduling?{quote}

True.  Instead, we may want to simply not flush the current DWPT if it is in 
fact not the highest RAM user.  When addDoc is called on the thread with the 
highest RAM usage, we can then flush that one.

bq. there's no longer a need to track per-doc pending RAM

I'll remove it from the code.

{quote}If a buffer is not in the pool (ie not free), then it's in use and we 
count that as RAM used{quote}

Ok, I'll make the change.  

{quote}we have to track net allocated, in order to trim the buffers (drop them, 
so GC can reclaim) when we are over the .setRAMBufferSizeMB{quote}

I haven't seen this in the realtime branch.  Reclamation of extra allocated 
free blocks may need to be reimplemented.  

I'll increment num bytes used when a block is returned for use.

On this topic, do you have any thoughts yet about how to make the block pools 
concurrent?  I'm still leaning towards a random access file (seek style) 
interface, because it is easy to make concurrent and hides the underlying 
block management mechanism rather than directly exposing it as today's API 
does, which can lend itself to problematic usage in the future.
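
A hedged sketch of that seek-style reader interface over a read-only view of
the blocks (names are made up; this is not an existing Lucene class):

{code}
// Hedged sketch: readers seek to a global position; the block math stays hidden.
final class BlockPoolInput {
  private final byte[][] blocks;   // read-only view of the pool's blocks
  private final int blockSize;
  private long pos;

  BlockPoolInput(byte[][] blocks, int blockSize) {
    this.blocks = blocks;
    this.blockSize = blockSize;
  }

  void seek(long position) { pos = position; }

  byte readByte() {
    byte b = blocks[(int) (pos / blockSize)][(int) (pos % blockSize)];
    pos++;
    return b;
  }

  void readBytes(byte[] dst, int off, int len) {
    for (int i = 0; i < len; i++) dst[off + i] = readByte();
  }
}
{code}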

 Tiered flushing of DWPTs by RAM with low/high water marks
 -

 Key: LUCENE-2573
 URL: https://issues.apache.org/jira/browse/LUCENE-2573
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2573.patch


 Now that we have DocumentsWriterPerThreads we need to track total consumed 
 RAM across all DWPTs.
 A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
 tiered approach:  
 - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
 - Flush all DWPTs at a high water mark (e.g. at 110%)
 - Use linear steps in between high and low watermark:  E.g. when 5 DWPTs are 
 used, flush at 90%, 95%, 100%, 105% and 110%.
 Should we allow the user to configure the low and high water mark values 
 explicitly using total values (e.g. low water mark at 120MB, high water mark 
 at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
 config method and use something like 90% and 110% for the water marks?
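
As a small worked example of the linear steps described above (a sketch, not
existing Lucene code), the per-DWPT flush triggers fall out of a simple
interpolation between the two water marks:

{code}
// Hedged sketch: n evenly spaced flush thresholds between the low and high water marks.
final class WaterMarks {
  static double[] thresholds(double lowMB, double highMB, int n) {
    double[] t = new double[n];
    for (int i = 0; i < n; i++) {
      t[i] = n == 1 ? lowMB : lowMB + i * (highMB - lowMB) / (n - 1);
    }
    return t;
  }
  // e.g. with ramBufferSizeMB = 100 and 5 DWPTs:
  // thresholds(90, 110, 5) -> 90, 95, 100, 105, 110 (MB)
}
{code}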

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks

2010-09-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906801#action_12906801
 ] 

Jason Rutherglen commented on LUCENE-2573:
--

bq. We can modify MockRAMDir to optionally take its sweet time when writing 
certain files?

Yes, I think we need to implement something of this nature.  We *could* even 
randomly assign a different delay value per flush.  Of course, how the test 
would instigate this from outside of DW is somewhat of a different issue.

 Tiered flushing of DWPTs by RAM with low/high water marks
 -

 Key: LUCENE-2573
 URL: https://issues.apache.org/jira/browse/LUCENE-2573
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2573.patch


 Now that we have DocumentsWriterPerThreads we need to track total consumed 
 RAM across all DWPTs.
 A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
 tiered approach:  
 - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
 - Flush all DWPTs at a high water mark (e.g. at 110%)
 - Use linear steps in between high and low watermark:  E.g. when 5 DWPTs are 
 used, flush at 90%, 95%, 100%, 105% and 110%.
 Should we allow the user to configure the low and high water mark values 
 explicitly using total values (e.g. low water mark at 120MB, high water mark 
 at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
 config method and use something like 90% and 110% for the water marks?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks

2010-09-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2573:
-

Attachment: LUCENE-2573.patch

* perDocAllocator is removed from DocumentsWriterRAMAllocator

* getByteBlock and getIntBlock always increment numBytesUsed

The test that simply prints out debugging messages looks better.  I need to 
figure out unit tests.

 Tiered flushing of DWPTs by RAM with low/high water marks
 -

 Key: LUCENE-2573
 URL: https://issues.apache.org/jira/browse/LUCENE-2573
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2573.patch, LUCENE-2573.patch


 Now that we have DocumentsWriterPerThreads we need to track total consumed 
 RAM across all DWPTs.
 A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
 tiered approach:  
 - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
 - Flush all DWPTs at a high water mark (e.g. at 110%)
 - Use linear steps in between high and low watermark:  E.g. when 5 DWPTs are 
 used, flush at 90%, 95%, 100%, 105% and 110%.
 Should we allow the user to configure the low and high water mark values 
 explicitly using total values (e.g. low water mark at 120MB, high water mark 
 at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
 config method and use something like 90% and 110% for the water marks?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks

2010-09-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906918#action_12906918
 ] 

Jason Rutherglen commented on LUCENE-2573:
--

The last patch also only flushes a DWPT if it's the highest RAM consumer.
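
A hedged sketch of that policy (names are made up): a DWPT that crosses its
threshold defers to whichever sibling currently holds the most RAM.

{code}
import java.util.List;

// Hedged sketch: only the biggest active RAM consumer gets flushed.
final class FlushPolicySketch {
  static boolean shouldFlush(long candidateBytes, List<Long> otherActiveDwptBytes) {
    for (long other : otherActiveDwptBytes) {
      if (other > candidateBytes) {
        return false;   // let the larger consumer flush first
      }
    }
    return true;
  }
}
{code}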

 Tiered flushing of DWPTs by RAM with low/high water marks
 -

 Key: LUCENE-2573
 URL: https://issues.apache.org/jira/browse/LUCENE-2573
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2573.patch, LUCENE-2573.patch


 Now that we have DocumentsWriterPerThreads we need to track total consumed 
 RAM across all DWPTs.
 A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
 tiered approach:  
 - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
 - Flush all DWPTs at a high water mark (e.g. at 110%)
 - Use linear steps in between high and low watermark:  E.g. when 5 DWPTs are 
 used, flush at 90%, 95%, 100%, 105% and 110%.
 Should we allow the user to configure the low and high water mark values 
 explicitly using total values (e.g. low water mark at 120MB, high water mark 
 at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
 config method and use something like 90% and 110% for the water marks?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2573) Tiered flushing of DWPTs by RAM with low/high water marks

2010-09-07 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-2573:
-

Attachment: LUCENE-2573.patch

There was a small bug in the choice of the max DWPT, in that all DWPTs, 
including ones that were already scheduled to flush, were being compared against 
the current DWPT (ie, the one being examined for possible flushing).

 Tiered flushing of DWPTs by RAM with low/high water marks
 -

 Key: LUCENE-2573
 URL: https://issues.apache.org/jira/browse/LUCENE-2573
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch


 Now that we have DocumentsWriterPerThreads we need to track total consumed 
 RAM across all DWPTs.
 A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
 tiered approach:  
 - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
 - Flush all DWPTs at a high water mark (e.g. at 110%)
 - Use linear steps in between high and low watermark:  E.g. when 5 DWPTs are 
 used, flush at 90%, 95%, 100%, 105% and 110%.
 Should we allow the user to configure the low and high water mark values 
 explicitly using total values (e.g. low water mark at 120MB, high water mark 
 at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
 config method and use something like 90% and 110% for the water marks?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


