[jira] [Updated] (ACCUMULO-4629) Seeking in timestamp range is slow

2018-08-09 Thread Christopher Tubbs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Tubbs updated ACCUMULO-4629:

Fix Version/s: (was: 2.0.0)

> Seeking in timestamp range is slow
> --
>
> Key: ACCUMULO-4629
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4629
> Project: Accumulo
>  Issue Type: Bug
>Reporter: Keith Turner
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fluo's internal schema uses the first 4 bits of the timestamp to store 
> different types of information per column.  These first 4 bits divide the 
> timestamp space up into 16 ranges.  Fluo has server side iterators that 
> consider information in one of these 16 ranges and then seek forward to 
> another of the 16 ranges.
> Unfortunately, Accumulo's built in iterator that processes delete marker 
> makes seeking within the timestamps of a column an O(N^2) operation.  This is 
> because of the way deletes work in Accumulo.  A delete marker for timestamp X 
> in Accumulo deletes anything with a timestamp <= X.  
> When seeking to timestamp Y, the Accumulo iterator that handles deletes will 
> scan from MAX_LONG to Y looking for any deletes that may keep you from seeing 
> data at timestamp Y.  The following example shows what the delete iterator 
> will do when a user iterator does some seeks.
>  * User iterator seeks to stamp 1,000,000.  This causes the delete iter to 
> scan from MAX_LONG to 1,000,000 looking for delete markers.
>  * User iterator seeks to stamp 900,000.  This causes the delete iter to scan 
> from MAX_LONG to 900,000 looking for delete markers.
>  * User iterator seeks to stamp 500,000.  This causes the delete iter to scan 
> from MAX_LONG to 500,000 looking for delete markers.
> So Fluo can seek efficiently, it has done some [serious 
> shenanigans|https://github.com/apache/incubator-fluo/blob/rel/fluo-1.0.0-incubating/modules/accumulo/src/main/java/org/apache/fluo/accumulo/iterators/TimestampSkippingIterator.java#L164]
>  using reflection to remove  the DeleteIterator.  The great work being done 
> on ACCUMULO-3079 will likely break this crazy reflection code.  So I would 
> like to make a change to upstream Accumulo that allows efficient seeking in 
> the timestamp range.  I have thought of the following two possible solutions.
>  * Make the DeleteIterator stateful so that it remember what ranges it has 
> scanned for deletes.  I don't really like this solution because it will add 
> expense to every seek in Accumulo for an edge case.
>  * Make it possible to create tables with an exact delete behavior. Meaning a 
> delete for timestamp X will only delete an existing row column with that 
> exact timestamp.  This option could only be chosen at table creation time and 
> could not be changed.  For this delete behavior, the delete iterator doesnot 
> need to scan for every seek.
> Are there other possible solutions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4629) Seeking in timestamp range is slow

2018-08-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ACCUMULO-4629:
-
Labels: pull-request-available  (was: )

> Seeking in timestamp range is slow
> --
>
> Key: ACCUMULO-4629
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4629
> Project: Accumulo
>  Issue Type: Bug
>Reporter: Keith Turner
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Fluo's internal schema uses the first 4 bits of the timestamp to store 
> different types of information per column.  These first 4 bits divide the 
> timestamp space up into 16 ranges.  Fluo has server side iterators that 
> consider information in one of these 16 ranges and then seek forward to 
> another of the 16 ranges.
> Unfortunately, Accumulo's built in iterator that processes delete marker 
> makes seeking within the timestamps of a column an O(N^2) operation.  This is 
> because of the way deletes work in Accumulo.  A delete marker for timestamp X 
> in Accumulo deletes anything with a timestamp <= X.  
> When seeking to timestamp Y, the Accumulo iterator that handles deletes will 
> scan from MAX_LONG to Y looking for any deletes that may keep you from seeing 
> data at timestamp Y.  The following example shows what the delete iterator 
> will do when a user iterator does some seeks.
>  * User iterator seeks to stamp 1,000,000.  This causes the delete iter to 
> scan from MAX_LONG to 1,000,000 looking for delete markers.
>  * User iterator seeks to stamp 900,000.  This causes the delete iter to scan 
> from MAX_LONG to 900,000 looking for delete markers.
>  * User iterator seeks to stamp 500,000.  This causes the delete iter to scan 
> from MAX_LONG to 500,000 looking for delete markers.
> So Fluo can seek efficiently, it has done some [serious 
> shenanigans|https://github.com/apache/incubator-fluo/blob/rel/fluo-1.0.0-incubating/modules/accumulo/src/main/java/org/apache/fluo/accumulo/iterators/TimestampSkippingIterator.java#L164]
>  using reflection to remove  the DeleteIterator.  The great work being done 
> on ACCUMULO-3079 will likely break this crazy reflection code.  So I would 
> like to make a change to upstream Accumulo that allows efficient seeking in 
> the timestamp range.  I have thought of the following two possible solutions.
>  * Make the DeleteIterator stateful so that it remember what ranges it has 
> scanned for deletes.  I don't really like this solution because it will add 
> expense to every seek in Accumulo for an edge case.
>  * Make it possible to create tables with an exact delete behavior. Meaning a 
> delete for timestamp X will only delete an existing row column with that 
> exact timestamp.  This option could only be chosen at table creation time and 
> could not be changed.  For this delete behavior, the delete iterator doesnot 
> need to scan for every seek.
> Are there other possible solutions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ACCUMULO-4629) Seeking in timestamp range is slow

2017-04-25 Thread Keith Turner (JIRA)

 [ 
https://issues.apache.org/jira/browse/ACCUMULO-4629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner updated ACCUMULO-4629:
---
Description: 
Fluo's internal schema uses the first 4 bits of the timestamp to store 
different types of information per column.  These first 4 bits divide the 
timestamp space up into 16 ranges.  Fluo has server side iterators that 
consider information in one of these 16 ranges and then seek forward to another 
of the 16 ranges.

Unfortunately, Accumulo's built in iterator that processes delete marker makes 
seeking within the timestamps of a column an O(N^2) operation.  This is because 
of the way deletes work in Accumulo.  A delete marker for timestamp X in 
Accumulo deletes anything with a timestamp <= X.  

When seeking to timestamp Y, the Accumulo iterator that handles deletes will 
scan from MAX_LONG to Y looking for any deletes that may keep you from seeing 
data at timestamp Y.  The following example shows what the delete iterator will 
do when a user iterator does some seeks.

 * User iterator seeks to stamp 1,000,000.  This causes the delete iter to scan 
from MAX_LONG to 1,000,000 looking for delete markers.
 * User iterator seeks to stamp 900,000.  This causes the delete iter to scan 
from MAX_LONG to 900,000 looking for delete markers.
 * User iterator seeks to stamp 500,000.  This causes the delete iter to scan 
from MAX_LONG to 500,000 looking for delete markers.

So Fluo can seek efficiently, it has done some [serious 
shenanigans|https://github.com/apache/incubator-fluo/blob/rel/fluo-1.0.0-incubating/modules/accumulo/src/main/java/org/apache/fluo/accumulo/iterators/TimestampSkippingIterator.java#L164]
 using reflection to remove  the DeleteIterator.  The great work being done on 
ACCUMULO-3079 will likely break this crazy reflection code.  So I would like to 
make a change to upstream Accumulo that allows efficient seeking in the 
timestamp range.  I have thought of the following two possible solutions.

 * Make the DeleteIterator stateful so that it remember what ranges it has 
scanned for deletes.  I don't really like this solution because it will add 
expense to every seek in Accumulo for an edge case.
 * Make it possible to create tables with an exact delete behavior. Meaning a 
delete for timestamp X will only delete an existing row column with that exact 
timestamp.  This option could only be chosen at table creation time and could 
not be changed.  For this delete behavior, the delete iterator doesnot need to 
scan for every seek.

Are there other possible solutions?




  was:
Fluo's internal schema uses the first 4 bits of the timestamp to store 
different types of information per column.  These first 4 bits divide the 
timestamp space up into 16 ranges.  Fluo has server side iterators that 
consider information in one of these 16 ranges and then seek forward to another 
of the 16 ranges.

Unfortunately, Accumulo's built in iterator that processes delete marker makes 
seeking within the timestamps of a column an O(N^2) operation.  This is because 
of the way deletes work in Accumulo.  A delete marker for timestamp X in 
Accumulo deletes anything with a timestamp <= X.  

When seeking to timestamp Y, the Accumulo iterator that handles deletes will 
scan from MAX_LONG to Y looking for any deletes that may keep you from seeing 
data at timestamp Y.  The following example shows what the delete iterator will 
do when a user iterator does some seeks.

 * User iterator seeks to stamp 1,000,000.  This causes the delete iter to scan 
from MAX_LONG to 1,000,000 looking for delete markers.
 * User iterator seeks to stamp 900,000.  This causes the delete iter to scan 
from MAX_LONG to 900,000 looking for delete markers.
 * User iterator seeks to stamp 500,000.  This causes the delete iter to scan 
from MAX_LONG to 500,000 looking for delete markers.

So Fluo can seek efficiently, it has done some [serious 
shenanigans|https://github.com/apache/incubator-fluo/blob/rel/fluo-1.0.0-incubating/modules/accumulo/src/main/java/org/apache/fluo/accumulo/iterators/TimestampSkippingIterator.java#L164]
 using reflection to remove  the DeleteIterator.  The great work being done on 
ACCUMULO-3079 will likely break this crazy reflection code.  So I would like to 
make a change to upstream Accumulo that allows efficient seeking in the 
timestamp range.  I have thought of the following two possible solutions.

 * Make the DeleteIterator stateful so that it remember what ranges it has 
scanned for deletes.  I don't really like this solution because it will add 
expense to every seek in Accumulo for an edge case.
 * Make it possible to create tables with an exact delete behavior. Meaning a 
delete for timestamp X will only delete an existing row column with that exact 
timestamp.  This option could only be chosen at table creation time and could 
not be changed.  For this delete behavior, the delete