Apologies for the delay in replying, and thanks for all the responses - this is a very supportive mailing list.

Keith,

I've now read the paragraph you linked in the javadoc for SKVI several times, and I still think it doesn't explicitly say that the key that is *returned* is relevant. I interpret the range (b,c) as being a subset of the original range queried for (a,c). The relevant thing is that if the range (a,c) is seeked to by the client, and range (a,b] is processed with keys [x,y] emitted, and then a reseek happens, then the reseek happens to (y,...), which to me is unintuitive. I'd have expected the reseek to happen to the next key after the last key that was read, i.e. (b,...).

That paragraph could be amended by adding something like the following sentence: "In fact the reseek happens to the key after the last key that was emitted. This should be considered when creating iterators that modify the key before emitting it."

However, the new chapter on iterators in the manual does clarify things nicely, particularly the lines "Iterators should not return any Keys that fall outside of the provided Range" and "Best practice is to never emit entries outside the seek range" - I was doing that in my iterator, and it wasn't obvious to me from previous documentation that I shouldn't do that. That chapter also makes it clear that the reseek happens to the last emitted key, both in the code example under the "obtuse re-seek case..." sentence and in "Specifically, the new Range is created from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan, non-inclusive".

So, now that we've got the 1.7.0 manual with the detailed chapter on iterators, I think we're in good shape.

Thanks again,

Dave.

On 18 May 2015 at 23:47, Keith Turner <[email protected]> wrote:

>
> On Mon, May 18, 2015 at 6:31 PM, Keith Turner <[email protected]> wrote:
>
>>
>> On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <[email protected]> wrote:
>>
>>> A couple of follow-up questions...
>>>
>>> So, is it true to say that a filtering iterator that is filtering out a
>>> high percentage of the key-values in a range might have to redo a lot of
>>> work if a reseek happens? (It's reseeked to the last emitted key, but a lot
>>> of key-values past that may already have been rejected by the filter.)
>>>
>>
>> This may happen; it depends on what the tserver is doing. Let's assume a
>> call to next on the iterator advances to the next top key, and not past
>> it. If the tserver calls next after the buffer is full, then what you
>> described could happen.
>>
>> So if the tserver is doing something like the following, I think it would
>> redo work. Need to investigate this.
>>
>
> I investigated. Sorry for the spam, should have done this before
> sending.
>
> The following code services batch scans. Seems like it checks if the
> buffer is full before calling next.
>
> https://github.com/apache/accumulo/blob/1.6.2/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L1538
>
> The following code services scans; it seems to also check if the buffer
> is full before calling next.
>
> https://github.com/apache/accumulo/blob/1.6.2/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L1684
>
>>
>> iter = ....
>> iter.seek(...)
>>
>> while(iter.hasTop() && !buffer.isFull()){
>>     buffer.add(iter.getTopKey(), iter.getTopValue())
>>     iter.next() //if this call to next is made even when buffer is full, it could redo work
>> }
>>
>> return buffer; //will reseek with last key (non-inclusive) in buffer later.
>>
>>>
>>> Would it be worth making the fact that the reseek happens to the last
>>> emitted key explicit in the documentation? It seems natural to me to assume
>>> that the reseek happens to one key past the last read key. I don't think
>>> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
>>> clear enough.
>>>
>>
>> When the reseek is done using the last key returned, it makes it
>> non-inclusive. What are your thoughts on the following paragraph?
>>
>> https://github.com/apache/accumulo/blob/1.6.2/core/src/main/java/org/apache/accumulo/core/iterators/SortedKeyValueIterator.java#L81
>>
>>>
>>> Thanks,
>>>
>>> Dave.
>>>
>>> On 15 May 2015 at 19:32, Eric Newton <[email protected]> wrote:
>>>
>>>> is it the same instance of the iterator object
>>>>
>>>> No, it is not.
>>>>
>>>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <[email protected]> wrote:
>>>>
>>>>> Jim,
>>>>>
>>>>> That explains a lot - I knew that the iterator stack could be resumed
>>>>> in the middle of a range, but didn't realise that it used the last emitted
>>>>> key to decide where to resume.
>>>>>
>>>>> Just so I'm clear, when iterators get stopped and later resumed, is it
>>>>> the same instance of the iterator object that's restarted (so that I could
>>>>> store state in there and use that to help the reseek) or is it a new
>>>>> instance of the iterator that has to be able to resume purely on the basis
>>>>> of the last emitted key?
>>>>>
>>>>> As you say though, it's probably best to stick to modifying values
>>>>> only.
>>>>>
>>>>> Thanks very much,
>>>>>
>>>>> Dave.
>>>>>
>>>>> On 15 May 2015 at 18:55, James Hughes <[email protected]> wrote:
>>>>>
>>>>>> Hi Dave,
>>>>>>
>>>>>> The big thing to note is that your iterator stack may get stopped and
>>>>>> torn down for various reasons. As Accumulo recreates the stack, it will
>>>>>> call 'seek' with the last emitted key in order to resume.
>>>>>>
>>>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>>>> method needs to be able to undo the transformation and call 'seek'
>>>>>> appropriately. That's not impossible, but it isn't trivial.
>>>>>>
>>>>>> In GeoMesa, we did something like that at one point (without having a
>>>>>> smart 'seek'). I enjoyed two days of debugging trying to figure out why
>>>>>> medium sized requests would hang. (There was an infinite loop....) From
>>>>>> that experience, I'd suggest only modifying values.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've always assumed that the last iterator in the stack can make
>>>>>>> arbitrary changes to keys and values, including not returning the keys
>>>>>>> in sorted order. I know that SortedKeyValueIterator says that "anything
>>>>>>> implementing this interface should return keys in sorted order" - but I
>>>>>>> don't see a good reason that has to be true for the final iterator. This
>>>>>>> assumption seems to be backed up by the manual which says that "the only
>>>>>>> safe way to generate additional data in an iterator is to alter the
>>>>>>> current key-value pair" - it doesn't say that making arbitrary
>>>>>>> modifications to the rowkey or key is forbidden.
>>>>>>>
>>>>>>> I have a situation where I am making a transformation of the rowkey
>>>>>>> that may not preserve the ordering of the keys. When I scan for
>>>>>>> individual ranges I get the correct results. When I scan for two ranges
>>>>>>> using a BatchScanner, I get lots of data back which is not in the ranges
>>>>>>> I queried for. I am not explicitly checking that I have not gone beyond
>>>>>>> the range, but that should not be necessary as I am not doing any
>>>>>>> seeking, only consuming the key-values I receive.
>>>>>>>
>>>>>>> So, my main question is whether the last iterator is allowed to not
>>>>>>> return keys in sorted order?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Dave.
>>>>>>>
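
For anyone reading this thread in the archive later, here is a minimal sketch of the reseek behaviour discussed above. It is plain Java and does not use the Accumulo API. The class `ReseekSketch`, the `scan` method, the `maxBatches` guard and the `"a_"` key rewrite are all made up for illustration, and a `TreeMap` stands in for the sorted data in a tablet. The only thing it takes from the thread is the rule that each new batch starts from the last *emitted* key, non-inclusive, with a fresh iterator that carries no state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

class ReseekSketch {

    /**
     * Simulates a scan of [start, end] that is served in batches. After each
     * full batch the "iterator" is discarded and a new one is seeked to
     * (lastEmittedKey, end]. maxBatches is a hypothetical guard so that a
     * key-rewriting transform cannot loop forever in this sketch.
     */
    public static List<String> scan(TreeMap<String, String> table, String start, String end,
            int batchSize, UnaryOperator<String> keyTransform, int maxBatches) {
        List<String> emitted = new ArrayList<>();
        String seekKey = start;
        boolean inclusive = true; // only the very first seek includes its start key

        for (int batch = 0; batch < maxBatches; batch++) {
            if (seekKey.compareTo(end) > 0) {
                break; // the reseek point is already past the end of the range
            }
            int count = 0;
            String lastEmitted = null;
            // A fresh view per batch: nothing survives from the previous batch.
            for (String key : table.subMap(seekKey, inclusive, end, true).keySet()) {
                if (count == batchSize) {
                    break; // buffer is full, stop before reading further
                }
                lastEmitted = keyTransform.apply(key);
                emitted.add(lastEmitted);
                count++;
            }
            if (count < batchSize) {
                break; // the range was exhausted within this batch
            }
            // Reseek from the last emitted key, non-inclusive.
            seekKey = lastEmitted;
            inclusive = false;
        }
        return emitted;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        for (String row : Arrays.asList("a", "b", "c", "d", "e")) {
            table.put(row, "v");
        }
        // Keys emitted unchanged: every key is returned exactly once.
        System.out.println(scan(table, "a", "e", 2, key -> key, 10));
        // Keys rewritten so they sort before the keys actually read:
        // the reseek keeps landing before "b" and the scan never reaches "d".
        System.out.println(scan(table, "a", "e", 2, key -> "a_" + key, 4));
    }
}
```

With the identity transform the scan returns a, b, c, d, e once each. With the rewriting transform it keeps re-reading b and c until the guard stops it, which is the same shape as the hang Jim described, and is why modifying only values (or keeping emitted keys in order and inside the seek range) is the safe choice.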
