Thanks James. I asked about the filtering example just to check my understanding was right, but I agree it's probably a corner case.
Re the documentation - I don't think the problem is not conforming to the sorted key part. If you had row keys which were integers in increasing order, and in the iterator added a million to each row key and emitted that, then you'd still get problems if there was a reseek (assuming that adding a million took you out of the range). Admittedly I can't see why you'd do that, but I'd read the javadoc, the manual and the Accumulo book carefully and I hadn't picked up that the actual key that is emitted is relevant to the reseek issue.

BTW, none of this is meant to reflect badly on the iterator stack - iterators are really powerful and are one of Accumulo's main selling points.

Dave.

On 16 May 2015 at 14:55, James Hughes <[email protected]> wrote:

> Hi Dave,
>
> I can speak to the first question a little bit. The one time I saw this,
> I traced the code and saw that after emitting a certain number of bytes,
> the iterator stack was recreated. In that case, no further keys would have
> been filtered since the current key-value pair being emitted would trigger
> the reset and that key would be used for the re-seek. I'll apply all
> caveats to that explanation: it was Accumulo 1.4 and I didn't learn why
> the stack was stopped and recreated or other times that may happen.
>
> On the other hand, one could imagine a tablet server dying in the middle
> of returning entries. I have no idea of the details of how Accumulo
> handles that. Worst case, you may be right about some reprocessing, but
> all this sounds like a corner case.
>
> For the documentation, writing about implementation details directly may
> not be the best way. I'd hope that the documentation would make it clear
> that all iterators (even presumed 'top' or 'final' iterators) should
> conform to the 'sorted key' part of the contract.
>
> Thanks,
>
> Jim
>
>
> On Sat, May 16, 2015 at 3:27 AM, Dave Hardcastle <
> [email protected]> wrote:
>
>> A couple of follow-up questions...
>>
>> So, is it true to say that a filtering iterator that is filtering out a
>> high percentage of the key-values in a range might have to redo a lot of
>> work if a reseek happens? (It's reseeked to the last emitted key, but a lot
>> of key-values past that may already have been rejected by the filter.)
>>
>> Would it be worth making the fact that the reseek happens to the last
>> emitted key explicit in the documentation? It seems natural to me to assume
>> that the reseek happens to one key past the last read key. I don't think
>> the javadoc for the seek() method in SortedKeyValueIterator makes it quite
>> clear enough.
>>
>> Thanks,
>>
>> Dave.
>>
>> On 15 May 2015 at 19:32, Eric Newton <[email protected]> wrote:
>>
>>> is it the same instance of the iterator object
>>>
>>>
>>> No, it is not.
>>>
>>> On Fri, May 15, 2015 at 2:16 PM, Dave Hardcastle <
>>> [email protected]> wrote:
>>>
>>>> Jim,
>>>>
>>>> That explains a lot - I knew that the iterator stack could be resumed
>>>> in the middle of a range, but didn't realise that it used the last emitted
>>>> key to decide where to resume.
>>>>
>>>> Just so I'm clear, when iterators get stopped and later resumed, is it
>>>> the same instance of the iterator object that's restarted (so that I could
>>>> store state in there and use that to help the reseek) or is it a new
>>>> instance of the iterator that has to be able to resume purely on the basis
>>>> of the last emitted key?
>>>>
>>>> As you say though, it's probably best to stick to modifying values only.
>>>>
>>>> Thanks very much,
>>>>
>>>> Dave.
>>>>
>>>> On 15 May 2015 at 18:55, James Hughes <[email protected]> wrote:
>>>>
>>>>> Hi Dave,
>>>>>
>>>>> The big thing to note is that your iterator stack may get stopped and
>>>>> torn down for various reasons. As Accumulo recreates the stack, it will
>>>>> call 'seek' with the last emitted key in order to resume.
>>>>>
>>>>> If you are returning keys out of order in an iterator, the 'seek'
>>>>> method needs to be able to undo the transformation and call 'seek'
>>>>> appropriately. That's not impossible, but it isn't trivial.
>>>>>
>>>>> In GeoMesa, we did something like that at one point (without having a
>>>>> smart 'seek'). I enjoyed two days of debugging trying to figure out why
>>>>> medium sized requests would hang. (There was an infinite loop....) From
>>>>> that experience, I'd suggest only modifying values.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>> On Fri, May 15, 2015 at 1:26 PM, Dave Hardcastle <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've always assumed that the last iterator in the stack can make
>>>>>> arbitrary changes to keys and values, including not returning the keys in
>>>>>> sorted order. I know that SortedKeyValueIterator says that "anything
>>>>>> implementing this interface should return keys in sorted order" - but I
>>>>>> don't see a good reason that has to be true for the final iterator. This
>>>>>> assumption seems to be backed up by the manual which says that "the only
>>>>>> safe way to generate additional data in an iterator is to alter the
>>>>>> current key-value pair" - it doesn't say that making arbitrary
>>>>>> modifications to the rowkey or key is forbidden.
>>>>>>
>>>>>> I have a situation where I am making a transformation of the rowkey
>>>>>> that may not preserve the ordering of the keys. When I scan for
>>>>>> individual ranges I get the correct results. When I scan for two ranges
>>>>>> using a BatchScanner, I get lots of data back which is not in the ranges
>>>>>> I queried for. I am not explicitly checking that I have not gone beyond
>>>>>> the range, but that should not be necessary as I am not doing any
>>>>>> seeking, only consuming the key-values I receive.
>>>>>>
>>>>>> So, my main question is whether the last iterator is allowed to not
>>>>>> return keys in sorted order?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave.
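
The reseek behaviour discussed in this thread (the stack is torn down and 'seek' is called with the last emitted key) can be illustrated with a toy model. The sketch below is plain Python and does not use or reproduce any Accumulo code. The function name `scan`, the parameters `batch_size` and `max_batches`, and the use of integer keys are all assumptions made for illustration. It only shows why a key-changing top iterator interacts badly with resuming from the last emitted key; what a real tablet server does in these cases may differ.

```python
from bisect import bisect_left, bisect_right


def scan(sorted_keys, start, end, transform, batch_size, max_batches=100):
    """Toy model (hypothetical, not Accumulo code) of a scan over [start, end].

    After every batch the 'iterator stack' is thrown away, and the next batch
    resumes from just past the last *emitted* key, i.e. the key after
    `transform` has been applied, not the last key read from the source.
    """
    emitted = []
    resume_after = None  # last emitted key; None before the first batch
    for _ in range(max_batches):
        # 'seek': the first batch starts at the range start, later batches
        # start strictly after the last emitted key.
        if resume_after is None:
            pos = bisect_left(sorted_keys, start)
        else:
            pos = bisect_right(sorted_keys, resume_after)
        batch = []
        while (pos < len(sorted_keys) and sorted_keys[pos] <= end
               and len(batch) < batch_size):
            batch.append(transform(sorted_keys[pos]))
            pos += 1
        if not batch:
            return emitted  # nothing left in the range: scan is complete
        emitted.extend(batch)
        resume_after = batch[-1]
    # Guard added for this sketch so a non-terminating scan fails loudly.
    raise RuntimeError("scan did not terminate: the reseek keeps going backwards")


keys = list(range(10))

# Keys emitted unchanged: every entry comes back exactly once.
print(scan(keys, 0, 9, lambda k: k, batch_size=3))

# Dave's example: keys stay sorted, but adding a million moves the resume
# point past the end of the range, so the scan stops after the first batch.
print(scan(keys, 0, 9, lambda k: k + 1_000_000, batch_size=3))
```

In this model a transform that sends the resume point backwards (for example one that maps every key to 0) never finishes and hits the `max_batches` guard, which is in the spirit of the infinite loop Jim describes. Leaving keys alone and modifying only values avoids both problems.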
