"top-level" with respect to the side channel description is inverted with respect to your diagram. Fig. A should be more like this:

    RfileIter1  RfileIter2
        |          /
        |_________/
      Merge
        |
    VersioningIterator
        |
    OtherIterators   InjectIterator
        |              /
        |_____________/
      Merge
        |
        v

Thus, VersioningIterator and OtherIterators don't see any of the entries
coming from InjectIterator.

Adam

On Mon, Feb 16, 2015 at 1:23 PM, Dylan Hutchison <[email protected]> wrote:

>> why you want to use a side channel instead of implementing the merge in
>> your own iterator
>
> Here is a picture showing the difference:
>
> Fig. A: Using a side channel to add a top-level iterator.
>
>     RfileIter1  RfileIter2  InjectIterator  ...
>         |          /           /
>         |_________/           /
>         o__*(3-way merge)*___/
>         |
>     VersioningIterator
>         |
>     OtherIterators
>         |
>         v
>        ...
>
>
> Fig. B: Merging in the data at a later stage
>
>     RfileIter1  RfileIter2  ...
>         |          /
>         o_________/
>         |
>     VersioningIterator
>         |
>         |    InjectIterator
>         o________/
>         |
>     OtherIterators
>         |
>         v
>        ...
>
> (note: we're free to add iterators before the VersioningIterator too)
>
> Unless the order of iterators matters (e.g., the VersioningIterator
> position matters if InjectIterator generates an entry with the same row,
> colFamily and colQualifier as an entry in the table), the two styles will
> give the same results.
>
>> This has implications on composability with other iterators, since
>> downstream iterators would not see anything sent to the side channel but
>> they would see things merged and returned by a MultiIterator.
>
> If the iterator is at the top level, then every iterator below it will see
> output from the top level iterator. Did you mean composability with other
> iterators added at the top level? If hypothetical iterator
> "InjectIterator2" needs to see the results of "InjectIterator", then we
> need to place InjectIterator2 below InjectIterator on the hierarchy,
> whether in Fig. A or Fig. B.
>
> For my particular situation, reading from another Accumulo table inside an
> iterator, I'm not sure which is better.
> I like the idea of adding another data stream as a top-level source, but
> Fig. B is possible too.
>
> Regards,
> Dylan Hutchison
>
>
> On Mon, Feb 16, 2015 at 11:34 AM, Adam Fuchs <[email protected]> wrote:
>
>> Dylan,
>>
>> If I recall correctly (which I give about 30% odds), the original purpose
>> of the side channel was to split up things like delete tombstone entries
>> from "regular" entries so that other iterators sitting on top of a
>> bifurcating iterator wouldn't have to handle the special tombstone
>> preservation logic. This worked in theory, but it never really caught on.
>> I'm not sure any operational code is calling the registerSideChannel method
>> right now, so you're sort of in pioneering territory. That said, this looks
>> like it should work as you described it.
>>
>> Can you describe why you want to use a side channel instead of
>> implementing the merge in your own iterator (e.g. subclassing MultiIterator
>> and overriding the init method)? This has implications on composability
>> with other iterators, since downstream iterators would not see anything
>> sent to the side channel but they would see things merged and returned by a
>> MultiIterator.
>>
>> Adam
>>
>> On Feb 16, 2015 3:18 AM, "Dylan Hutchison" <[email protected]> wrote:
>>
>>>> If you can do a merge sort insertion, then you can guarantee order and
>>>> it's fine.
>>>
>>> Yep, I guarantee the iterator we add as a side channel will emit tuples
>>> in sorted order.
>>>
>>> On a suggestion from David Medinets, I modified my testing code to use a
>>> MiniAccumuloCluster set to 2 tablet servers. I then set a table split on
>>> "row3" before launching the compaction. The result looks good. Here is
>>> output from a run on a local Accumulo instance. Note that we write more
>>> values than we read.
>>>
>>> 2015-02-16 02:44:51,125 [tserver.Tablet] DEBUG: Starting MajC k;row3<
>>> (USER) [hdfs://localhost:9000/accumulo/tables/k/t-00000g4/F00000g5.rf] -->
>>> hdfs://localhost:9000/accumulo/tables/k/t-00000g4/A00000g7.rf_tmp
>>> [name:InjectIterator, priority:15,
>>> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
>>> 2015-02-16 02:44:51,127 [tserver.Tablet] DEBUG: Starting MajC k<;row3
>>> (USER) [hdfs://localhost:9000/accumulo/tables/k/default_tablet/F00000g6.rf]
>>> --> hdfs://localhost:9000/accumulo/tables/k/default_tablet/A00000g8.rf_tmp
>>> [name:InjectIterator, priority:15,
>>> class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
>>> 2015-02-16 02:44:51,190 [tserver.Compactor] DEBUG: *Compaction k<;row3
>>> 2 read | 4 written* | 111 entries/sec | 0.018 secs
>>> 2015-02-16 02:44:51,194 [tserver.Compactor] DEBUG: *Compaction k;row3<
>>> 1 read | 4 written* | 43 entries/sec | 0.023 secs
>>>
>>>
>>> In addition, output from the DebugIterator looks as expected. There is
>>> a re-seek after reading the first tablet to the key after the last entry
>>> returned in the first tablet.
>>>
>>> DEBUG:
>>> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@15085e63,
>>> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@586cc05e)
>>> DEBUG: 0x1C2BFB13 seek((-inf,+inf), [], false)
>>>
>>> ... <snipped logs>
>>>
>>> DEBUG:
>>> init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@2b048c59,
>>> {}, org.apache.accumulo.tserver.TabletIteratorEnvironment@379a3d1f)
>>> DEBUG: 0x5946E74B seek([row2 colF3:colQ3 [] 9223372036854775807
>>> false,+inf), [], false)
>>>
>>>
>>> It seems the side channel strategy will hold up. We have opened a new
>>> world of Accumulo-foo. Of course, the real test is a multi-node instance
>>> with more than 10 entries of data.
>>>
>>> Regards, Dylan
>>>
>>>
>>> On Sun, Feb 15, 2015 at 11:17 PM, Andrew Wells <[email protected]>
>>> wrote:
>>>
>>>> The main issue with adding data in an iterator is order. If you can
>>>> do a merge sort insertion, then you can guarantee order and it's fine.
>>>> But if you are inserting based on input, you cannot guarantee order, and
>>>> it can only be on a scan iterator.
>>>>
>>>> On Feb 15, 2015 8:03 PM, "Dylan Hutchison" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I've been toying with the registerSideChannel(iter)
>>>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/IteratorEnvironment.html#registerSideChannel(org.apache.accumulo.core.iterators.SortedKeyValueIterator)>
>>>>> method on the IteratorEnvironment passed to iterators through the
>>>>> init() method. From what I can tell, the method allows you to add
>>>>> another iterator as a top-level source, to be merged in along with
>>>>> other usual top-level sources such as the in-memory cache and RFiles.
>>>>>
>>>>> Are there any downsides to using registerSideChannel( ) to "add new
>>>>> data" to an iterator chain? It looks like this is fairly stable, so long
>>>>> as the iterator we add as a side channel implements seek() properly so as
>>>>> to only return entries whose rows are within a tablet. I imagine it works
>>>>> like so:
>>>>>
>>>>> Suppose we set a custom iterator InjectIterator that registers a side
>>>>> channel inside init() at priority 5 as a one-time major compaction
>>>>> iterator. InjectIterator forwards other operations to its parent, as in
>>>>> WrappingIterator
>>>>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/WrappingIterator.html>.
>>>>> We start the compaction:
>>>>>
>>>>> Tablet 1 (a,g]
>>>>>
>>>>>    1. init() called on InjectIterator. Creates the side channel
>>>>>    iterator, calls init() on it, and registers it.
>>>>>    2. init() called on VersioningIterator.
>>>>>    3. init() called on top level iterators, including Rfiles,
>>>>>    in-memory cache and the new side channel.
>>>>>    4. seek( (a,g] ) called on InjectIterator.
>>>>>    5. seek( (a,g] ) called on VersioningIterator.
>>>>>    6. seek( (a,g] ) called on top level iterators
>>>>>    7. next() called on InjectIterator. Forwards to parent.
>>>>>    8. next() called on VersioningIterator. Forwards to parent.
>>>>>    9. next() called on top level iterator (a MultiIterator
>>>>>    <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/system/MultiIterator.html>).
>>>>>    The next value is read from all the top-level iterator sources and
>>>>>    the one with the least key is cached ready to go.
>>>>>    10. ...
>>>>>
>>>>> Tablet 2 (g,p) --- same as tablet 1 except steps 4-6 call seek( (g,p)
>>>>> ). Done in parallel with tablet 1 if on a different tablet server.
>>>>>
>>>>> Is this an accurate depiction? Anything I should treat with caution?
>>>>> It seems to work on my single-node instance, so tips about difficulties
>>>>> going to multi-node are good.
>>>>>
>>>>> Code available here.
>>>>> <https://github.com/Accla/d4m_api_java/blob/0d8c62164d5c0b59f949ce23c1b85536809764d2/src/main/java/edu/mit/ll/graphulo/InjectIterator.java#L166>
>>>>>
>>>>> Regards,
>>>>> Dylan Hutchison
>>>>>
>>>>> --
>>>>> www.cs.stevens.edu/~dhutchis
>>>>>
>>>>
>>>
>>>
>>> --
>>> www.cs.stevens.edu/~dhutchis
>>>
>>
>
>
> --
> www.cs.stevens.edu/~dhutchis
>
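
The question running through this thread is whether entries from the
injected source pass through the VersioningIterator or bypass it. The toy
model below is only a sketch of that ordering question in plain Python. It
does not use any Accumulo API. The names merge, versioning, rfile1, rfile2
and inject are hypothetical stand-ins, and entries are simplified to
(key, timestamp, value) tuples sorted by key ascending and timestamp
descending.

```python
import heapq

# Toy entries: (key, timestamp, value). Assumption: sources are already
# sorted by key ascending and, within a key, by timestamp descending.
def sort_order(entry):
    key, ts, _ = entry
    return (key, -ts)

def merge(*sources):
    """k-way merge of already-sorted sources (stand-in for a MultiIterator)."""
    return heapq.merge(*sources, key=sort_order)

def versioning(source):
    """Keep only the first (newest) entry per key, like a one-version filter."""
    last_key = object()  # sentinel that never equals a real key
    for entry in source:
        if entry[0] != last_key:
            last_key = entry[0]
            yield entry

rfile1 = [("row1", 5, "a"), ("row3", 5, "c")]
rfile2 = [("row2", 5, "b")]
inject = [("row2", 9, "injected"), ("row4", 9, "new")]

# Side channel as described in Adam's reply: merged in after the stack,
# so the versioning step never sees the injected entries.
side_channel = list(merge(versioning(merge(rfile1, rfile2)), inject))

# Merge performed above the versioning step (Dylan's original Fig. A,
# or Fig. B with the merge moved before the VersioningIterator).
merged_first = list(versioning(merge(rfile1, rfile2, inject)))

print(len(side_channel), len(merged_first))
```

In this model the two placements differ only for "row2", where the injected
entry collides with an existing key: the side-channel placement keeps both
versions, while merging before the versioning step keeps only the newer one.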
