Ah, I see. You're correct about the ordering. How different would your key be?

Another thing to consider: if you return a generated key that isn't actually in the data, your iterator needs to handle being re-seeked with a range whose start is exclusive on that generated key. If you return multiple generated keys from one input, you'd potentially have to recompute them after the re-seek and skip the ones that fall before the new start.
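To make the re-seek case concrete, here is a rough sketch in plain Java. It is only a model, not Accumulo code: keys are plain strings, a TreeMap stands in for the stored data, the method names merely mirror the iterator style, and the ".parsed" suffix and f() are made up for illustration. It assumes each generated key sorts after its stored key and before the next stored key. On seek it backs up to the stored key at or before the range start, regenerates that entry's output, and drops anything before the start.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of an iterator that emits A, A1=f(A), B, B1=f(B), ...
// Nothing here is Accumulo API; the method names only mirror its iterator style.
class GeneratingIteratorSketch {
    // Hypothetical suffix for generated keys; assumed to keep A1 between A and B.
    static final String SUFFIX = ".parsed";

    private final TreeMap<String, String> source;               // stands in for the stored, sorted data
    private final Deque<String[]> pending = new ArrayDeque<>(); // {key, value} pairs not yet returned
    private Iterator<Map.Entry<String, String>> sourceIter;

    GeneratingIteratorSketch(TreeMap<String, String> source) {
        this.source = source;
    }

    // Position at the first entry in [start, +inf) or (start, +inf).
    void seek(String start, boolean startInclusive) {
        pending.clear();
        // The start key may be a generated key that is not in the data, so back up
        // to the nearest stored key at or before it and recompute from there.
        String from = source.floorKey(start);
        if (from == null) {
            from = start;
        }
        sourceIter = source.tailMap(from, true).entrySet().iterator();
        fill();
        // Discard recomputed entries that fall before the range start.
        while (hasTop() && !inRange(getTopKey(), start, startInclusive)) {
            next();
        }
    }

    private static boolean inRange(String key, String start, boolean startInclusive) {
        int cmp = key.compareTo(start);
        return cmp > 0 || (cmp == 0 && startInclusive);
    }

    // Load the next stored entry followed by its generated entry.
    private void fill() {
        if (pending.isEmpty() && sourceIter.hasNext()) {
            Map.Entry<String, String> e = sourceIter.next();
            pending.add(new String[] {e.getKey(), e.getValue()});
            pending.add(new String[] {e.getKey() + SUFFIX, f(e.getValue())});
        }
    }

    // Stand-in for the real derivation (for example, an XPath extraction).
    private static String f(String value) {
        return value.toUpperCase();
    }

    boolean hasTop() { return !pending.isEmpty(); }
    String getTopKey() { return pending.peek()[0]; }
    String getTopValue() { return pending.peek()[1]; }

    void next() {
        pending.poll();
        fill();
    }

    // Helper: collect every key returned by a scan from the given start.
    static List<String> scanKeys(TreeMap<String, String> data, String start, boolean inclusive) {
        GeneratingIteratorSketch it = new GeneratingIteratorSketch(data);
        it.seek(start, inclusive);
        List<String> keys = new ArrayList<>();
        while (it.hasTop()) {
            keys.add(it.getTopKey());
            it.next();
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> data = new TreeMap<>();
        data.put("row1 doc", "a");
        data.put("row2 doc", "b");
        // Re-seek with an exclusive start on a generated key.
        System.out.println(scanKeys(data, "row1 doc.parsed", false)); // prints [row2 doc, row2 doc.parsed]
    }
}
```

The point of backing up with floorKey rather than seeking the source directly to the start key is that the source has no entry at a generated key, so a direct seek could skip generated entries that still belong in the range. A real iterator would need the same idea expressed against its source's seek and whatever key structure you settle on.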

On Tue, Sep 2, 2014 at 1:01 AM, Russ Weeks <[email protected]> wrote:

> Hi, William,
>
> Thanks very much for your response. I get that it's not supported or
> desirable for an Iterator to instantiate a scanner or writer. It's sort of
> analogous to opening a JDBC connection from inside a stored procedure -
> lots of reasons why that would be a bad idea. I'm more interested in the
> case where an iterator that processes input A, B, C, D might emit values A,
> A1=f(A), B, B1=f(B) etc. Under what conditions is it safe to use iterators
> this way? It seems there are at least two constraints: A1 must sort
> lexicographically between A and B (otherwise the iterator could emit data
> out of order), and A1 must be in the same row as A (otherwise A1 might
> properly be handled by a different tablet server).
>
> Seems like the consensus is to use MR for this sort of thing. I'm
> definitely keeping an eye on fluo though, looks like a very cool project!
>
> -Russ
>
>
> On Sat, Aug 30, 2014 at 12:20 AM, William Slacum <
> [email protected]> wrote:
>
>> This comes up a bit, so maybe we should add it to the FAQ (or just have
>> better information about iterators in general). The short answer is that
>> it's usually not recommended, because there aren't strong guarantees about
>> the lifetime of an iterator (so we wouldn't know when to close any
>> resources held by an iterator instance, such as batch writer thread pools)
>> and there's 0 resource management related to tablet server-to-tablet server
>> communications.
>>
>> Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"
>> Walch: https://github.com/fluo-io/fluo
>>
>> It's an implementation of Google's percolator, which provides the
>> capability to handle "new" data server side as well as transactional
>> guarantees.
>>
>>
>> On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks <[email protected]>
>> wrote:
>>
>>> There are plenty of examples of using custom iterators to filter or
>>> combine data at either the cell level or the row level. In these cases, the
>>> amount of data coming out of the iterator is less than the amount going in.
>>> What about going the other direction, using a custom iterator to generate
>>> new data based on the contents of a cell or a row? I guess this is also
>>> what a combiner does but bear with me...
>>>
>>> The immediately obvious use case is parsing. Suppose one cell in my row
>>> holds an XML document. I'd like to configure an iterator with an XPath
>>> expression to pull a field out of the document, so that I can leverage the
>>> distributed processing of the cluster instead of parsing the doc on the
>>> scanner-side.
>>>
>>> I'm sure there are constraints or things to watch out for, does anybody
>>> have any recommendations here? For instance, the generated cells would
>>> probably have to be in the same row as the input cells?
>>>
>>> I'm using MapReduce to satisfy all these use cases right now but I'm
>>> interested to know how much of my code could be ported to Iterators.
>>>
>>> Thanks!
>>> -Russ
>>>
>>
>>
>
