Hi, William,

Thanks very much for your response. I get that it's not supported or desirable for an Iterator to instantiate a scanner or writer. It's sort of analogous to opening a JDBC connection from inside a stored procedure - lots of reasons why that would be a bad idea.

I'm more interested in the case where an iterator that processes input A, B, C, D might emit values A, A1=f(A), B, B1=f(B) etc. Under what conditions is it safe to use iterators this way? It seems there are at least two constraints: A1 must sort lexicographically between A and B (otherwise the iterator could emit data out of order), and A1 must be in the same row as A (otherwise A1 might properly be handled by a different tablet server).

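To make those two constraints concrete, here is a minimal Python sketch. It is only a model of the idea, not Accumulo iterator code: `Key`, `expand`, and `upper` are hypothetical names, keys are compared as plain (row, family, qualifier) tuples of ASCII strings, and visibility and timestamp are ignored.

```python
from typing import Callable, Iterable, Iterator, List, NamedTuple, Tuple


class Key(NamedTuple):
    """Simplified stand-in for a sorted key (assumption: no visibility or timestamp)."""
    row: str
    family: str
    qualifier: str


Cell = Tuple[Key, str]


def expand(cells: Iterable[Cell], derive: Callable[[Key, str], Cell]) -> Iterator[Cell]:
    """Yield each input cell followed by one derived cell.

    Raises ValueError if a derived key leaves the input row or would
    break the sorted order of the output stream.
    """
    buffered: List[Cell] = list(cells)  # materialized only to peek at the next key
    for i, (key, value) in enumerate(buffered):
        yield key, value
        new_key, new_value = derive(key, value)
        next_key = buffered[i + 1][0] if i + 1 < len(buffered) else None

        # Constraint 1: the derived cell stays in the same row as its input.
        if new_key.row != key.row:
            raise ValueError(f"{new_key} is not in row {key.row!r}")

        # Constraint 2: the derived key sorts strictly between this key and the next.
        if not key < new_key or (next_key is not None and not new_key < next_key):
            raise ValueError(f"{new_key} would be emitted out of order")

        yield new_key, new_value


def upper(key: Key, value: str) -> Cell:
    """Hypothetical f(): a suffixed qualifier sorts right after the original."""
    return Key(key.row, key.family, key.qualifier + ".upper"), value.upper()


cells = [
    (Key("r1", "doc", "a"), "alpha"),
    (Key("r1", "doc", "b"), "beta"),
    (Key("r2", "doc", "a"), "gamma"),
]
result = list(expand(cells, upper))
for key, value in result:
    print(key, value)
```

Appending a suffix to the qualifier is just one way to keep A1 next to A. It can still collide with an existing key that shares the same prefix, which is why the sketch checks the order instead of assuming it.
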
Seems like the consensus is to use MR for this sort of thing. I'm definitely keeping an eye on Fluo though, looks like a very cool project!

-Russ

On Sat, Aug 30, 2014 at 12:20 AM, William Slacum <[email protected]> wrote:

> This comes up a bit, so maybe we should add it to the FAQ (or just have
> better information about iterators in general). The short answer is that
> it's usually not recommended, because there aren't strong guarantees about
> the lifetime of an iterator (so we wouldn't know when to close any
> resources held by an iterator instance, such as batch writer thread pools)
> and there's 0 resource management related to tablet server-to-tablet server
> communications.
>
> Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"
> Walch: https://github.com/fluo-io/fluo
>
> It's an implementation of Google's Percolator, which provides the
> capability to handle "new" data server side as well as transactional
> guarantees.
>
>
> On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks <[email protected]>
> wrote:
>
>> There are plenty of examples of using custom iterators to filter or
>> combine data at either the cell level or the row level. In these cases, the
>> amount of data coming out of the iterator is less than the amount going in.
>> What about going the other direction, using a custom iterator to generate
>> new data based on the contents of a cell or a row? I guess this is also
>> what a combiner does but bear with me...
>>
>> The immediately obvious use case is parsing. Suppose one cell in my row
>> holds an XML document. I'd like to configure an iterator with an XPath
>> expression to pull a field out of the document, so that I can leverage the
>> distributed processing of the cluster instead of parsing the doc on the
>> scanner-side.
>>
>> I'm sure there are constraints or things to watch out for, does anybody
>> have any recommendations here? For instance, the generated cells would
>> probably have to be in the same row as the input cells?
>>
>> I'm using MapReduce to satisfy all these use cases right now but I'm
>> interested to know how much of my code could be ported to Iterators.
>>
>> Thanks!
>> -Russ
