For clarification, are you trying to create DAOs from the Key/Value pairs fed to a Mapper by AccumuloInputFormat, or are you trying to process a different data set while simultaneously querying your DAOs?
> Today I had a really nice conversation with billie and vines on #accumulo. This email is a followup to that conversation, and there's a little more context of my problem here.
>
> We have an application that we've developed independently from MapReduce. To get away from the low-level keys and values of Accumulo, we quickly made a series of DAOs that each take in an Accumulo Instance as a constructor argument. These DAOs internally create the necessary scanners and return domain-specific objects. I imagine this is a common practice.
>
> Now, we've got a feature that needs to operate on all the data, so we're doing some MapReduce. I think I understand now the architecture of AccumuloInputFormat from discussions on #accumulo. What I didn't discuss was whether it was reasonable (or not reasonable because of the performance cost) to try to use one of our DAOs within a mapper.
>
> The mappers need to operate per row, and our system has potentially billions of rows. With my DAOs, I can reuse the same Accumulo instance, but each call will create a new scanner from my instance, so a MapReduce job using a DAO in the mappers will potentially create billions of scanners over the course of operation. However, the way we've designed these DAOs, it's easy to make sure all accesses are tied to the row the mapper is tasked with (in an attempt to maintain data locality).
>
> By comparison, I feel the AccumuloInputFormat will create about as many Accumulo scanners as there are tablet servers, so dramatically fewer.
>
> Our current thinking is that creating billions of scanners with these DAO accesses might cost too much in performance, but we're not completely sure this is the case with respect to the kind of caching Accumulo does with its clients.
>
> If the performance cost is indeed too high, then we're going to have to deal with the abstraction challenge of trying to avoid code duplication between our DAOs and our MapReduce jobs.
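
On the code-duplication point, one possible approach is to pull the "entries of one row -> domain object" logic out of the DAO into a function that takes a plain iterator. The DAO could then feed it from its own scanner, and a mapper could feed it from whatever the input format hands over, without creating a scanner per row. Below is a rough sketch only. Everything in it is hypothetical: `Person`, `decodeRow`, `sampleRow`, and the "name"/"email" column names are made up for illustration, and `Map.Entry<String, String>` merely stands in for Accumulo's key/value entries so the example runs without Accumulo on the classpath.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class RowDecoderSketch {

    // Hypothetical domain object; stands in for whatever your DAOs return.
    public static final class Person {
        public final String rowId;
        public String name;
        public String email;

        public Person(String rowId) {
            this.rowId = rowId;
        }
    }

    // Shared decoding logic: builds one domain object from the entries of a
    // single row. It only sees an iterator, so it does not know (or care)
    // whether the entries came from a DAO's scanner or from a mapper's input.
    // Assumption: the entry key is a column name and the entry value is the cell.
    public static Person decodeRow(String rowId,
                                   Iterator<? extends Map.Entry<String, String>> entries) {
        Person person = new Person(rowId);
        while (entries.hasNext()) {
            Map.Entry<String, String> entry = entries.next();
            if ("name".equals(entry.getKey())) {
                person.name = entry.getValue();
            } else if ("email".equals(entry.getKey())) {
                person.email = entry.getValue();
            }
            // Unknown columns are ignored in this sketch.
        }
        return person;
    }

    // Made-up sample data standing in for the entries of one row.
    public static List<Map.Entry<String, String>> sampleRow() {
        List<Map.Entry<String, String>> row = new ArrayList<Map.Entry<String, String>>();
        row.add(new SimpleEntry<String, String>("name", "Ada"));
        row.add(new SimpleEntry<String, String>("email", "ada@example.org"));
        return row;
    }

    public static void main(String[] args) {
        Person person = decodeRow("row1", sampleRow().iterator());
        System.out.println(person.rowId + " " + person.name + " " + person.email);
    }
}
```

With a split like this, the DAO keeps its scanner handling and delegates to the decoding function, while the MapReduce job reuses the same function on the entries it already receives. Whether that fits depends on how much of each DAO is decoding versus query logic.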
