The benefit of using a BatchScanner in the AccumuloRowInputFormat is that it can fetch multiple ranges in parallel within each Mapper. This may help you manage your MapReduce job resources a bit better (see the discussion in the JIRA issue for details). If you don't need it, I wouldn't enable that option. If you have to use it because of performance issues, you can mitigate the row-splitting problem with the WholeRowIterator, but that comes with its own performance implications. You might also be able to mitigate it by merging the pieces of a row that arrive as multiple rows, either with a Combiner or in your Reducer.
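As a rough illustration of the Reducer-side mitigation, here is a minimal sketch in plain Java with no Accumulo dependency. The class RowMergeSketch, its two methods, and the {rowId, value} string arrays are hypothetical stand-ins and not Accumulo API. groupConsecutive only approximates the way consecutive entries with the same row ID get grouped into one "row", and mergeByRow shows the kind of merge a Reducer keyed on the row ID could perform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowMergeSketch {

    // Groups consecutive entries that share a row ID. Each entry is a
    // hypothetical {rowId, value} pair. If entries of different rows are
    // interleaved, one logical row ends up in several groups.
    static List<List<String[]>> groupConsecutive(List<String[]> entries) {
        List<List<String[]>> groups = new ArrayList<>();
        String currentRow = null;
        for (String[] entry : entries) {
            if (groups.isEmpty() || !entry[0].equals(currentRow)) {
                groups.add(new ArrayList<>());
                currentRow = entry[0];
            }
            groups.get(groups.size() - 1).add(entry);
        }
        return groups;
    }

    // Merges row fragments back together by row ID, which is what a
    // Reducer keyed on the row ID could do with the split rows.
    static Map<String, List<String>> mergeByRow(List<List<String[]>> fragments) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (List<String[]> fragment : fragments) {
            for (String[] entry : fragment) {
                merged.computeIfAbsent(entry[0], k -> new ArrayList<>()).add(entry[1]);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Interleaved order, as an unordered scan might return it.
        List<String[]> interleaved = Arrays.asList(
                new String[] {"row1", "a"},
                new String[] {"row2", "x"},
                new String[] {"row1", "b"});

        List<List<String[]>> fragments = groupConsecutive(interleaved);
        Map<String, List<String>> merged = mergeByRow(fragments);

        System.out.println("fragments seen: " + fragments.size());
        System.out.println("rows after merge: " + merged.size());
    }
}
```

The merge needs all fragments of a row to reach the same place, which is why it belongs in a Combiner or Reducer rather than in the Mapper that saw only part of the row.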
On Thu, Dec 1, 2016 at 1:51 AM Massimilian Mattetti <massi...@il.ibm.com> wrote:
> I see, so the only solution here would be either to use a WholeRowIterator
> or to avoid enabling the BatchScanner. Since each executor will work on a
> single tablet I guess that the benefit of using a BatchScanner is that it
> can fetch multiple ranges over the same tablet in parallel, am I correct?
> Thanks,
> Max
>
> From: Christopher <ctubb...@apache.org>
> To: email@example.com
> Date: 30/11/2016 18:48
> Subject: Re: BatchScanner behavior with AccumuloRowInputFormat
> ------------------------------
>
> You'd only have to worry about this behavior if you set
> RowInputFormat.setBatchScan(job, true), available since 1.7.0.
> By default, our InputFormats use a regular Accumulo Scanner.
>
> See https://issues.apache.org/jira/browse/ACCUMULO-3602 and
> https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)
>
> On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti
> <massi...@il.ibm.com> wrote:
> Hi all,
>
> as you already know, the AccumuloRowInputFormat is internally using a
> RowIterator for iterating over all the key value pairs of a single row. In
> the past when I was using the RowIterator together with a BatchScanner I
> had the problem of a single row being split into multiple rows due to the
> fact that a BatchScanner can interleave key-value pairs of different rows.
> Should I expect the same behavior when using the AccumuloRowInputFormat
> with a BatchScanner (enabled via setBatchScan)?
> Thanks,
> Max