Re: BatchScanner behavior with AccumuloRowInputFormat

Christopher Thu, 01 Dec 2016 14:19:48 -0800

The benefit of using a BatchScanner in the AccumuloRowInputFormat is that
it can fetch multiple ranges in parallel within each Mapper. This may help
you manage your MapReduce job resources a bit better (see the discussion in
the JIRA issue for details). If you don't need it, I wouldn't enable that
option. If you have to use it because of performance issues, then you can
mitigate the row-splitting problem using the WholeRowIterator, but that
comes with its own performance implications, since each row has to be
buffered in memory before it is returned. You might also be able to
mitigate it by resolving the single-row-represented-as-multiple-rows
problem with a Combiner or in your Reducer.
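
To illustrate the Reducer-side option, here is a minimal, self-contained sketch in plain Java. It does not use the Accumulo or Hadoop APIs; the class RowFragmentMerger, its Fragment type, and the column names are hypothetical stand-ins for whatever your Mapper emits. The assumption is that each fragment of a row is emitted keyed by its row ID, so that the shuffle brings all fragments of the same row together and they can be merged into one logical row.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowFragmentMerger {

    /** One fragment of a row: the row ID plus the columns seen in that fragment. */
    public static final class Fragment {
        final String rowId;
        final Map<String, String> columns;

        public Fragment(String rowId, Map<String, String> columns) {
            this.rowId = rowId;
            this.columns = columns;
        }
    }

    /**
     * Merges all fragments that share a row ID into a single column map,
     * which is what a Reducer keyed by row ID would do after the shuffle.
     */
    public static Map<String, Map<String, String>> merge(List<Fragment> fragments) {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            rows.computeIfAbsent(f.rowId, k -> new TreeMap<>()).putAll(f.columns);
        }
        return rows;
    }

    public static void main(String[] args) {
        // "row1" arrives as two separate fragments with "row2" in between.
        List<Fragment> fragments = List.of(
                new Fragment("row1", Map.of("cf:a", "1")),
                new Fragment("row2", Map.of("cf:a", "2")),
                new Fragment("row1", Map.of("cf:b", "3")));

        // prints {row1={cf:a=1, cf:b=3}, row2={cf:a=2}}
        System.out.println(merge(fragments));
    }
}
```

Whether this is cheaper than the WholeRowIterator depends on how many fragments you get per row and how much data the shuffle has to move, so treat it as a starting point rather than a recommendation.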


On Thu, Dec 1, 2016 at 1:51 AM Massimilian Mattetti <[email protected]>
wrote:

> I see, so the only solution here would be either to use a WholeRowIterator
> or to avoid enabling the BatchScanner. Since each executor will work on a
> single tablet I guess that the benefit of using a BatchScanner is that it
> can fetch multiple ranges over the same tablet in parallel, am I correct?
> Thanks,
> Max
>
>
>
>
> From:        Christopher <[email protected]>
> To:        [email protected]
> Date:        30/11/2016 18:48
> Subject:        Re: BatchScanner behavior with AccumuloRowInputFormat
> ------------------------------
>
>
>
> You'd only have to worry about this behavior if you set
> AccumuloRowInputFormat.setBatchScan(job, true), available since 1.7.0.
> By default, our InputFormats use a regular Accumulo Scanner.
>
> See https://issues.apache.org/jira/browse/ACCUMULO-3602 and
> https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)
>
>
> On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti <[email protected]>
> wrote:
> Hi all,
>
> As you already know, the AccumuloRowInputFormat internally uses a
> RowIterator to iterate over all the key-value pairs of a single row. In
> the past, when I was using the RowIterator together with a BatchScanner, I
> had the problem of a single row being split into multiple rows, because
> a BatchScanner can interleave key-value pairs of different rows.
> Should I expect the same behavior when using the AccumuloRowInputFormat
> with a BatchScanner (enabled via setBatchScan)?
> Thanks,
> Max
>
>
>
>
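
For readers who have not hit the row-splitting problem described in the original question, the following plain-Java sketch reproduces the effect without any Accumulo dependency. The class ConsecutiveRowGrouping and its method are hypothetical; they only mimic the general idea of grouping consecutive entries that share a row ID, which is the assumption a row-grouping iterator makes about sorted input. With sorted input every row forms one group; once entries of different rows are interleaved, the same row shows up as several groups.

```java
import java.util.ArrayList;
import java.util.List;

public class ConsecutiveRowGrouping {

    /** Splits a stream of row IDs into runs of equal, consecutive IDs. */
    public static List<List<String>> groupConsecutive(List<String> rowIds) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String rowId : rowIds) {
            if (!rowId.equals(previous)) {
                // The row ID changed, so a new "row" starts here.
                groups.add(new ArrayList<>());
                previous = rowId;
            }
            groups.get(groups.size() - 1).add(rowId);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sorted input, as a regular Scanner returns it: 2 groups.
        System.out.println(groupConsecutive(List.of("row1", "row1", "row2")).size());

        // Interleaved input, as in the question: "row1" is split, 3 groups.
        System.out.println(groupConsecutive(List.of("row1", "row2", "row1")).size());
    }
}
```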
