Re: BatchScanner behavior with AccumuloRowInputFormat

2016-12-01 Thread Christopher
The benefit of using a BatchScanner in the AccumuloRowInputFormat is that
it can fetch multiple ranges in parallel within each Mapper. This may be
able to help you manage your MapReduce job resources a bit better (see the
discussion in the JIRA issue for details). If you don't need to use it, I
wouldn't use that option. If you have to use it because of performance
issues, then you can mitigate the row-splitting problem using the
WholeRowIterator, but that will come with its own performance implications.
You might also be able to mitigate it by merging the fragments of each
split row back into a single row with a Combiner or in your Reducer.
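
As a rough illustration of the "fix it in your Reducer" idea, here is a
minimal sketch in plain Java. It does not use the Accumulo or Hadoop APIs:
`RowFragmentMerger`, `mergeFragments`, and `sampleFragments` are hypothetical
names, and the map keyed by row ID only models what the shuffle and a Reducer
would do if each Mapper emitted the row ID as its output key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowFragmentMerger {

    /**
     * Merges row fragments that share a row ID into one list of values per row.
     * Each fragment is a (row ID, values) pair; this models a Reducer keyed by row ID.
     */
    public static Map<String, List<String>> mergeFragments(
            List<Map.Entry<String, List<String>>> fragments) {
        Map<String, List<String>> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> fragment : fragments) {
            // Append this fragment's values to whatever was already seen for the row.
            merged.computeIfAbsent(fragment.getKey(), k -> new ArrayList<>())
                  .addAll(fragment.getValue());
        }
        return merged;
    }

    /** Assumed sample input: two rows, each delivered as two separate fragments. */
    public static List<Map.Entry<String, List<String>>> sampleFragments() {
        List<Map.Entry<String, List<String>>> fragments = new ArrayList<>();
        fragments.add(new AbstractMap.SimpleEntry<>("row1", Arrays.asList("a")));
        fragments.add(new AbstractMap.SimpleEntry<>("row2", Arrays.asList("c")));
        fragments.add(new AbstractMap.SimpleEntry<>("row1", Arrays.asList("b")));
        fragments.add(new AbstractMap.SimpleEntry<>("row2", Arrays.asList("d")));
        return fragments;
    }

    public static void main(String[] args) {
        // prints {row1=[a, b], row2=[c, d]}
        System.out.println(mergeFragments(sampleFragments()));
    }
}
```

Values within a merged row appear in arrival order here, so a real job
would need to sort them if column order matters.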

On Thu, Dec 1, 2016 at 1:51 AM Massimilian Mattetti <massi...@il.ibm.com>
wrote:

> I see, so the only solution here would be either to use a WholeRowIterator
> or to avoid enabling the BatchScanner. Since each executor will work on a
> single tablet, I guess the benefit of using a BatchScanner is that it can
> fetch multiple ranges over the same tablet in parallel. Am I correct?
> Thanks,
> Max
>
> From: Christopher <ctubb...@apache.org>
> To: user@accumulo.apache.org
> Date: 30/11/2016 18:48
> Subject: Re: BatchScanner behavior with AccumuloRowInputFormat
>
> You'd only have to worry about this behavior if you set
> AccumuloRowInputFormat.setBatchScan(job, true), available since 1.7.0.
> By default, our InputFormats use a regular Accumulo Scanner.
>
> See https://issues.apache.org/jira/browse/ACCUMULO-3602 and
> https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)
>
>
> On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti
> <massi...@il.ibm.com> wrote:
> Hi all,
>
> as you already know, the AccumuloRowInputFormat internally uses a
> RowIterator to iterate over all the key-value pairs of a single row. In
> the past, when I used the RowIterator together with a BatchScanner, I had
> the problem of a single row being split into multiple rows, because a
> BatchScanner can interleave key-value pairs of different rows.
> Should I expect the same behavior when using the AccumuloRowInputFormat
> with a BatchScanner (enabled via setBatchScan)?
> Thanks,
> Max


Re: BatchScanner behavior with AccumuloRowInputFormat

2016-11-30 Thread Massimilian Mattetti
I see, so the only solution here would be either to use a WholeRowIterator
or to avoid enabling the BatchScanner. Since each executor will work on a
single tablet, I guess the benefit of using a BatchScanner is that it can
fetch multiple ranges over the same tablet in parallel. Am I correct?
Thanks,
Max




From: Christopher <ctubb...@apache.org>
To: user@accumulo.apache.org
Date: 30/11/2016 18:48
Subject: Re: BatchScanner behavior with AccumuloRowInputFormat



You'd only have to worry about this behavior if you set
AccumuloRowInputFormat.setBatchScan(job, true), available since 1.7.0.
By default, our InputFormats use a regular Accumulo Scanner.

See https://issues.apache.org/jira/browse/ACCUMULO-3602 and 
https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)


On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti <massi...@il.ibm.com> 
wrote:
Hi all,

as you already know, the AccumuloRowInputFormat internally uses a
RowIterator to iterate over all the key-value pairs of a single row. In
the past, when I used the RowIterator together with a BatchScanner, I had
the problem of a single row being split into multiple rows, because a
BatchScanner can interleave key-value pairs of different rows.
Should I expect the same behavior when using the AccumuloRowInputFormat 
with a BatchScanner (enabled via setBatchScan)?
Thanks,
Max






Re: BatchScanner behavior with AccumuloRowInputFormat

2016-11-30 Thread Christopher
You'd only have to worry about this behavior if you set
AccumuloRowInputFormat.setBatchScan(job, true), available since 1.7.0.
By default, our InputFormats use a regular Accumulo Scanner.

See https://issues.apache.org/jira/browse/ACCUMULO-3602 and
https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)


On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti wrote:

Hi all,

as you already know, the AccumuloRowInputFormat internally uses a
RowIterator to iterate over all the key-value pairs of a single row. In
the past, when I used the RowIterator together with a BatchScanner, I had
the problem of a single row being split into multiple rows, because a
BatchScanner can interleave key-value pairs of different rows.
Should I expect the same behavior when using the AccumuloRowInputFormat
with a BatchScanner (enabled via setBatchScan)?
Thanks,
Max
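
To make the row-splitting effect described in the question concrete, here is
a minimal, self-contained sketch in plain Java. It does not use the Accumulo
API: the `Entry` class, `groupConsecutive`, and the two sample streams are
hypothetical stand-ins, and the interleaved order is only an assumed example
of what a BatchScanner might return. The grouping logic starts a new row
whenever the row ID changes, which is only safe when the input is sorted by
row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InterleavedRowDemo {

    /** Stand-in for an Accumulo key-value pair: just a row ID and a value. */
    public static final class Entry {
        public final String row;
        public final String value;

        public Entry(String row, String value) {
            this.row = row;
            this.value = value;
        }
    }

    /** Groups consecutive entries that share a row ID into one "row". */
    public static List<List<Entry>> groupConsecutive(List<Entry> stream) {
        List<List<Entry>> groups = new ArrayList<>();
        List<Entry> current = null;
        for (Entry e : stream) {
            // A change of row ID is treated as the start of a new row.
            if (current == null || !current.get(0).row.equals(e.row)) {
                current = new ArrayList<>();
                groups.add(current);
            }
            current.add(e);
        }
        return groups;
    }

    /** Entries in sorted order, as a regular Scanner returns them. */
    public static List<Entry> sortedStream() {
        return Arrays.asList(
            new Entry("row1", "a"), new Entry("row1", "b"),
            new Entry("row2", "c"), new Entry("row2", "d"));
    }

    /** The same entries with the two rows interleaved (assumed order). */
    public static List<Entry> interleavedStream() {
        return Arrays.asList(
            new Entry("row1", "a"), new Entry("row2", "c"),
            new Entry("row1", "b"), new Entry("row2", "d"));
    }

    public static void main(String[] args) {
        // prints "sorted: 2 groups"
        System.out.println("sorted: " + groupConsecutive(sortedStream()).size() + " groups");
        // prints "interleaved: 4 groups"
        System.out.println("interleaved: " + groupConsecutive(interleavedStream()).size() + " groups");
    }
}
```

With the sorted stream the two rows come back as two groups; with the
interleaved stream the same four entries come back as four fragments.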