Tim,
It's pretty interesting to read. I once dug into it for another user around
here. Check out this archive post:
http://search-hadoop.com/m/cRmJ3gTtN32 - Make sure to also read the
LineReader sources (a layer under the LineRecordReader explained
above), where we also can see the beyond-block-boundary
Thanks for the explanation HJ - I always meant to look into that bit of
code to work out how it did it.
Tim
On Wed, Sep 19, 2012 at 6:24 PM, Harsh J wrote:
Hi Tim,
Splits don't look at newlines, at least in TextInputFormat. So, since
the number of computed splits exceeds the default number of maps, I think
a perfect file of 10 blocks will spawn only 10 mappers. The mapper's record
reader is the one that reads until a newline (even past the end of
its block length bytes)
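
To make the behaviour described above concrete, here is a minimal Python sketch. It is a simplified model, not the actual Hadoop code: the names compute_splits and read_split are hypothetical, and it assumes newline-terminated records in an in-memory byte string. Splits are cut purely by byte offset, and each reader skips the partial line at its start and reads past its end until the next newline.

```python
import io


def compute_splits(file_size, split_size):
    """Cut [0, file_size) into byte ranges; newlines are ignored entirely."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(data, start, end):
    """Return the lines owned by the split [start, end) (hypothetical helper)."""
    stream = io.BytesIO(data)
    stream.seek(start)
    if start != 0:
        # Not the first split: whatever line is in progress at `start`
        # is read by the previous split's reader, so discard it here.
        stream.readline()
    lines = []
    # A line belongs to this split if it begins at or before `end`;
    # reading it may run past the split boundary.
    while stream.tell() <= end:
        line = stream.readline()
        if not line:
            break  # end of file
        lines.append(line.rstrip(b"\n"))
    return lines


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\necho\n"
    for start, end in compute_splits(len(data), 10):
        print((start, end), read_split(data, start, end))
```

Under these assumptions, every line is read by exactly one split even though the boundaries fall mid-line, and a 640MB file with a 64MB split size yields 10 byte ranges.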
I think the splitting recognises the end of line, so you might get 11 but
otherwise that looks correct.
On Wed, Sep 19, 2012 at 5:42 PM, Pedro Sá da Costa wrote:
>
>
> If I have an input file of 640MB in size, and a split size of 64MB, this
> file will be partitioned into 10 splits, and each split