Not in the case of .gz files. Since there is no splitting done, the mapper
will possibly read 128 MB locally from a resident DN, and could then
read the remaining 128 MB over the network from another DN if the next
block does not reside on the same DN as well, thereby introducing a
network read cost.
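
For what it's worth, a minimal sketch of the check an input format like TextInputFormat performs when deciding whether a file can be split (assuming a Hadoop version that ships SplittableCompressionCodec; the path below is made up). A .gz file resolves to GzipCodec, which is not splittable, so the whole file is handed to a single mapper no matter how many blocks it spans:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;
  import org.apache.hadoop.io.compress.SplittableCompressionCodec;

  public class SplitCheck {
    // Returns true if the file would be split into multiple map inputs.
    public static boolean isSplittable(Configuration conf, Path file) {
      CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
      if (codec == null) {
        return true; // no codec matched (plain text etc.) -> splittable per block
      }
      // GzipCodec does not implement SplittableCompressionCodec,
      // so a .gz file goes to one mapper.
      return codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // hypothetical path, for illustration only
      System.out.println(isSplittable(conf, new Path("/data/input/part-0001.gz")));
    }
  }
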
Yes. But with 128M gzip files and a 128M block size, the M/R will work better, no?
Anyhow, thanks for the useful information.
On Thu, Mar 17, 2011 at 5:07 PM, Harsh J wrote:
> On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter wrote:
> > Currently each gzip file is about 250MB (*60files=15G) so we have 256M
> > blocks.
On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter wrote:
> Currently each gzip file is about 250MB (*60files=15G) so we have 256M
> blocks.
Darn, I ought to sleep a bit more. I did a file/gb and read it as gb/file mehh..
>
> However, I understand that in order to better utilize M/R parallel processing,
> smaller files/blocks are better.
Currently each gzip file is about 250MB (*60files=15G) so we have 256M
blocks.
However, I understand that in order to better utilize M/R parallel processing,
smaller files/blocks are better.
So maybe having 128M gzip files with a corresponding 128M block size would be
better?
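
For reference, the block size can also be set per file at write time rather than cluster-wide, so ~128M gzip parts could each sit in a single 128M block regardless of the default. A rough sketch (the output path is hypothetical; the property name for -D style overrides differs by version: dfs.block.size in 0.20.x, dfs.blocksize in later releases):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      long blockSize = 128L * 1024 * 1024;            // 128M blocks for this file only
      short replication = fs.getDefaultReplication(); // keep the cluster default
      int bufferSize = 4096;

      Path out = new Path("/data/input/events-0001.gz"); // hypothetical path
      FSDataOutputStream stream =
          fs.create(out, true, bufferSize, replication, blockSize);
      // ... write the gzipped bytes here ...
      stream.close();
    }
  }
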
On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter wrote:
> Hi,
> If I have is big gzip files (>>block size) does the M/R will split a single
> file to multiple blocks and send them to different mappers ?
> The behavior I currently see is that a map is still open per file (and not
> per block).
Yes
Hi,
If I have big gzip files (>> block size), will the M/R split a single
file into multiple blocks and send them to different mappers?
The behavior I currently see is that a map is still opened per file (and not
per block).
I would also appreciate it if you could share your experience in defining the block size.
We have altogether 15G of data to process every day (multiple M/R jobs running
on the same set of data).
Currently we split this data into 60 files (but we can also split it into 120
files).
We have 15 machines with quad-core CPUs.
Thanks,
Lior
On Thu, Mar 17, 2011 at 11:01 AM, Harsh J wrote:
> 15 G single Gzip files? Consider block sizes of 0.5 GB+.
15 G single Gzip files? Consider block sizes of 0.5 GB+. But it also
depends on the processing slot-power you have. Higher block sizes would
lead to higher usage of processing capacity, although with a higher load
on the NameNode in maintaining lots of blocks (and the replicas of each) per
file.
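
For a rough sense of the NameNode side of that trade-off (assuming the default replication factor of 3): a single 15G file at a 512M block size is about 30 blocks, i.e. roughly 90 block replicas to track, while the same file at a 128M block size is about 120 blocks and roughly 360 replicas.
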
Hi,
We plan a 100T cluster with M/R jobs running on 15G gzip files.
Should we configure the HDFS block size to be 128M or 256M?
Thanks,
Lior