Re: hdsf block size cont.

2011-03-17 Thread Harsh J
Not in the case of .gz files. Since no splitting is done, the mapper will possibly read 128 MB locally from a resident DN, and could then read the remaining 128 MB over the network from another DN (if the next block does not reside on the same DN as well), thereby introducing a network read cost.
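To make that cost concrete, here is a back-of-the-envelope sketch (plain Python, not the Hadoop API; the file and block sizes are simply the figures quoted in this thread) of how many bytes the single gzip mapper would pull over the network when the file spans multiple blocks:

```python
def remote_bytes_for_gzip_map(file_size_mb, block_size_mb, local_blocks=1):
    """A gzip file is not splittable, so one mapper reads the whole file.
    Only the block(s) resident on the mapper's own DataNode are local reads;
    every remaining byte must come over the network."""
    total_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    remote_blocks = max(total_blocks - local_blocks, 0)
    # The last block may be partial, so count actual remaining bytes.
    remote_mb = max(file_size_mb - local_blocks * block_size_mb, 0)
    return total_blocks, remote_blocks, remote_mb

# A 256 MB gzip with a 128 MB block size: one local block, 128 MB remote.
print(remote_bytes_for_gzip_map(256, 128))  # (2, 1, 128)
# The same file with a 256 MB block size fits in one block: no remote read.
print(remote_bytes_for_gzip_map(250, 256))  # (1, 0, 0)
```

This is the trade-off discussed below: matching the block size to the (unsplittable) file size keeps each map task's read fully local.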

Re: hdsf block size cont.

2011-03-17 Thread Lior Schachter
Yes, but with 128M gzip files and a 128M block size the M/R will work better, no? Anyhow, thanks for the useful information. On Thu, Mar 17, 2011 at 5:07 PM, Harsh J wrote: > On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter > wrote: > > Currently each gzip file is about 250MB (*60files=15G) so we have 2

Re: hdsf block size cont.

2011-03-17 Thread Harsh J
On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter wrote: > Currently each gzip file is about 250MB (*60files=15G) so we have 256M > blocks. Darn, I ought to sleep a bit more. I did a file/GB and read it as GB/file, meh. > > However I understand that in order to utilize better M/R parallel process

Re: hdsf block size cont.

2011-03-17 Thread Lior Schachter
Currently each gzip file is about 250MB (*60files=15G) so we have 256M blocks. However, I understand that in order to better utilize M/R parallel processing, smaller files/blocks are better. So maybe having 128M gzip files with a corresponding 128M block size would be better? On Thu, Mar 17, 2011 a

Re: hdsf block size cont.

2011-03-17 Thread Harsh J
On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter wrote: > Hi, > If I have big gzip files (>> block size), will M/R split a single > file into multiple blocks and send them to different mappers? > The behavior I currently see is that a map is still opened per file (and not > per block). Yes

hdsf block size cont.

2011-03-17 Thread Lior Schachter
Hi, If I have big gzip files (>> block size), will M/R split a single file into multiple blocks and send them to different mappers? The behavior I currently see is that a map is still opened per file (and not per block). I will also appreciate it if you can share your experience in defining

Re: hdsf block size

2011-03-17 Thread Lior Schachter
We have altogether 15G of data to process every day (multiple M/R jobs running on the same set of data). Currently we split this data into 60 files (but we can also split it into 120 files). We have 15 machines with quad cores. Thanks, Lior On Thu, Mar 17, 2011 at 11:01 AM, Harsh J wrote: > 15 G sin
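Since each unsplittable gzip file gets exactly one map task, the numbers above translate directly into scheduling "waves". A rough capacity sketch (plain Python, using only the figures quoted in this thread; one map slot per core is my assumption, not something stated here):

```python
import math

def map_waves(num_files, machines, slots_per_machine):
    """Unsplittable gzip input means one map task per file. Waves = how
    many rounds of concurrent maps are needed to run all tasks."""
    total_slots = machines * slots_per_machine
    return math.ceil(num_files / total_slots)

# 60 files on 15 quad-core nodes, assuming one map slot per core:
print(map_waves(60, 15, 4))   # 1 wave: every file maps concurrently
print(map_waves(120, 15, 4))  # 2 waves: the smaller files double the rounds
```

So splitting into 120 files doubles the map count but also doubles the scheduling rounds at this cluster size; the per-task work halves either way.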

Re: hdsf block size

2011-03-17 Thread Harsh J
15 G single gzip files? Consider block sizes in the 0.5 GB+ range. But it also depends on the processing slot-power you have. Larger blocks would lead to higher usage of processing capacity, although with a higher load on the NameNode in maintaining lots of blocks (and replicas of each) per file. On Thu, Mar 17,

hdsf block size

2011-03-17 Thread Lior Schachter
Hi, We plan a 100T cluster with M/R jobs running on 15G gzip files. Should we configure the HDFS block size to be 128M or 256M? Thanks, Lior
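For reference, the cluster-wide default block size is set in hdfs-site.xml. A sketch for the Hadoop releases of this era (the property is dfs.block.size, in bytes; it can also be overridden per file at write time, e.g. with -D dfs.block.size=... on a put):

```xml
<!-- hdfs-site.xml: default block size for newly written files -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB; use 268435456 for 256 MB -->
</property>
```

Note the setting only affects files written after the change; existing files keep the block size they were written with.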