Hello Harsh,

Thanks for the useful feedback. You were right: my map tasks do open additional files from HDFS. The catch was that thousands of map tasks were being created, and each of them was repeatedly reading the same files from HDFS, which ultimately dominated the job execution time. I increased the minimum split size for the job, which reduced the number of map tasks it spawns.
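For the archives, the driver-side change boils down to something like the following. This is only a minimal sketch against the new mapreduce API; the class name, the 512 MB value and the argument handling are placeholders, not my actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "iteration"); // Job.getInstance(conf, ...) on newer releases

        // Raise the minimum split size (in bytes) so that fewer, larger input
        // splits are created, which in turn means fewer map tasks.
        // The 512 MB figure is only illustrative.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... remaining setup (mapper, reducer, key/value classes, formats) ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same effect can be had by setting the mapred.min.split.size property (mapreduce.input.fileinputformat.split.minsize on newer releases) in the job configuration.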
Cheers,
Jim

On Thu, May 16, 2013 at 2:56 AM, Harsh J <[email protected]> wrote:
> Hi Jim,
>
> The counters you're looking at are counted at the FileSystem interface
> level, not at the more specific Task level (which have "map input
> bytes" and such).
>
> This means that if your map or reduce code is opening side-files/using
> a FileSystem object to read extra things, the count will go up as
> expected.
>
> For simple input and output size validation of a job, minus anything
> the code does on top, it's better to look at the "map/reduce
> input/output bytes" form of counters instead.
>
> On Tue, May 14, 2013 at 10:41 PM, Jim Twensky <[email protected]> wrote:
> > I have an iterative MapReduce job that I run over 35 GB of data
> > repeatedly. The output of the first job is the input to the second
> > one and it goes on like that until convergence.
> >
> > I am seeing a strange behavior with the program run time. The first
> > iteration takes 4 minutes to run and here is how the counters look:
> >
> > HDFS_BYTES_READ     34,860,867,377
> > HDFS_BYTES_WRITTEN  45,573,255,806
> >
> > The second iteration takes 15 minutes and here is how the counters
> > look in this case:
> >
> > HDFS_BYTES_READ     144,563,459,448
> > HDFS_BYTES_WRITTEN   49,779,966,388
> >
> > I cannot explain these numbers because the first iteration - to begin
> > with - should only generate approximately 35 GB of output. When I
> > check the output size using
> >
> >   hadoop fs -dus
> >
> > I can confirm that it is indeed 35 GB. But for some reason
> > HDFS_BYTES_WRITTEN shows 45 GB. Then the input to the second iteration
> > should be 35 GB (or even 45 GB considering the counter value), but
> > HDFS_BYTES_READ shows 144 GB.
> >
> > All following iterations produce counter values similar to the second
> > one and they take roughly 15 min each. My dfs replication factor is
> > set to 1 and there is no compression turned on. All inputs and outputs
> > are in SequenceFile format. The initial input is a sequence file that
> > I generated locally using SequenceFile.Writer, but I use the default
> > values and, as far as I know, compression should be turned off. Am I
> > wrong?
> >
> > Thanks in advance.
>
> --
> Harsh J
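P.S. For anyone who finds this thread later: the pattern Harsh describes looked roughly like the sketch below in my job. It is simplified, and the class name, the path and the idea of a lookup file are all made up for illustration. Every map task re-opens the same side file from HDFS in setup(), so all of those bytes are charged to the FileSystem-level HDFS_BYTES_READ counter, while only the task's own input split counts toward "map input bytes".

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per map task; with thousands of map tasks the same file is
        // pulled from HDFS thousands of times, all of it counted under the
        // FileSystem-level HDFS_BYTES_READ counter.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream in = fs.open(new Path("/data/lookup/side-file")); // made-up path
        try {
            // ... load the lookup data into memory ...
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... per-record work; only the task's assigned input split shows up
        // under the task-level "map input bytes" counter ...
        context.write(value, new LongWritable(1));
    }
}

Cutting the number of map tasks (as described above) or shipping such files through the DistributedCache keeps these repeated reads from dominating the run time.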
