> That was the HFile block size.. How is this 'block' different from that of HDFS?
Never mind.. Got the answer.
Thank you
Vidhya

On 6/11/10 3:13 PM, "Todd Lipcon" <[email protected]> wrote:

On Fri, Jun 11, 2010 at 3:07 PM, Vidhyashankar Venkataraman <[email protected]> wrote:

> >> Do you have profiling output from your HFile writers?
> Do you mean the debug output in the logs?

I was suggesting running a Java profiler (e.g. YourKit or the built-in hprof profiler) to see where the time is going. I recall you saying you're new-ish to Java, but I know some of the Grid Solutions guys over there are pretty expert users of the profiler.

> Can it also be due to the numerous per-block queries to the namenode? (Now that the block size is so low)

I wasn't clear: is the 1 MB block size your HDFS block size or your HFile block size? I wouldn't recommend such a tiny HDFS block size - we usually go 128 MB or 256 MB. It could definitely slow you down.

-Todd

> On 6/11/10 3:01 PM, "Todd Lipcon" <[email protected]> wrote:
>
> Hi Vidhya,
>
> Do you have profiling output from your HFile writers?
>
> Since you have a standalone program that should be doing little except writing, I imagine the profiler output would be pretty useful in seeing where the bottleneck lies.
>
> My guess is that you're CPU bound on serialization - serialization is often slow slow slow.
>
> -Todd
>
> On Fri, Jun 11, 2010 at 2:54 PM, Vidhyashankar Venkataraman <[email protected]> wrote:
>
> > The last couple of days I have been running into some bottleneck issues with writing HFiles that I am unable to figure out. I am using HFile.Writer to prepare a bunch of HFiles (an HFile is similar to a TFile) for bulk loading, and I have been getting suspiciously low values for the throughput..
> >
> > I am not using MR to create my files.. I prepare data on the fly and dump HFiles almost exactly like HFileOutputFormat does..
> >
> > This is my current setup (almost the same as in my previous emails):
> > Individual output file size is 2 GB.. Block size of 1 MB. I am writing multiple such files to build the entire db.. Each client program writes files one after another..
> > Each key-value pair is around 15 KB..
> > 5 datanodes..
> > Each dn also runs 5 instances of my client program (25 processes in all).
> > And I get a throughput of around 100 rows per second per node (that comes to around 1.5 MBps per node).
> > Expectedly, neither the disk nor the network is the bottleneck..
> >
> > Are there any config values that I need to take care of?
> >
> > With the copyFromLocal command of Hadoop, I can get much better throughput: 50 MBps with just one process.. (of course, the block size is much larger in that case)..
> >
> > Thanks in advance :)
> > Vidhya
> >
> > On 6/11/10 12:44 PM, "Pei Lin Ong" <[email protected]> wrote:
> >
> > Hi Milind and Koji,
> >
> > Vidhya is one of the Search devs working on the Web Crawl Cache cluster (ingesting crawled content from Bing).
> >
> > He is currently looking at different technology choices, such as HBase, for the cluster configuration. Vidhya has run into a Hadoop HDFS issue and is looking for help.
> >
> > I have suggested he pose the question via this thread, as Vidhya indicates it is urgent due to the WCC timetable.
> >
> > Please accommodate this request and see if you can answer Vidhya's question (after he poses it). Should the question require further discussion, then Vidhya or I will file a ticket.
> >
> > Thank you!
> > Pei
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
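
The two "block sizes" in this thread live at different layers: the HFile block size is the size of the in-file index/compression blocks set on the writer, while the HDFS block size (dfs.block.size, later dfs.blocksize) is the unit the NameNode tracks and the DataNodes store. The sketch below is a minimal standalone HFile writer that sets both explicitly. It is written against a later HBase API than the 0.20-era code discussed here (HFile.getWriterFactory / HFileContextBuilder), so the class name, output path, and row/family/qualifier values are illustrative assumptions, not the code Vidhya was running.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // HDFS block size: the unit the NameNode tracks. Normally left at the
    // cluster default (128 MB / 256 MB); set here only to make the contrast
    // with the HFile block size explicit.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

    // HFile block size: the in-file index/compression block. This is the
    // "1 MB block size" being discussed in the thread.
    HFileContext fileContext = new HFileContextBuilder()
        .withBlockSize(1024 * 1024)
        .build();

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/sketch.hfile");  // hypothetical output path

    HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
        .withPath(fs, path)
        .withFileContext(fileContext)
        .create();
    try {
      // Keys must be appended in sorted order; the ~15 KB value matches the
      // setup described above.
      KeyValue kv = new KeyValue(Bytes.toBytes("row-000001"),
          Bytes.toBytes("f"), Bytes.toBytes("q"), new byte[15 * 1024]);
      writer.append(kv);
    } finally {
      writer.close();
    }
  }
}

At these sizes the distinction matters: a 2 GB output file is 16 HDFS blocks at 128 MB but roughly 2,000 blocks at a 1 MB HDFS block size, which is where the extra per-block NameNode traffic Vidhya asked about would come from. The HFile block size, by contrast, only affects the in-file index (the HBase default is 64 KB).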

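Todd's profiling suggestion needs no code changes: the hprof agent bundled with the JDK (up through Java 8) can be attached at launch. A sampling run against a standalone writer such as the sketch above would look roughly like this, with the option values being ordinary defaults rather than anything from the thread:

java -agentlib:hprof=cpu=samples,interval=10,depth=10,file=hfile-writer.hprof.txt HFileWriterSketch

The output file ends with a ranked CPU SAMPLES section of stack traces, which is usually enough to tell whether the time is going into serialization and compression or into waiting on the DFS output stream.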