>> Do you have profiling output from your HFile writers?
Do you mean the debug output in the logs?

Could it also be due to the numerous per-block queries to the namenode, now
that the block size is so low?
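
To quantify that, here is a minimal sketch (assuming the 0.20-era
"dfs.block.size" key) that prints the HDFS block size in effect. During a
write the client contacts the namenode roughly once per HDFS block, so this
number bounds the per-file namenode traffic:

  import org.apache.hadoop.conf.Configuration;

  public class BlockSizeCheck {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // HDFS block size in bytes; the stock default is 64 MB.
      long hdfsBlockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
      System.out.println("HDFS block size: " + hdfsBlockSize + " bytes");
    }
  }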

Thank you
V

On 6/11/10 3:01 PM, "Todd Lipcon" <[email protected]> wrote:

Hi Vidhya,

Do you have profiling output from your HFile writers?

Since you have a standalone program that should be doing little except
writing, I imagine the profiler output would be pretty useful in seeing
where the bottleneck lies.
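
For example, with the stock Sun JDK hprof agent (the main class and output
file names here are just placeholders):

  java -agentlib:hprof=cpu=samples,interval=10,depth=8,file=writer.hprof.txt \
      YourWriterMain

cpu=samples adds little overhead, so you can leave it on for a whole 2 GB
file and then read off the hottest stacks from the CPU SAMPLES section at
the bottom of the output.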

My guess is that you're CPU-bound on serialization - serialization is often
slow, slow, slow.

-Todd


On Fri, Jun 11, 2010 at 2:54 PM, Vidhyashankar Venkataraman <
[email protected]> wrote:

> For the last couple of days I have been running into bottleneck issues
> with writing HFiles that I am unable to figure out. I am using HFile.Writer
> to prepare a bunch of HFiles (HFile is similar to a TFile) for bulk
> loading, and I have been getting suspiciously low values for the
> throughput.
>
> I am not using MR to create my files. I prepare the data on the fly and
> dump HFiles almost exactly the way HFileOutputFormat does.
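>
> For concreteness, the write loop is essentially the sketch below. I am
> assuming the 0.20-era HFile.Writer constructor that HFileOutputFormat also
> uses; the class name, row count, and key/value generation are placeholders:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hbase.io.hfile.HFile;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class StandaloneHFileWriter {
>     static final int NUM_ROWS = 140000;  // ~2 GB at ~15 KB per row
>
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       FileSystem fs = FileSystem.get(conf);
>       // 1 MB HFile block size, no compression, raw-byte key ordering.
>       HFile.Writer writer = new HFile.Writer(fs, new Path(args[0]),
>           1024 * 1024, "none", Bytes.BYTES_RAWCOMPARATOR);
>       try {
>         byte[] value = new byte[15 * 1024];  // placeholder ~15 KB payload
>         for (int i = 0; i < NUM_ROWS; i++) {
>           // Zero-padded keys so append() sees them in comparator order.
>           byte[] key = Bytes.toBytes(String.format("row%010d", i));
>           writer.append(key, value);
>         }
>       } finally {
>         writer.close();
>       }
>     }
>   }
>
> (HFileOutputFormat itself appends serialized KeyValue keys under
> KeyValue.KEY_COMPARATOR; plain byte keys with the raw comparator just keep
> the sketch self-contained.)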
>
> This is my current setup (similar to what I described in my previous
> emails):
> Individual output file size is 2 GB, with a block size of 1 MB. I am
> writing multiple such files to build the entire db; each client program
> writes files one after another.
> Each key-value pair is around 15 KB.
> 5 datanodes.
> Each datanode also runs 5 instances of my client program (25 processes in
> all). And I get a throughput of around 100 rows per second per node, which
> at ~15 KB per row comes to around 1.5 MB/s per node.
> As expected, neither the disk nor the network is the bottleneck.
>
> Are there any config values that I need to take care of?
>
>
> With Hadoop's copyFromLocal command, I can get much better throughput:
> 50 MB/s with just one process (of course, the block size is much larger in
> that case).
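>
> For reference, that baseline came from nothing fancier than the following
> (paths hypothetical):
>
>   hadoop fs -copyFromLocal /local/2gb-file /user/vidhyash/2gb-file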
>
> Thanks in advance :)
> Vidhya
>
> On 6/11/10 12:44 PM, "Pei Lin Ong" <[email protected]> wrote:
>
> Hi Milind and Koji,
>
> Vidhya is one of the Search devs working on the Web Crawl Cache cluster
> (ingesting crawled content from Bing).
>
> He is currently looking at different technology choices, such as HBase, for
> the cluster configuration. Vidhya has run into a Hadoop HDFS issue and is
> looking for help.
>
> I have suggested he pose the question via this thread as Vidhya indicates
> it is urgent due to the WCC timetable.
>
> Please accommodate this request and see if you can answer Vidhya's question
> (after he poses it). Should the question require further discussion, then
> Vidhya or I will file a ticket.
>
> Thank you!
> Pei
>



--
Todd Lipcon
Software Engineer, Cloudera
