For the last couple of days I have been running into a bottleneck writing HFiles that I am unable to figure out. I am using HFile.Writer to prepare a set of HFiles (an HFile is similar to a TFile) for bulk loading, and I have been getting suspiciously low throughput.

I am not using MapReduce to create the files. I prepare the data on the fly and write HFiles almost exactly the way HFileOutputFormat does.

This is my current setup (largely the same as what I described in my previous emails):
Individual output files are 2 GB each, with a block size of 1 MB. I am writing multiple such files to build the entire database; each client program writes its files one after another.
Each key-value pair is around 15 KB.
There are 5 datanodes.
Each datanode also runs 5 instances of my client program (25 processes in all).
I get a throughput of around 100 rows per second per node, which comes to around 1.5 MB/s per node (100 rows/s x ~15 KB per row).
As expected at these rates, neither the disk nor the network is the bottleneck.
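
For reference, the write loop in each of my client processes looks essentially like the sketch below (simplified, with placeholder paths and a made-up key format; this is written against the 0.20-era HFile.Writer constructor, and I am treating the 1 MB figure above as the HFile block size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/bulkload/part-00000");   // placeholder output path

    // 1 MB HFile block size, no compression -- mirrors the setup above.
    HFile.Writer writer = new HFile.Writer(fs, out, 1024 * 1024,
        Compression.Algorithm.NONE.getName(), Bytes.BYTES_RAWCOMPARATOR);
    try {
      byte[] value = new byte[15 * 1024];          // ~15 KB values, as above
      for (int i = 0; i < 1000; i++) {             // keys must be appended in sorted order
        byte[] key = Bytes.toBytes(String.format("row%08d", i));
        writer.append(key, value);
      }
    } finally {
      writer.close();
    }
  }
}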

Are there any config parameters that I should be tuning?


With Hadoop's copyFromLocal command I can get much better throughput: 50 MB/s with just one process (though of course the block size is much larger in that case).
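
(By copyFromLocal I mean the stock HDFS client copy, i.e. something like the following with placeholder paths, which writes with the cluster's default HDFS block size:)

hadoop fs -copyFromLocal /local/path/to/bigfile /hdfs/path/to/bigfile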

Thanks in advance :)
Vidhya

On 6/11/10 12:44 PM, "Pei Lin Ong" <[email protected]> wrote:

Hi Milind and Koji,

Vidhya is one of the Search devs working on the Web Crawl Cache cluster (ingesting crawled content from Bing).

He is currently looking at different technology choices, such as HBase, for the 
cluster configuration. Vidhya has run into a Hadoop HDFS issue and is looking 
for help.

I have suggested that he pose the question via this thread, as Vidhya indicates it is urgent due to the WCC timetable.

Please accommodate this request and see if you can answer Vidhya's question 
(after he poses it). Should the question require further discussion, then 
Vidhya or I will file a ticket.

Thank you!
Pei
