For the last couple of days I have been running into a write-throughput bottleneck with HFiles that I am unable to figure out. I am using HFile.Writer to prepare a batch of HFiles (an HFile is similar to a TFile) for bulk loading, and I have been getting suspiciously low throughput numbers.
I am not using MR to create my files; I prepare the data on the fly and dump HFiles almost exactly the way HFileOutputFormat does. My current setup (much the same as described in my previous emails):

- Individual output file size: 2 GB
- Block size: 1 MB
- Each key-value pair is around 15 KB
- I write multiple such files to build the entire db; each client program writes its files one after another
- 5 datanodes, with each datanode also running 5 instances of my client program (25 processes in all)

With this I get a throughput of around 100 rows per second per node, which comes to around 1.5 MBps per node. As expected, neither the disk nor the network is the bottleneck. Are there any config values that I need to take care of?

For comparison, with Hadoop's copyFromLocal command I get much better throughput: about 50 MBps with just one process (of course, the block size is much larger in that case).

Thanks in advance :)

Vidhya

On 6/11/10 12:44 PM, "Pei Lin Ong" <[email protected]> wrote:

Hi Milind and Koji,

Vidhya is one of the Search devs working on the Web Crawl Cache cluster (ingesting crawled content from Bing). He is currently looking at different technology choices, such as HBase, for the cluster configuration. Vidhya has run into a Hadoop HDFS issue and is looking for help. I have suggested he pose the question via this thread, as Vidhya indicates it is urgent due to the WCC timetable. Please accommodate this request and see if you can answer Vidhya's question (after he poses it). Should the question require further discussion, then Vidhya or I will file a ticket.

Thank you!
Pei
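For reference, a minimal sketch of the kind of standalone HFile.Writer loop described above, sized to the setup in the question (1 MB blocks, ~15 KB key-value pairs, ~2 GB per file). It assumes the HBase 0.20-era HFile.Writer constructor that HFileOutputFormat used (filesystem, path, block size, compression name, key comparator); exact signatures differ across HBase versions, and the output path, column family, qualifier, and row count are illustrative, not taken from the actual WCC setup.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.HFile;

public class HFileWriteSketch {

  // Illustrative values only -- sized to match the setup described above:
  // ~15 KB per key-value pair, ~2 GB per output file.
  private static final byte[] FAMILY = "cf".getBytes();
  private static final byte[] QUALIFIER = "q".getBytes();
  private static final int VALUE_SIZE = 15 * 1024;   // ~15 KB value
  private static final long NUM_ROWS = 140000L;      // roughly 2 GB / 15 KB

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // 1 MB HFile block size, no compression, keys ordered with the
    // same comparator HFileOutputFormat uses.
    HFile.Writer writer = new HFile.Writer(
        fs,
        new Path("/bulkload/hfile-00001"),  // illustrative output path
        1024 * 1024,                        // HFile block size: 1 MB
        "none",                             // compression
        KeyValue.KEY_COMPARATOR);

    byte[] value = new byte[VALUE_SIZE];
    try {
      // Keys must be appended in sorted order, as HFileOutputFormat guarantees.
      for (long i = 0; i < NUM_ROWS; i++) {
        byte[] row = String.format("row%012d", i).getBytes();
        writer.append(new KeyValue(row, FAMILY, QUALIFIER,
            System.currentTimeMillis(), value));
      }
    } finally {
      writer.close();
    }
  }
}

The block size and value size here are the two knobs called out in the setup above; everything else is a plain append loop with no per-row sync, mirroring what HFileOutputFormat does.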
