>> Do you have profiling output from your HFile writers?

Do you mean the debug output in the logs?
Can it also be due to the numerous per-block queries to the namenode, now
that the block size is so low?

Thank you,
V

On 6/11/10 3:01 PM, "Todd Lipcon" <[email protected]> wrote:

Hi Vidhya,

Do you have profiling output from your HFile writers? Since you have a
standalone program that should be doing little except writing, I imagine
the profiler output would be pretty useful in seeing where the bottleneck
lies. My guess is that you're CPU bound on serialization - serialization
is often slow, slow, slow.

-Todd

On Fri, Jun 11, 2010 at 2:54 PM, Vidhyashankar Venkataraman <
[email protected]> wrote:

> For the last couple of days I have been running into some bottleneck
> issues with writing HFiles that I am unable to figure out. I am using
> HFile.Writer to prepare a set of HFiles (HFile is similar to a TFile)
> for bulk loading, and I have been getting suspiciously low values for
> the throughput.
>
> I am not using MR to create my files. I prepare data on the fly and
> write HFiles almost exactly the way HFileOutputFormat does.
>
> This is my current setup (almost the same as in my previous emails):
> Individual output file size is 2 GB, with a block size of 1 MB. I am
> writing multiple such files to build the entire db. Each client program
> writes files one after another.
> Each key-value pair is around 15 KB.
> 5 datanodes.
> Each datanode also runs 5 instances of my client program (25 processes
> in all).
> And I get a throughput of around 100 rows per second per node (that
> comes to around 1.5 MB/s per node).
> Expectedly, neither the disk nor the network is the bottleneck.
>
> Are there any config values that I need to take care of?
>
> With the copyFromLocal command of Hadoop, I can get much better
> throughput: 50 MB/s with just one process (of course, the block size is
> much larger in that case).
>
> Thanks in advance :)
> Vidhya
>
> On 6/11/10 12:44 PM, "Pei Lin Ong" <[email protected]> wrote:
>
> Hi Milind and Koji,
>
> Vidhya is one of the Search devs working on the Web Crawl Cache cluster
> (ingesting crawled content from Bing).
>
> He is currently looking at different technology choices, such as HBase,
> for the cluster configuration. Vidhya has run into a Hadoop HDFS issue
> and is looking for help.
>
> I have suggested he pose the question via this thread, as Vidhya
> indicates it is urgent due to the WCC timetable.
>
> Please accommodate this request and see if you can answer Vidhya's
> question (after he poses it). Should the question require further
> discussion, then Vidhya or I will file a ticket.
>
> Thank you!
> Pei

--
Todd Lipcon
Software Engineer, Cloudera
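
For context, the write path under discussion reduces to a loop over
HFile.Writer.append(). Below is a minimal standalone sketch of the kind of
writer Vidhya describes, assuming the HBase 0.20-era
org.apache.hadoop.hbase.io.hfile.HFile API; the class name, output path,
and row count are illustrative, and exact Writer constructor signatures
varied across 0.20.x releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriteBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // 1 MB HFile block size and no compression, matching the setup in
    // the thread. Note the HFile block size is internal to the file and
    // distinct from the HDFS block size.
    HFile.Writer writer = new HFile.Writer(
        fs, new Path("/bulkload/test.hfile"),   // illustrative path
        1024 * 1024, "none", Bytes.BYTES_RAWCOMPARATOR);

    byte[] value = new byte[15 * 1024];  // ~15 KB values, as described
    int rows = 100000;                   // illustrative row count
    long start = System.currentTimeMillis();
    for (int i = 0; i < rows; i++) {
      // HFile requires keys to be appended in sorted order; zero-padded
      // row numbers keep the byte ordering correct.
      writer.append(Bytes.toBytes(String.format("row%010d", i)), value);
    }
    writer.close();

    long ms = System.currentTimeMillis() - start;
    System.out.println(rows + " rows in " + ms + " ms ("
        + (rows * 1000L / Math.max(ms, 1)) + " rows/sec)");
  }
}

Running a writer like this under the JDK's built-in sampler, e.g.
java -agentlib:hprof=cpu=samples,depth=10 HFileWriteBench, produces the
kind of profiling output Todd asks for, and would show whether the time is
going to CPU (serialization, comparators, compression) or to the DFSClient
write pipeline.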
