> That was the HFile block size.. How is this 'block' different from that of HDFS?
Never mind.. Got the answer.
Thank you
Vidhya

On 6/11/10 3:13 PM, "Todd Lipcon" <[email protected]> wrote:

On Fri, Jun 11, 2010 at 3:07 PM, Vidhyashankar Venkataraman <[email protected]> wrote:

> >> Do you have profiling output from your HFile writers?
> Do you mean the debug output in the logs?

I was suggesting running a Java profiler (e.g. YourKit or the built-in hprof profiler) to see where the time is going. I recall you saying you're new-ish to Java, but I know some of the Grid Solutions guys over there are pretty expert users of the profiler.

> Can it also be due to the numerous per-block queries to the namenode? (Now that the block size is so low)

I wasn't clear: is the 1 MB block size your HDFS block size or your HFile block size? I wouldn't recommend such a tiny HDFS block size - we usually go 128 MB or 256 MB. It could definitely slow you down.

-Todd

> On 6/11/10 3:01 PM, "Todd Lipcon" <[email protected]> wrote:
>
> Hi Vidhya,
>
> Do you have profiling output from your HFile writers?
>
> Since you have a standalone program that should be doing little except writing, I imagine the profiler output would be pretty useful in seeing where the bottleneck lies.
>
> My guess is that you're CPU bound on serialization - serialization is often slow slow slow.
>
> -Todd
>
> On Fri, Jun 11, 2010 at 2:54 PM, Vidhyashankar Venkataraman <[email protected]> wrote:
>
> > The last couple of days I have been running into some bottleneck issues with writing HFiles that I am unable to figure out. I am using HFile.Writer to prepare a bunch of HFiles (an HFile is similar to a TFile) for bulk loading, and I have been getting suspiciously low values for the throughput..
> >
> > I am not using MR to create my files.. I prepare data on the fly and dump HFiles almost exactly like HFileOutputFormat does..
> >
> > This is my current setup (almost the same as in my previous emails):
> > Individual output file size is 2 GB.. Block size of 1 MB. I am writing multiple such files to build the entire db.. Each client program writes files one after another..
> > Each key-value pair is around 15 KB..
> > 5 datanodes..
> > Each dn also runs 5 instances of my client program (25 processes in all).
> > And I get a throughput of around 100 rows per second per node (that comes to around 1.5 MBps per node).
> > Expectedly, neither the disk nor the network is the bottleneck..
> >
> > Are there any config values that I need to take care of?
> >
> > With the copyFromLocal command of Hadoop, I can get much better throughput: 50 MBps with just one process.. (of course, the block size is much larger in that case)..
> >
> > Thanks in advance :)
> > Vidhya
> >
> > On 6/11/10 12:44 PM, "Pei Lin Ong" <[email protected]> wrote:
> >
> > Hi Milind and Koji,
> >
> > Vidhya is one of the Search devs working on the Web Crawl Cache cluster (ingesting crawled content from Bing).
> >
> > He is currently looking at different technology choices, such as HBase, for the cluster configuration. Vidhya has run into a Hadoop HDFS issue and is looking for help.
> >
> > I have suggested he pose the question via this thread, as Vidhya indicates it is urgent due to the WCC timetable.
> >
> > Please accommodate this request and see if you can answer Vidhya's question (after he poses it). Should the question require further discussion, then Vidhya or I will file a ticket.
> >
> > Thank you!
> > Pei
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
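
The two "block sizes" in this thread live at different layers: the HFile block size is the size of the in-file index/compression blocks set on the writer, while the HDFS block size (dfs.block.size, later dfs.blocksize) is the unit the NameNode tracks and the DataNodes store. The sketch below is a minimal standalone HFile writer that sets both explicitly. It is written against a later HBase API than the 0.20-era code discussed here (HFile.getWriterFactory / HFileContextBuilder), so the class name, output path, and row/family/qualifier values are illustrative assumptions, not the code Vidhya was running.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class HFileWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // HDFS block size: the unit the NameNode tracks. Normally left at the
    // cluster default (128 MB / 256 MB); set here only to make the contrast
    // with the HFile block size explicit.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

    // HFile block size: the in-file index/compression block. This is the
    // "1 MB block size" being discussed in the thread.
    HFileContext fileContext = new HFileContextBuilder()
        .withBlockSize(1024 * 1024)
        .build();

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/sketch.hfile");  // hypothetical output path

    HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
        .withPath(fs, path)
        .withFileContext(fileContext)
        .create();
    try {
      // Keys must be appended in sorted order; the ~15 KB value matches the
      // setup described above.
      KeyValue kv = new KeyValue(Bytes.toBytes("row-000001"),
          Bytes.toBytes("f"), Bytes.toBytes("q"), new byte[15 * 1024]);
      writer.append(kv);
    } finally {
      writer.close();
    }
  }
}

At these sizes the distinction matters: a 2 GB output file is 16 HDFS blocks at 128 MB but roughly 2,000 blocks at a 1 MB HDFS block size, which is where the extra per-block NameNode traffic Vidhya asked about would come from. The HFile block size, by contrast, only affects the in-file index (the HBase default is 64 KB).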

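Todd's profiling suggestion needs no code changes: the hprof agent bundled with the JDK (up through Java 8) can be attached at launch. A sampling run against a standalone writer such as the sketch above would look roughly like this, with the option values being ordinary defaults rather than anything from the thread:

java -agentlib:hprof=cpu=samples,interval=10,depth=10,file=hfile-writer.hprof.txt HFileWriterSketch

The output file ends with a ranked CPU SAMPLES section of stack traces, which is usually enough to tell whether the time is going into serialization and compression or into waiting on the DFS output stream.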