Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?
@Ted,

This is a pseudo-distributed cluster on a machine that has 4GB of memory. If you give HBase 1.5GB for the region server, you are left with 2.5GB of memory for everything else. You will swap.

In short, nothing he can do will help. He’s screwed if he is looking to improve performance.

On Jan 11, 2015, at 12:19 AM, Ted Yu yuzhih...@gmail.com wrote:

> [...] This is not enough memory for good read performance. Please consider giving HBase more heap.
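The memory arithmetic in this message can be spelled out as a rough budget. The sketch below is only illustrative: the 4GB total and the 1.5GB HBase heap come from the thread, while every other heap size is a hypothetical placeholder, and real JVMs use more than their configured heap.

```python
# Rough memory budget for a single pseudo-distributed node, in megabytes.
# Only TOTAL_RAM_MB and the HBase heap are taken from the thread; all other
# figures are hypothetical placeholders, not measured or default values.
TOTAL_RAM_MB = 4 * 1024

heaps_mb = {
    "hbase_regionserver": 1536,  # 1.5GB, from the thread
    "hdfs_namenode": 512,        # assumed
    "hdfs_datanode": 512,        # assumed
    "mr_jobtracker": 512,        # assumed
    "mr_tasktracker": 512,       # assumed
    "mr_child_tasks": 400,       # assumed: two task JVMs at 200MB each
}


def remaining_mb(total_mb, heaps):
    """Return the RAM left once every configured heap is subtracted."""
    return total_mb - sum(heaps.values())


# What is left after HBase alone, as stated in the message above.
after_hbase = remaining_mb(TOTAL_RAM_MB, {"hbase_regionserver": 1536})

# What is left for the OS, page cache and Nutch once the assumed heaps are added.
after_all = remaining_mb(TOTAL_RAM_MB, heaps_mb)

print("left after HBase heap:", after_hbase, "MB")
print("left after all assumed heaps:", after_all, "MB")
```

With these assumed figures almost nothing remains for the operating system, which is the situation in which the kernel starts swapping.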
Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?
Dave:

As Michael suggests, you seem to be swapping, going by your Ganglia graph (the purple squiggles that often go above the 4G mark in the top right-hand memory graph). Swapping will put a stake in your throughput. Try lowering thresholds so you are not swapping. Stuff should run a little smoother. See http://hbase.apache.org/book.html#perf.os.swap

St.Ack

On Sun, Jan 11, 2015 at 6:49 AM, Michael Segel michael_se...@hotmail.com wrote:

> [...] If you give HBase 1.5GB for the region server… you are left with 2.5 GB of memory for everything else. You will swap. [...]
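The swapping diagnosis can also be checked without Ganglia. On Linux, /proc/meminfo reports SwapTotal and SwapFree in kB. The sketch below parses text in that format; the sample values are made up for illustration and are not from Dave's machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field_name: kilobytes} dict."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap currently in use: total swap minus free swap."""
    return info["SwapTotal"] - info["SwapFree"]


# Hypothetical sample in the /proc/meminfo format (values invented).
SAMPLE = """MemTotal:        4045404 kB
MemFree:           81234 kB
SwapTotal:       2097148 kB
SwapFree:        1048574 kB
"""

info = parse_meminfo(SAMPLE)
print("swap in use:", swap_used_kb(info), "kB")
```

On a live Linux machine the same functions can be fed `open("/proc/meminfo").read()` instead of `SAMPLE`. A swap-in-use figure that keeps growing while the count runs would support the diagnosis.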
Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?
St.Ack, Michael and Ted - thanks for your responses.

On Sun, Jan 11, 2015 at 1:38 PM, Stack st...@duboce.net wrote:

> [...] Try lowering thresholds so you are not swapping. Stuff should run a little smoother. See http://hbase.apache.org/book.html#perf.os.swap

St.Ack - I hope you can forgive the question, but which thresholds should I be lowering, exactly? Do you mean that I should decrease the HBase heapsize until RAM usage stays below 4G?

Thanks,
Dave
Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?
On Sun, Jan 11, 2015 at 1:35 PM, Dave Benson davehben...@gmail.com wrote:

> St.Ack, Michael and Ted - thanks for your responses. ...
> St.Ack - I hope you can forgive the question, but which thresholds should I be lowering, exactly? Do you mean that I should decrease the HBase heapsize until RAM usage stays below 4G?

Do whatever it takes to stop the swapping (buying an extra DRAM stick to put in the machine might be your best bet). Setting swappiness down low -- 0 or 1% or so -- would help, though it looks like swappiness is already set low on your system. Lower the HBase heap size, yeah -- to 1G or lower -- and lower any other process heaps you have running proportionally (e.g. cut down your Nutch MR task heap size allocations too -- if you can get away with it and still have them run to completion). You should be able to get a basic system running. It will likely be I/O bound, since with so little memory you cannot cache much in the HBase heap.

St.Ack
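The advice to lower the HBase heap and the other process heaps "proportionally" can be sketched as a simple scaling step. This is an illustration only: the 1.5GB HBase heap is from the thread, and the other heap sizes, the process names and the 2GB budget are hypothetical. The results would still have to be applied by hand, for example through HBASE_HEAPSIZE in conf/hbase-env.sh for HBase.

```python
def scale_heaps(heaps_mb, budget_mb):
    """Shrink every heap by the same ratio so the total fits in budget_mb.

    Integer arithmetic rounds each result down, so the scaled total never
    exceeds the budget. Heaps that already fit are returned unchanged.
    """
    total = sum(heaps_mb.values())
    if total <= budget_mb:
        return dict(heaps_mb)
    return {name: size * budget_mb // total for name, size in heaps_mb.items()}


# The HBase figure is from the thread; the other two are assumed.
current = {
    "hbase_regionserver": 1536,
    "hadoop_daemons": 1024,   # assumed combined heap
    "nutch_mr_tasks": 512,    # assumed combined heap
}

# Assumed budget: half of the 4GB machine reserved for JVM heaps.
scaled = scale_heaps(current, 2048)
for name, size in scaled.items():
    print(name, size, "MB")
```

With this assumed budget the HBase heap comes out at 1024MB, in line with the "1G or lower" suggestion. Whether the MapReduce tasks still run to completion at the reduced sizes has to be tested.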
Low CPU usage and slow reads in pseudo-distributed mode - how to fix?
Hi HBase users,

I'm working with HBase for the first time and I'm trying to sort out a performance issue. HBase is the data store for a small, focused web crawl I'm performing with Apache Nutch. I'm running in pseudo-distributed mode, meaning that Nutch, HBase and Hadoop are all on the same machine. The machine's a few years old and has only 4 gigs of RAM - much smaller than most HBase installs, I know.

When I first start my HBase processes I get about 60 seconds of fast performance. HBase reads quickly and uses a healthy portion of CPU cycles. After a minute or so, though, HBase slows dramatically. Reads sink to a glacial pace, and the CPU sits mostly idle.

I notice this pattern when I run Nutch - particularly during read-heavy operations - but also when I run a simple row counter from the shell. At the moment count 'my_table' takes almost 4 hours to read through 500 000 rows. The reading is much faster at the start than the end. In the first 30 seconds, HBase counts 37000 rows, but in the 30 seconds between 8:00 and 8:30, only 1000 are counted.

Looking through my Ganglia report I see a brief return to high performance around 3 hours into the count. I don't know what's causing this spike.

Can anyone suggest what configuration parameters I should change to improve read performance? Or what reference materials I should consult to better understand the problem? Again, I'm totally new to HBase.

I'm using HBase 0.90.4 and Hadoop 1.2.2. HBase has a heapsize of 1.5 Gigs. Here's a Ganglia report covering the 4 hours of count 'my_table': http://imgur.com/Aa3eukZ

Please let me know if I can provide any more information.

Many thanks,
Dave
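The figures in this post can be turned into read rates to show how large the slowdown is. The sketch below uses only numbers quoted above, with "almost 4 hours" treated as exactly 4 hours, which is an approximation.

```python
# Figures quoted in the post.
TOTAL_ROWS = 500_000
TOTAL_SECONDS = 4 * 3600  # "almost 4 hours", rounded up to 4 hours

fast_rate = 37_000 / 30   # rows/s during the first 30 seconds
slow_rate = 1_000 / 30    # rows/s between 8:00 and 8:30
overall_rate = TOTAL_ROWS / TOTAL_SECONDS

# How long the whole count would take if the initial rate were sustained.
seconds_at_fast_rate = TOTAL_ROWS / fast_rate

print(f"fast rate:    {fast_rate:.0f} rows/s")
print(f"slow rate:    {slow_rate:.1f} rows/s")
print(f"overall rate: {overall_rate:.1f} rows/s")
print(f"slowdown:     {fast_rate / slow_rate:.0f}x")
print(f"full count at fast rate: {seconds_at_fast_rate / 60:.1f} minutes")
```

The overall rate is almost the same as the slow rate, so nearly the whole 4 hours is spent in the slow regime. At the initial rate the same count would finish in under 7 minutes.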
Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?
Please see http://hbase.apache.org/book.html#perf.reading

I guess you use 0.90.4 because of Nutch integration. Still, 0.90.x is way too old.

bq. HBase has a heapsize of 1.5 Gigs

This is not enough memory for good read performance. Please consider giving HBase more heap.

Cheers

On Sat, Jan 10, 2015 at 4:04 PM, Dave Benson davehben...@gmail.com wrote:

> [...] I'm using HBase 0.90.4 and Hadoop 1.2.2. HBase has a heapsize of 1.5 Gigs. [...]