Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?

2015-01-11 Thread Michael Segel
@Ted, 
Pseudo cluster on a machine that has 4GB of memory. 
If you give HBase 1.5GB for the region server… you are left with 2.5 GB of 
memory for everything else. 
You will swap. 

In short, nothing he can do will help. He’s screwed if he is trying to 
improve performance. 
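
To make that arithmetic concrete, here is a rough memory budget for a 4GB pseudo-distributed box. The process list and per-process figures below are illustrative assumptions about a typical single-machine stack, not Dave's actual settings:

```python
# Hypothetical memory budget (MB) for one 4GB machine running the whole
# stack. Every figure except the 1.5G region-server heap is an assumed,
# typical value for illustration only.
total_ram_mb = 4096

budget_mb = {
    "HBase region server heap": 1536,       # the 1.5G from the thread
    "HDFS NameNode + DataNode": 1024,       # assumed ~512 MB each
    "ZooKeeper": 256,                       # assumed
    "Nutch MapReduce tasks": 1024,          # assumed 2 tasks x 512 MB
    "OS + page cache + JVM overhead": 512,  # assumed floor
}

used = sum(budget_mb.values())
print(f"budgeted: {used} MB of {total_ram_mb} MB")
print("will swap" if used > total_ram_mb else "fits")
```

Under these assumptions the budget comes to 4352 MB against 4096 MB of RAM, which is Michael's point: the box is oversubscribed before any page cache is left for reads.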


On Jan 11, 2015, at 12:19 AM, Ted Yu yuzhih...@gmail.com wrote:

 Please see http://hbase.apache.org/book.html#perf.reading
 
 I guess you use 0.90.4 because of the Nutch integration. Still, 0.90.x is way
 too old.
 
 bq. HBase has a heapsize of 1.5 Gigs
 
 This is not enough memory for good read performance. Please consider giving
 HBase more heap.
 
 Cheers
 
 
 On Sat, Jan 10, 2015 at 4:04 PM, Dave Benson davehben...@gmail.com wrote:
 
  ...
 



Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?

2015-01-11 Thread Stack
Dave:

As Michael suggests, you seem to be swapping, going by your Ganglia graph
(the purple squiggles that often go above the 4G mark in the top right-hand
memory graph).  Swapping will put a stake in your throughput.  Try lowering
thresholds so you are not swapping.  Stuff should run a little smoother.
See http://hbase.apache.org/book.html#perf.os.swap

St.Ack

On Sun, Jan 11, 2015 at 6:49 AM, Michael Segel michael_se...@hotmail.com
wrote:

 @Ted,
 Pseudo cluster on a machine that has 4GB of memory.
 If you give HBase 1.5GB for the region server… you are left with 2.5 GB of
 memory for everything else.
 You will swap.

 In short, nothing he can do will help. He’s screwed if he is trying to
 improve performance.


 On Jan 11, 2015, at 12:19 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  ...
 




Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?

2015-01-11 Thread Dave Benson
St.Ack, Michael and Ted - thanks for your responses.


On Sun, Jan 11, 2015 at 1:38 PM, Stack st...@duboce.net wrote:

 Dave:

 As Michael suggests, you seem to be swapping, going by your Ganglia graph
 (the purple squiggles that often go above the 4G mark in the top right-hand
 memory graph).  Swapping will put a stake in your throughput.  Try lowering
 thresholds so you are not swapping.  Stuff should run a little smoother.
 See http://hbase.apache.org/book.html#perf.os.swap


St.Ack - I hope you can forgive the question, but which thresholds should
I be lowering, exactly? Do you mean that I should decrease the HBase
heapsize until RAM usage stays below 4G?

Thanks,


Dave



Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?

2015-01-11 Thread Stack
On Sun, Jan 11, 2015 at 1:35 PM, Dave Benson davehben...@gmail.com wrote:

 St.Ack, Michael and Ted - thanks for your responses.

 ...

 St.Ack - I hope you can forgive the question, but which thresholds should
 I be lowering, exactly? Do you mean that I should decrease the HBase
 heapsize until RAM usage stays below 4G?


Do whatever it takes to stop the swapping (buying an extra DRAM stick to
put in the machine might be your best bet).  Setting vm.swappiness down low --
0 or 1 or so -- would help, though it looks like swappiness is already set
low on your system.  Lower the HBase heap size, yeah -- 1G or lower -- and
any other process heaps you have running proportionally (e.g. cut down your
Nutch MR task heap size allocations too, if you can get away with it and
have the tasks still run to completion).

You should be able to get a basic system running. It will likely be i/o
bound, since with so little heap HBase cannot cache much.

St.Ack
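
A concrete sketch of the knobs St.Ack mentions. The commands and file locations are the usual ones for a Linux single-node install; the specific values (swappiness of 1, a 1000 MB heap) are assumptions to experiment with, not a verified recipe:

```shell
# Check current swappiness and watch for swap activity before changing anything.
cat /proc/sys/vm/swappiness
vmstat 5 3          # nonzero si/so columns mean the box is actively swapping

# Lower swappiness for the running kernel (0-100 scale; needs root).
sudo sysctl vm.swappiness=1
# Persist the setting across reboots.
echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf

# Shrink the HBase heap in conf/hbase-env.sh (value is in MB):
#   export HBASE_HEAPSIZE=1000
```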


Low CPU usage and slow reads in pseudo-distributed mode - how to fix?

2015-01-10 Thread Dave Benson
Hi HBase users,

I'm working with HBase for the first time and I'm trying to sort out a
performance issue. HBase is the data store for a small, focused web crawl
I'm performing with Apache Nutch. I'm running in pseudo-distributed mode,
meaning that Nutch, HBase and Hadoop are all on the same machine. The
machine's a few years old and has only 4 gigs of RAM - much smaller than
most HBase installs, I know.

When I first start my HBase processes I get about 60 seconds of fast
performance. HBase reads quickly and uses a healthy portion of CPU cycles.
After a minute or so, though, HBase slows dramatically. Reads sink to a
glacial pace, and the CPU sits mostly idle.

I notice this pattern when I run Nutch - particularly during read-heavy
operations - but also when I run a simple row counter from the shell.

At the moment  count 'my_table'  takes almost 4 hours to read through
500,000 rows. The reading is much faster at the start than at the end.  In the
first 30 seconds, HBase counts 37,000 rows, but in the 30 seconds between
the 8:00 and 8:30 marks, only 1,000 are counted.
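
Putting numbers on that slowdown (simple arithmetic on the figures above):

```python
# Rough throughput figures from the count described above.
rows_total = 500_000
duration_s = 4 * 3600            # "almost 4 hours"

initial_rate = 37_000 / 30       # rows/s in the first 30 seconds
later_rate = 1_000 / 30          # rows/s between the 8:00 and 8:30 marks
average_rate = rows_total / duration_s

print(f"initial:  {initial_rate:.0f} rows/s")
print(f"later:    {later_rate:.0f} rows/s")
print(f"average:  {average_rate:.0f} rows/s")
print(f"slowdown: {initial_rate / later_rate:.0f}x")
```

That is roughly 1,233 rows/s at the start falling to about 33 rows/s, a ~37x slowdown.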

Looking through my Ganglia report I see a brief return to high performance
around 3 hours into the count. I don't know what's causing this spike.


Can anyone suggest what configuration parameters I should change to improve
read performance?  Or what reference materials I should consult to better
understand the problem?  Again, I'm totally new to HBase.

I'm using HBase 0.90.4 and Hadoop 1.2.2. HBase has a heapsize of 1.5 Gigs.

Here's a Ganglia report covering the 4 hours of  count 'my_table' :
http://imgur.com/Aa3eukZ

Please let me know if I can provide any more information.

Many thanks,


Dave


Re: Low CPU usage and slow reads in pseudo-distributed mode - how to fix?

2015-01-10 Thread Ted Yu
Please see http://hbase.apache.org/book.html#perf.reading

I guess you use 0.90.4 because of the Nutch integration. Still, 0.90.x is way
too old.

bq. HBase has a heapsize of 1.5 Gigs

This is not enough memory for good read performance. Please consider giving
HBase more heap.

Cheers


On Sat, Jan 10, 2015 at 4:04 PM, Dave Benson davehben...@gmail.com wrote:

 ...