Re: Performance at large number of regions/node

2010-06-02 Thread Jacob Isaac
On Wed, Jun 2, 2010 at 9:55 AM, Stack wrote: > On Wed, Jun 2, 2010 at 9:39 AM, Jacob Isaac wrote: > > That's right sha-1 is my set identifier. > > and yes I scan since I know the start-row and end-row - I create a Scan > > object with startRow and stopRow. > > > > > > Good. > > >> > >> > The Fai

Re: Performance at large number of regions/node

2010-06-02 Thread Stack
On Wed, Jun 2, 2010 at 9:39 AM, Jacob Isaac wrote: > That's right sha-1 is my set identifier. > and yes I scan since I know the start-row and end-row - I create a Scan > object with startRow and stopRow. > > Good. >> >> > The Failed openScanner messages seem to suggest some region name cache >
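
For readers following the startRow/stopRow point, here is a minimal sketch of the range semantics only. It does not use the HBase client; a sorted in-memory map stands in for the table. The key layout (set identifier followed by an element suffix), the class name SetRangeScan, and the sample keys are assumptions made up for illustration and are not taken from the thread. The start row is treated as inclusive and the stop row as exclusive.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified model of a bounded scan over sorted row keys (not HBase client code).
final class SetRangeScan {

    // Smallest key that sorts after every key starting with the given prefix.
    // Assumes the prefix is non-empty and its last char is not Character.MAX_VALUE.
    static String stopRowFor(String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    // Returns the rows of one set: start row inclusive, stop row exclusive.
    static NavigableMap<String, String> scanSet(NavigableMap<String, String> table, String setId) {
        return table.subMap(setId, true, stopRowFor(setId), false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        // Hypothetical row keys: "<set id>:<element id>"
        table.put("aaa1:0001", "element 1 of set aaa1");
        table.put("aaa1:0002", "element 2 of set aaa1");
        table.put("aaa2:0001", "element 1 of set aaa2");

        // Only the two rows belonging to set "aaa1" fall inside the range.
        System.out.println(scanSet(table, "aaa1").keySet());
    }
}
```

Because the rows of a set sort next to each other under this assumed layout, one bounded scan reads the whole set without touching neighbouring sets.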

Re: Performance at large number of regions/node

2010-06-02 Thread Jacob Isaac
On Mon, May 31, 2010 at 8:37 AM, Stack wrote: > On Sun, May 30, 2010 at 9:22 AM, Jacob Isaac wrote: > > On Sun, May 30, 2010 at 7:04 AM, Stack wrote: > > Our writes and reads are pretty random (we rely on HBase handling the > > distribution) > > except that we read a set almost immediately aft

Re: Performance at large number of regions/node

2010-06-02 Thread Vidhyashankar Venkataraman
ataraman [mailto:vidhy...@yahoo-inc.com] > Sent: Tuesday, June 01, 2010 4:21 PM > To: user@hbase.apache.org > Subject: Re: Performance at large number of regions/node > > I have a related question: I tried a simple load experiment too using > HBase's Java API.. (The nodes do o

RE: Performance at large number of regions/node

2010-06-01 Thread Jonathan Gray
tion/split/flush improvements being worked on. JG > -Original Message- > From: Vidhyashankar Venkataraman [mailto:vidhy...@yahoo-inc.com] > Sent: Tuesday, June 01, 2010 4:21 PM > To: user@hbase.apache.org > Subject: Re: Performance at large number of regions/node > > I h

Re: Performance at large number of regions/node

2010-06-01 Thread Vidhyashankar Venkataraman
I have a related question: I tried a simple load experiment too using HBase's Java API.. (The nodes do only loading: nothing else.. The client programs generate random data on the fly to load.. So, no reads of the input data).. 120m rows 15KB each. 2 column families. 5 region servers, ran around

Re: Performance at large number of regions/node

2010-05-31 Thread Stack
On Sun, May 30, 2010 at 9:22 AM, Jacob Isaac wrote: > On Sun, May 30, 2010 at 7:04 AM, Stack wrote: > Our writes and reads are pretty random (we rely on HBase handling the > distribution) > except that we read a set almost immediately after it is written. > > Since our gets are for a set - we are s

Re: Performance at large number of regions/node

2010-05-30 Thread Jacob Isaac
On Sun, May 30, 2010 at 7:08 AM, Stack wrote: > On Sat, May 29, 2010 at 6:36 PM, Jacob Isaac wrote: > > The metrics from my run indicate that I achieve around > > for writes - > > around 1 row(5k) in 2ms => 500 rows(5K) in 1 sec => 2.5 MB/sec > > > > and from your observation at StumbleUpon

Re: Performance at large number of regions/node

2010-05-30 Thread Jacob Isaac
On Sun, May 30, 2010 at 7:04 AM, Stack wrote: > On Sat, May 29, 2010 at 5:52 PM, Jacob Isaac wrote: > > Wow !! That's almost twice the throughput I got with less than 1/4 the > > cluster size. > > > I'm just writing. > > That is true. And I hear reading is not as efficient as writing? > > The

Re: Performance at large number of regions/node

2010-05-30 Thread Stack
On Sat, May 29, 2010 at 6:36 PM, Jacob Isaac wrote: > The metrics from my run indicate that I achieve around > for writes - > around 1 row(5k) in 2ms => 500 rows(5K) in 1 sec => 2.5 MB/sec > > and from your observation at StumbleUpon > > 200k rows (presuming 100 bytes per row)/sec => 20MB/sec

Re: Performance at large number of regions/node

2010-05-30 Thread Stack
On Sat, May 29, 2010 at 5:52 PM, Jacob Isaac wrote: > Wow !! That's almost twice the throughput I got with less than 1/4 the > cluster size. > I'm just writing. > The general flow of the loading program is > > 1. Reading/processing data from source (a local file on the machine) > 2. Writing data

Re: Performance at large number of regions/node

2010-05-29 Thread Jacob Isaac
Hi J-D We have 8 drives (~500G per drive - total 4T) per machine The metrics from my run indicate that I achieve around for writes - around 1 row(5k) in 2ms => 500 rows(5K) in 1 sec => 2.5 MB/sec and from your observation at StumbleUpon 200k rows (presuming 100 bytes per row)/sec => 20MB/s
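
The arithmetic in this message can be checked with a small sketch. The figures (2 ms per 5k row, 200k rows/sec at a presumed 100 bytes per row) come from the message itself; treating "5k" as 5,000 bytes and "MB" as 1,000,000 bytes is an assumption, as are the class and method names.

```java
// Back-of-the-envelope throughput check for the figures quoted in the thread.
final class ThroughputEstimate {

    // Bytes written per second for a given per-row latency (ms) and row size (bytes).
    static double bytesPerSecond(double msPerRow, long rowBytes) {
        double rowsPerSecond = 1000.0 / msPerRow;
        return rowsPerSecond * rowBytes;
    }

    // Bytes written per second for a given row rate and row size (bytes).
    static double bytesPerSecond(long rowsPerSecond, long rowBytes) {
        return (double) rowsPerSecond * rowBytes;
    }

    public static void main(String[] args) {
        // 1 row (5k) in 2 ms => 500 rows/sec
        double observed = bytesPerSecond(2.0, 5_000L);
        // 200k rows/sec at a presumed 100 bytes per row
        double reference = bytesPerSecond(200_000L, 100L);

        System.out.printf("observed:  %.1f MB/sec%n", observed / 1_000_000.0);
        System.out.printf("reference: %.1f MB/sec%n", reference / 1_000_000.0);
    }
}
```

The comparison only holds if both numbers are measured the same way (per client versus cluster-wide), which the message does not state.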

Re: Performance at large number of regions/node

2010-05-29 Thread Jacob Isaac
Wow !! That's almost twice the throughput I got with less than 1/4 the cluster size. The general flow of the loading program is 1. Reading/processing data from source (a local file on the machine) 2. Writing data to HBase 3. Reading the data from HBase and processing it. Steps 1 and 2 happen on

Re: Performance at large number of regions/node

2010-05-29 Thread Stack
On Sat, May 29, 2010 at 10:53 AM, Stack wrote: > On Fri, May 28, 2010 at 4:11 PM, Jacob Isaac wrote: >> Here is the summary of the runs >> >> puts (~4-5k per row) >> regionsize #rows       Total time (ms) >> 1G 82282053*2      301943742 >> 512M 82287593*2      313119378 >> 256M 82246314*2      43

Re: Performance at large number of regions/node

2010-05-29 Thread Stack
On Fri, May 28, 2010 at 4:11 PM, Jacob Isaac wrote: > Here is the summary of the runs > > puts (~4-5k per row) > regionsize #rows       Total time (ms) > 1G 82282053*2      301943742 > 512M 82287593*2      313119378 > 256M 82246314*2      433200105 > So about 0.3ms per 5k write (presuming 100M wr

Re: Performance at large number of regions/node

2010-05-28 Thread Jean-Daniel Cryans
> What I wanted out of this discussion was to find out whether I am in the > ballpark of what I can juice out of HBase or I am way off the mark. > I understand... but this is a distributed system we're talking about. Unless I have the same code, hbase/hadoop version, configuration, number of nodes

Re: Performance at large number of regions/node

2010-05-28 Thread ja...@ebrary.com
Our data can be characterized as a list of sets and 1 row == element of a set. Our puts and gets work on a set at a time. Our sets typically range from 1~1000 elements and few can range from (1k-20k) elements. Can't guarantee it is a perfect codebase but do use HTablePool for reusing HTa

Re: Performance at large number of regions/node

2010-05-28 Thread Jean-Daniel Cryans
Looks like you spend 1/6 of your time doing the gets, good to know. For autoflush=false, if you fit the 4-5KB in a single Put, then it won't help as 1 put = 1 rpc. Batching them together almost always improves performance. The default buffer size is 2MB btw. LZO should give you another big boost,
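
To make the batching point concrete, here is a deliberately simplified model of a client-side write buffer. It is not HBase client code and makes no claim about the client's actual flush logic: it only counts round trips when every put is sent on its own versus when puts accumulate until a 2MB buffer is exceeded. The 2MB figure is from the message above; the class name, the flush rule, and the 5,000-byte row size are assumptions for illustration.

```java
// Simplified model: how many RPCs are needed with and without client-side batching.
final class WriteBufferModel {

    // 2MB, the default buffer size mentioned in the thread.
    static final long DEFAULT_BUFFER_BYTES = 2L * 1024 * 1024;

    // Counts round trips for 'puts' writes of 'putBytes' each.
    // autoFlush = true  : every put is sent immediately (1 put = 1 RPC).
    // autoFlush = false : puts accumulate and are sent once the buffer limit is exceeded.
    static long rpcCount(long puts, long putBytes, boolean autoFlush, long bufferBytes) {
        if (autoFlush) {
            return puts;
        }
        long rpcs = 0;
        long buffered = 0;
        for (long i = 0; i < puts; i++) {
            buffered += putBytes;
            if (buffered > bufferBytes) {
                rpcs++;        // flush the whole batch in one round trip
                buffered = 0;
            }
        }
        if (buffered > 0) {
            rpcs++;            // final flush of the partially filled buffer
        }
        return rpcs;
    }

    public static void main(String[] args) {
        long puts = 10_000;
        long putBytes = 5_000; // assumed size of one ~5k row
        System.out.println("autoflush on:  " + rpcCount(puts, putBytes, true, DEFAULT_BUFFER_BYTES));
        System.out.println("autoflush off: " + rpcCount(puts, putBytes, false, DEFAULT_BUFFER_BYTES));
    }
}
```

Under these assumptions 10,000 puts cost 10,000 round trips unbatched and 24 when batched, which is the kind of reduction the batching advice is pointing at.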

Re: Performance at large number of regions/node

2010-05-28 Thread Jacob Isaac
Here is the summary of the runs

puts (~4-5k per row)
regionsize  #rows         Total time (ms)
1G          82282053*2    301943742
512M        82287593*2    313119378
256M        82246314*2    433200105

gets (~4-5k per row)
regionsize  #rows         Total time (ms)
1G          82427685      90116726
512M        82421943      94878
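
A rough per-row figure can be derived from the put numbers above. This sketch assumes that "Total time (ms)" divided by the row count is a meaningful per-row latency, which may not hold if the total is summed across parallel clients; the class and method names are made up for the example.

```java
// Derives an average time per put from the run summary (puts only).
final class PerRowLatency {

    // Average milliseconds per row, given the total time and the number of rows written.
    static double msPerRow(long totalMs, long rows) {
        return (double) totalMs / rows;
    }

    public static void main(String[] args) {
        String[] regionSize = {"1G", "512M", "256M"};
        long[] rows = {82_282_053L * 2, 82_287_593L * 2, 82_246_314L * 2};
        long[] totalMs = {301_943_742L, 313_119_378L, 433_200_105L};

        for (int i = 0; i < regionSize.length; i++) {
            System.out.printf("%-5s %.2f ms/row%n", regionSize[i], msPerRow(totalMs[i], rows[i]));
        }
    }
}
```

On this reading the 1G and 512M runs land a little under 2 ms per row and the 256M run is noticeably slower.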

Re: Performance at large number of regions/node

2010-05-28 Thread Jacob Isaac
Vidhya - This is using HBase API. J-D - I do have timing info for inserts and gets - Let me process the data and will post the results. ~Jacob. On Fri, May 28, 2010 at 1:16 PM, Vidhyashankar Venkataraman < vidhy...@yahoo-inc.com> wrote: > Jacob, > Just curious: Is your observed upload throug

Re: Performance at large number of regions/node

2010-05-28 Thread Vidhyashankar Venkataraman
Jacob, Just curious: Is your observed upload throughput that of bulk importing or using the HBase API? Thanks Vidhya On 5/28/10 1:13 PM, "Jacob Isaac" wrote: Hi J-D The run was done on a reformatted HDFS. Disabling WAL is not an option for us because this will be our normal mode of operation

Re: Performance at large number of regions/node

2010-05-28 Thread Jean-Daniel Cryans
On Fri, May 28, 2010 at 1:13 PM, Jacob Isaac wrote: > Hi J-D > > hbase.regionserver.maxlogs was 256 although > hbase.regionserver.hlog.blocksize was the default. > > Did not use compression. And autoflush is default (true) You should, and if you are uploading in big batches then disable autoflush

Re: Performance at large number of regions/node

2010-05-28 Thread Jacob Isaac
Hi J-D The run was done on a reformatted HDFS. Disabling WAL is not an option for us because this will be our normal mode of operation and durability is important to us. It was a poor choice of words - 'upload' by me - it is more like periodic/continuous writes. hbase.regionserver.maxlogs was 256 altho

Re: Performance at large number of regions/node

2010-05-28 Thread Jean-Daniel Cryans
If the table was already created, changing hbase.hregion.max.filesize and hbase.hregion.memstore.flush.size won't be considered, those are the default values for new tables. You can set it in the shell too, see the "alter" command. Also, did you restart HBase? Did you push the configs to all nodes

Re: Performance at large number of regions/node

2010-05-28 Thread Jacob Isaac
Did a run yesterday, posted the relevant parameters below. Did not see any difference in throughput or total run time (~9 hrs). I am consistently getting about 5k rows/sec, each row around ~4-5k, using a 17 node HBase on a 20 node HDFS cluster. How does it compare?? Can I juice it more? ~Jacob

Re: Performance at large number of regions/node

2010-05-28 Thread Jean-Daniel Cryans
Like I said in my first email, it helps for random reading when lots of RAM is available to HBase. But it won't help the write throughput. J-D On Fri, May 28, 2010 at 10:12 AM, Vidhyashankar Venkataraman wrote: > I am not sure if I understood this right, but does changing > hfile.block.cache.si

Re: Performance at large number of regions/node

2010-05-28 Thread Vidhyashankar Venkataraman
I am not sure if I understood this right, but does changing hfile.block.cache.size also help? On 5/27/10 3:27 PM, "Jean-Daniel Cryans" wrote: Well we do have a couple of other configs for high write throughput: hbase.hstore.blockingStoreFiles 15 hbase.hregion.memstore.block.multiplie

Re: Performance at large number of regions/node

2010-05-27 Thread Jean-Daniel Cryans
Well we do have a couple of other configs for high write throughput: hbase.hstore.blockingStoreFiles 15 hbase.hregion.memstore.block.multiplier 8 hbase.regionserver.handler.count 60 hbase.regions.percheckin 100 The last one is for restarts. Uploading very fast, you will mo
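
For anyone wanting to try the four settings listed in this message, here is how they would look as an hbase-site.xml fragment. The property names and values are exactly the ones quoted above; placing them in hbase-site.xml on every node and restarting is the usual way to apply site configuration, but whether these values suit a given cluster is not something this fragment can tell you.

```xml
<!-- Values as quoted in the message above; tune for your own cluster. -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>15</value>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>8</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value>
</property>
<property>
  <!-- Per the message, this one is for restarts. -->
  <name>hbase.regions.percheckin</name>
  <value>100</value>
</property>
```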

Re: Performance at large number of regions/node

2010-05-27 Thread Jacob Isaac
Thanks J-D Currently we are trying to find/optimize our load/write times - although in prod we expect it to be 25/75 (writes/reads) ratio. We are using long table model with only one column - row-size is typically ~ 4-5k As to your suggestion on not using even 50% of disk space - I agree and was

Re: Performance at large number of regions/node

2010-05-27 Thread Jean-Daniel Cryans
With beefy nodes, don't be afraid of using bigger regions... and LZO. At StumbleUpon we have 1GB maxfilesize on our >13B rows table and LZO enabled on every table. The number of regions per node is a factor of so many things... size of rows, access pattern, hardware, etc. FWIW, I would say that you

Performance at large number of regions/node

2010-05-27 Thread Jacob Isaac
Hi Wanted to find the group's experience on HBase performance with increasing number of regions/node. Also wanted to find out if there is an optimal number of regions one should aim for? We are currently using 17 node HBase(0.20.4) cluster on a 20 node Hadoop(0.20.2) cluster 16G RAM per node, 4