Re: PreSplit the table with Long format

2013-02-19 Thread Farrokh Shahriari
Hello again,

Does anyone know how I can do this?
The problem is:
When you insert something from the shell, it assumes it's a string, does a
Bytes.toBytes conversion on the string, and stores that in HBase.
So how can I tell the shell that the value I'm entering isn't a string? How
can I put a long value into HBase through the shell?

For background: I want to pre-split my table. I can't do it through Java
code, because I've installed a security library on HBase with which I can
create encrypted tables. It adds a "securecreate" command to the shell for
creating encrypted tables, but encrypted tables can't be created from Java
code. So I'm forced to use the shell to create the table, and I want to
pre-split it with long values, because my row keys are longs.

Please help, I really need this.
Thanks

On Tue, Feb 19, 2013 at 2:12 PM, Farrokh Shahriari <
mohandes.zebeleh...@gmail.com> wrote:

> Thanks for your help, but it doesn't work. Do you have any other ideas? I
> must run it from the shell.
>
> Farrokh
>
>
>
> On Tue, Feb 19, 2013 at 1:30 PM, Viral Bajaria wrote:
>
>> The HBase shell is a JRuby shell, so you can invoke any Java commands from
>> it.
>>
>> For example:
>> > import org.apache.hadoop.hbase.util.Bytes
>> > Bytes.toLong(Bytes.toBytes(1000))
>>
>> Not sure if this works as expected since I don't have a terminal in front
>> of me but you could try (assuming the SPLITS keyword takes byte array as
>> input, never used SPLITS from the command line):
>> create 'testTable', 'cf1' , { SPLITS => [ Bytes.toBytes(1000),
>> Bytes.toBytes(2000), Bytes.toBytes(3000) ] }
>>
>> Thanks,
>> Viral
>>
>> On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari <
>> mohandes.zebeleh...@gmail.com> wrote:
>>
>> > Hi there
>> > Since my row keys are in long format, I must pre-split the table in long
>> > format too. But when I run this command, it pre-splits the table with
>> > STRING keys:
>> > create 'testTable','cf1',{SPLITS => [ '1000','2000','3000']}
>> >
>> > How can I presplit the table with Long format ?
>> >
>> > Farrokh
>> >
>>
>
>
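
A minimal sketch of the shell-side piece being discussed above, assuming a
JRuby-based HBase shell as Viral describes. Forcing the long overload with
to_java(:long) gives the 8-byte encoding rather than the 4-byte int encoding;
whether the create (or securecreate) command accepts raw byte arrays for
SPLITS is exactly the open question in this thread, so treat this as an
experiment, not a confirmed recipe:

# Inside the HBase shell (a JRuby REPL); sketch only.
import org.apache.hadoop.hbase.util.Bytes

# Build 8-byte long-encoded split keys rather than string keys.
long_splits = [1000, 2000, 3000].map { |n| Bytes.toBytes(n.to_java(:long)) }
puts long_splits.first.length   # sanity check: should print 8

# Whether SPLITS accepts these byte arrays depends on the shell/library version.
create 'testTable', 'cf1', { SPLITS => long_splits }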


Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread lars hofhansl
Time permitting, I will do that tomorrow.





 From: Andrew Purtell 
To: "user@hbase.apache.org"  
Sent: Tuesday, February 19, 2013 6:58 PM
Subject: Re: availability of 0.94.4 and 0.94.5 in maven repo?
 
Same here, just tripped over this moments ago.


On Tue, Feb 19, 2013 at 5:30 PM, James Taylor wrote:

> Unless I'm doing something wrong, it looks like the Maven repository (
> http://mvnrepository.com/artifact/org.apache.hbase/hbase)
> only contains HBase up to 0.94.3. Is there a different repo I should use,
> or if not, any ETA on when it'll be updated?
>
>     James
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: [resend] region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread ramkrishna vasudevan
Ideally the -ROOT- table should be reassigned once the RS carrying -ROOT- goes
down.  This should happen automatically.

What do your logs say?  That would give us some insight.

Before that, restarting your master may solve this problem.  Even then, if it
persists, try deleting the ZK data and restarting the cluster.
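
A sketch of those recovery steps for a 0.92-era install; the paths and the
zkcli helper are assumptions, and clearing the /hbase znodes removes only
transient assignment state (not table data) but should only be done with the
cluster stopped:

bin/hbase-daemon.sh restart master

# If -ROOT- is still unassigned after the master restart:
bin/stop-hbase.sh
bin/hbase zkcli          # opens a ZooKeeper shell against the HBase quorum
  rmr /hbase             # default zookeeper.znode.parent
  quit
bin/start-hbase.sh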

Regards
Ram

On Wed, Feb 20, 2013 at 11:06 AM, Lu, Wei  wrote:

> Hi, all,
>
>
>
> When I scan any table, I got:
>
>
>
> 13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
> be reached after 1 tries, giving up.
>
> ...
>
> ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> after attempts=7, exceptions:
>
> ...
>
>
>
> What I observe:
>
>
>
> 1)  -ROOT- table is on Region Server rs1
>
>
>
> Table Regions
>
> Name    Region Server   Start Key   End Key   Requests
> -ROOT-  Rs1:60020       -           -
>
>
>
>
>
> 2)  But the region server rs1 is dead
>
>
>
> Dead Region Servers
>
>   ServerName
>
>   Rs4,60020,1361109702535
>
>   Rs1,60020,1361109710150
>
> Total: servers: 2
>
>
>
>
>
> 
>
> Does it mean that the region server holding the -ROOT- table is dead, but
> the -ROOT- region is not reassigned to any other region servers?
>
> Why?
>
>
>
> By the way, the hbase version I am using is 0.92.1-cdh4.0.1
>
>
>
> Thanks,
>
> Wei
>
>


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
You mean slop is also computed per table?
Weird, then it should work for my case; let me check again.

Best Regards,
Raymond Liu

> 
> bq. On a 3000 region cluster
> 
> Balancing is per-table. Meaning total number of regions doesn't come into 
> play.
> 
> On Tue, Feb 19, 2013 at 7:55 PM, Liu, Raymond 
> wrote:
> 
> > Hmm, in order to have the 96 region table be balanced within 20% On a
> > 3000 region cluster when all other table is balanced.
> >
> > the slop will need to be around 20%/30, say 0.006? won't it be too small?
> >
> > >
> > > Yes, Raymond.
> > > You should lower sloppiness.
> > >
> > > On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond
> > > 
> > > wrote:
> > >
> > > > I mean region number is small.
> > > >
> > > > Overall I have say 3000 region on 4 node, while this table only
> > > > have
> > > > 96 region. It won't be 24 for each region server, instead , will
> > > > be something like 19/30/23/21 etc.
> > > >
> > > > This means that I need to limit the slop to 0.02 etc? so that the
> > > > balancer actually run on this table?
> > > >
> > > > Best Regards,
> > > > Raymond Liu
> > > >
> > > > From: Marcos Ortiz [mailto:mlor...@uci.cu]
> > > > Sent: Wednesday, February 20, 2013 11:44 AM
> > > > To: user@hbase.apache.org
> > > > Cc: Liu, Raymond
> > > > Subject: Re: Is there any way to balance one table?
> > > >
> > > > What is the size of your table?
> > > > On 02/19/2013 10:40 PM, Liu, Raymond wrote:
> > > > Hi
> > > >
> > > > I do call balancer, while it seems it doesn't work. Might due to
> > > > this table is small and overall region number difference is within
> > threshold?
> > > >
> > > > -Original Message-
> > > > From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> > > > Sent: Wednesday, February 20, 2013 10:59 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Is there any way to balance one table?
> > > >
> > > > Hi Liu,
> > > >
> > > > Why did not you simply called the balancer? If other tables are
> > > > already balanced, it should not touch them and will only balance
> > > > the table which is not balancer?
> > > >
> > > > JM
> > > >
> > > > 2013/2/19, Liu, Raymond :
> > > > I choose to move region manually. Any other approaching?
> > > >
> > > >
> > > > 0.94.1
> > > >
> > > > Any cmd in shell? Or I need to change balance threshold to 0 an
> > > > run global balancer cmd in shell?
> > > >
> > > >
> > > >
> > > > Best Regards,
> > > > Raymond Liu
> > > >
> > > > -Original Message-
> > > > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > > > Sent: Wednesday, February 20, 2013 9:09 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Is there any way to balance one table?
> > > >
> > > > What version of HBase are you using ?
> > > >
> > > > 0.94 has per-table load balancing.
> > > >
> > > > Cheers
> > > >
> > > > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
> > > > 
> > > > wrote:
> > > >
> > > > Hi
> > > >
> > > > Is there any way to balance just one table? I found one of my
> > > > table is not balanced, while all the other table is balanced. So I
> > > > want to fix this table.
> > > >
> > > > Best Regards,
> > > > Raymond Liu
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Marcos Ortiz Valmaseda,
> > > > Product Manager && Data Scientist at UCI
> > > > Blog: http://marcosluis2186.posterous.com
> > > > Twitter: @marcosluis2186
> > > >
> >


[resend] region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread Lu, Wei
Hi, all,



When I scan any table, I got:



13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
be reached after 1 tries, giving up.

...

ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=7, exceptions:

...



What I observe:



1)  -ROOT- table is on Region Server rs1



Table Regions

Name    Region Server   Start Key   End Key   Requests

-ROOT-  Rs1:60020   -   -





2)  But the region server rs1 is dead



Dead Region Servers

  ServerName

  Rs4,60020,1361109702535

  Rs1,60020,1361109710150

Total: servers: 2







Does it mean that the region server holding the -ROOT- table is dead, but the 
-ROOT- region is not reassigned to any other region servers?

Why?



By the way, the hbase version I am using is 0.92.1-cdh4.0.1



Thanks,

Wei



RE: region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread Lu, Wei
By the way, the hbase version I am using is 0.92.1-cdh4.0.1

From: Lu, Wei
Sent: Wednesday, February 20, 2013 1:28 PM
To: user@hbase.apache.org
Subject: region server of -ROOT- table is dead, but not reassigned

Hi, all,

When I scan any table, I got:

13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
be reached after 1 tries, giving up.
...
ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=7, exceptions:
...

What I observe:

1)  -ROOT- table is on Region Server rs1

Table Regions
Name    Region Server   Start Key   End Key   Requests
-ROOT-  Rs1:60020       -           -



2)  But the region server rs1 is dead

Dead Region Servers
ServerName
Rs4,60020,1361109702535
Rs1,60020,1361109710150
Total: servers: 2






Does it mean that the region server holding the -ROOT- table is dead, but the 
-ROOT- region is not reassigned to any other region servers?
Why?

Thanks,
Wei








region server of -ROOT- table is dead, but not reassigned

2013-02-19 Thread Lu, Wei
Hi, all,

When I scan any table, I got:

13/02/20 05:16:45 INFO ipc.HBaseRPC: Server at Rs1/10.20.118.3:60020 could not 
be reached after 1 tries, giving up.
...
ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=7, exceptions:
...

What I observe:

1)  -ROOT- table is on Region Server rs1

Table Regions
Name    Region Server   Start Key   End Key   Requests
-ROOT-  Rs1:60020       -           -



2)  But the region server rs1 is dead

Dead Region Servers
ServerName
Rs4,60020,1361109702535
Rs1,60020,1361109710150
Total: servers: 2






Does it mean that the region server holding the -ROOT- table is dead, but the 
-ROOT- region is not reassigned to any other region servers?
Why?

Thanks,
Wei








Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
bq. On a 3000 region cluster

Balancing is per-table. Meaning total number of regions doesn't come into
play.

On Tue, Feb 19, 2013 at 7:55 PM, Liu, Raymond  wrote:

> Hmm, in order to have the 96 region table be balanced within 20% On a 3000
> region cluster when all other table is balanced.
>
> the slop will need to be around 20%/30, say 0.006? won't it be too small?
>
> >
> > Yes, Raymond.
> > You should lower sloppiness.
> >
> > On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond 
> > wrote:
> >
> > > I mean region number is small.
> > >
> > > Overall I have say 3000 region on 4 node, while this table only have
> > > 96 region. It won't be 24 for each region server, instead , will be
> > > something like 19/30/23/21 etc.
> > >
> > > This means that I need to limit the slop to 0.02 etc? so that the
> > > balancer actually run on this table?
> > >
> > > Best Regards,
> > > Raymond Liu
> > >
> > > From: Marcos Ortiz [mailto:mlor...@uci.cu]
> > > Sent: Wednesday, February 20, 2013 11:44 AM
> > > To: user@hbase.apache.org
> > > Cc: Liu, Raymond
> > > Subject: Re: Is there any way to balance one table?
> > >
> > > What is the size of your table?
> > > On 02/19/2013 10:40 PM, Liu, Raymond wrote:
> > > Hi
> > >
> > > I do call balancer, while it seems it doesn't work. Might due to this
> > > table is small and overall region number difference is within
> threshold?
> > >
> > > -Original Message-
> > > From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> > > Sent: Wednesday, February 20, 2013 10:59 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: Is there any way to balance one table?
> > >
> > > Hi Liu,
> > >
> > > Why did not you simply called the balancer? If other tables are
> > > already balanced, it should not touch them and will only balance the
> > > table which is not balancer?
> > >
> > > JM
> > >
> > > 2013/2/19, Liu, Raymond :
> > > I choose to move region manually. Any other approaching?
> > >
> > >
> > > 0.94.1
> > >
> > > Any cmd in shell? Or I need to change balance threshold to 0 an run
> > > global balancer cmd in shell?
> > >
> > >
> > >
> > > Best Regards,
> > > Raymond Liu
> > >
> > > -Original Message-
> > > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > > Sent: Wednesday, February 20, 2013 9:09 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: Is there any way to balance one table?
> > >
> > > What version of HBase are you using ?
> > >
> > > 0.94 has per-table load balancing.
> > >
> > > Cheers
> > >
> > > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond 
> > > wrote:
> > >
> > > Hi
> > >
> > > Is there any way to balance just one table? I found one of my table is
> > > not balanced, while all the other table is balanced. So I want to fix
> > > this table.
> > >
> > > Best Regards,
> > > Raymond Liu
> > >
> > >
> > >
> > >
> > > --
> > > Marcos Ortiz Valmaseda,
> > > Product Manager && Data Scientist at UCI
> > > Blog: http://marcosluis2186.posterous.com
> > > Twitter: @marcosluis2186
> > >
>


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Hmm, in order to have the 96-region table balanced within 20% on a 3000-region
cluster when all the other tables are balanced,

the slop would need to be around 20%/30, say 0.006? Won't that be too small?

> 
> Yes, Raymond.
> You should lower sloppiness.
> 
> On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond 
> wrote:
> 
> > I mean region number is small.
> >
> > Overall I have say 3000 region on 4 node, while this table only have
> > 96 region. It won't be 24 for each region server, instead , will be
> > something like 19/30/23/21 etc.
> >
> > This means that I need to limit the slop to 0.02 etc? so that the
> > balancer actually run on this table?
> >
> > Best Regards,
> > Raymond Liu
> >
> > From: Marcos Ortiz [mailto:mlor...@uci.cu]
> > Sent: Wednesday, February 20, 2013 11:44 AM
> > To: user@hbase.apache.org
> > Cc: Liu, Raymond
> > Subject: Re: Is there any way to balance one table?
> >
> > What is the size of your table?
> > On 02/19/2013 10:40 PM, Liu, Raymond wrote:
> > Hi
> >
> > I do call balancer, while it seems it doesn't work. Might due to this
> > table is small and overall region number difference is within threshold?
> >
> > -Original Message-
> > From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> > Sent: Wednesday, February 20, 2013 10:59 AM
> > To: user@hbase.apache.org
> > Subject: Re: Is there any way to balance one table?
> >
> > Hi Liu,
> >
> > Why did not you simply called the balancer? If other tables are
> > already balanced, it should not touch them and will only balance the
> > table which is not balancer?
> >
> > JM
> >
> > 2013/2/19, Liu, Raymond :
> > I choose to move region manually. Any other approaching?
> >
> >
> > 0.94.1
> >
> > Any cmd in shell? Or I need to change balance threshold to 0 an run
> > global balancer cmd in shell?
> >
> >
> >
> > Best Regards,
> > Raymond Liu
> >
> > -Original Message-
> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > Sent: Wednesday, February 20, 2013 9:09 AM
> > To: user@hbase.apache.org
> > Subject: Re: Is there any way to balance one table?
> >
> > What version of HBase are you using ?
> >
> > 0.94 has per-table load balancing.
> >
> > Cheers
> >
> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond 
> > wrote:
> >
> > Hi
> >
> > Is there any way to balance just one table? I found one of my table is
> > not balanced, while all the other table is balanced. So I want to fix
> > this table.
> >
> > Best Regards,
> > Raymond Liu
> >
> >
> >
> >
> > --
> > Marcos Ortiz Valmaseda,
> > Product Manager && Data Scientist at UCI
> > Blog: http://marcosluis2186.posterous.com
> > Twitter: @marcosluis2186
> >


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
Yes, Raymond.
You should lower sloppiness.

On Tue, Feb 19, 2013 at 7:48 PM, Liu, Raymond  wrote:

> I mean region number is small.
>
> Overall I have say 3000 region on 4 node, while this table only have 96
> region. It won't be 24 for each region server, instead , will be something
> like 19/30/23/21 etc.
>
> This means that I need to limit the slop to 0.02 etc? so that the balancer
> actually run on this table?
>
> Best Regards,
> Raymond Liu
>
> From: Marcos Ortiz [mailto:mlor...@uci.cu]
> Sent: Wednesday, February 20, 2013 11:44 AM
> To: user@hbase.apache.org
> Cc: Liu, Raymond
> Subject: Re: Is there any way to balance one table?
>
> What is the size of your table?
> On 02/19/2013 10:40 PM, Liu, Raymond wrote:
> Hi
>
> I do call balancer, while it seems it doesn't work. Might due to this
> table is small and overall region number difference is within threshold?
>
> -Original Message-
> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> Sent: Wednesday, February 20, 2013 10:59 AM
> To: user@hbase.apache.org
> Subject: Re: Is there any way to balance one table?
>
> Hi Liu,
>
> Why did not you simply called the balancer? If other tables are already
> balanced, it should not touch them and will only balance the table which
> is not
> balancer?
>
> JM
>
> 2013/2/19, Liu, Raymond :
> I choose to move region manually. Any other approaching?
>
>
> 0.94.1
>
> Any cmd in shell? Or I need to change balance threshold to 0 an run
> global balancer cmd in shell?
>
>
>
> Best Regards,
> Raymond Liu
>
> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: Wednesday, February 20, 2013 9:09 AM
> To: user@hbase.apache.org
> Subject: Re: Is there any way to balance one table?
>
> What version of HBase are you using ?
>
> 0.94 has per-table load balancing.
>
> Cheers
>
> On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
> 
> wrote:
>
> Hi
>
> Is there any way to balance just one table? I found one of my
> table is not balanced, while all the other table is balanced. So
> I want to fix this table.
>
> Best Regards,
> Raymond Liu
>
>
>
>
> --
> Marcos Ortiz Valmaseda,
> Product Manager && Data Scientist at UCI
> Blog: http://marcosluis2186.posterous.com
> Twitter: @marcosluis2186
>


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Yeah. Since balancing is already done per table, why isn't the slop also
calculated per table...

> 
> You're right. Default sloppiness is 20%:
> this.slop = conf.getFloat("hbase.regions.slop", (float) 0.2);
> src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java
> 
> Meaning, region count on any server can be as far as 20% from average region
> count.
> 
> You can tighten sloppiness.
> 
> On Tue, Feb 19, 2013 at 7:40 PM, Liu, Raymond 
> wrote:
> 
> > Hi
> >
> > I do call balancer, while it seems it doesn't work. Might due to this
> > table is small and overall region number difference is within threshold?
> >
> > > -Original Message-
> > > From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> > > Sent: Wednesday, February 20, 2013 10:59 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: Is there any way to balance one table?
> > >
> > > Hi Liu,
> > >
> > > Why did not you simply called the balancer? If other tables are
> > > already balanced, it should not touch them and will only balance the
> > > table which
> > is not
> > > balancer?
> > >
> > > JM
> > >
> > > 2013/2/19, Liu, Raymond :
> > > > I choose to move region manually. Any other approaching?
> > > >
> > > >>
> > > >> 0.94.1
> > > >>
> > > >> Any cmd in shell? Or I need to change balance threshold to 0 an
> > > >> run global balancer cmd in shell?
> > > >>
> > > >>
> > > >>
> > > >> Best Regards,
> > > >> Raymond Liu
> > > >>
> > > >> > -Original Message-
> > > >> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > > >> > Sent: Wednesday, February 20, 2013 9:09 AM
> > > >> > To: user@hbase.apache.org
> > > >> > Subject: Re: Is there any way to balance one table?
> > > >> >
> > > >> > What version of HBase are you using ?
> > > >> >
> > > >> > 0.94 has per-table load balancing.
> > > >> >
> > > >> > Cheers
> > > >> >
> > > >> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
> > > >> > 
> > > >> > wrote:
> > > >> >
> > > >> > > Hi
> > > >> > >
> > > >> > > Is there any way to balance just one table? I found one of my
> > > >> > > table is not balanced, while all the other table is balanced.
> > > >> > > So I want to fix this table.
> > > >> > >
> > > >> > > Best Regards,
> > > >> > > Raymond Liu
> > > >> > >
> > > >> > >
> > > >
> >


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
I mean the region number is small.

Overall I have, say, 3000 regions on 4 nodes, while this table only has 96
regions. It won't be 24 per region server; instead, it will be something like
19/30/23/21 etc.

Does this mean I need to limit the slop to something like 0.02 so that the
balancer actually runs on this table?

Best Regards,
Raymond Liu

From: Marcos Ortiz [mailto:mlor...@uci.cu] 
Sent: Wednesday, February 20, 2013 11:44 AM
To: user@hbase.apache.org
Cc: Liu, Raymond
Subject: Re: Is there any way to balance one table?

What is the size of your table?
On 02/19/2013 10:40 PM, Liu, Raymond wrote:
Hi

I do call balancer, while it seems it doesn't work. Might due to this table is 
small and overall region number difference is within threshold?

-Original Message-
From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
Sent: Wednesday, February 20, 2013 10:59 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

Hi Liu,

Why did not you simply called the balancer? If other tables are already
balanced, it should not touch them and will only balance the table which is not
balancer?

JM

2013/2/19, Liu, Raymond :
I choose to move region manually. Any other approaching?


0.94.1

Any cmd in shell? Or I need to change balance threshold to 0 an run
global balancer cmd in shell?



Best Regards,
Raymond Liu

-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 20, 2013 9:09 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

What version of HBase are you using ?

0.94 has per-table load balancing.

Cheers

On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond

wrote:

Hi

Is there any way to balance just one table? I found one of my
table is not balanced, while all the other table is balanced. So
I want to fix this table.

Best Regards,
Raymond Liu




-- 
Marcos Ortiz Valmaseda, 
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
You're right. Default sloppiness is 20%:
this.slop = conf.getFloat("hbase.regions.slop", (float) 0.2);
src/main/java/org/apache/hadoop/hbase/master/DefaultLoadBalancer.java

Meaning, region count on any server can be as far as 20% from average
region count.

You can tighten sloppiness.
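
A minimal hbase-site.xml sketch of tightening it (0.02 is just an illustrative
value; the master typically needs a restart to pick it up):

<property>
  <name>hbase.regions.slop</name>
  <value>0.02</value>
</property>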

On Tue, Feb 19, 2013 at 7:40 PM, Liu, Raymond  wrote:

> Hi
>
> I do call balancer, while it seems it doesn't work. Might due to this
> table is small and overall region number difference is within threshold?
>
> > -Original Message-
> > From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> > Sent: Wednesday, February 20, 2013 10:59 AM
> > To: user@hbase.apache.org
> > Subject: Re: Is there any way to balance one table?
> >
> > Hi Liu,
> >
> > Why did not you simply called the balancer? If other tables are already
> > balanced, it should not touch them and will only balance the table which
> is not
> > balancer?
> >
> > JM
> >
> > 2013/2/19, Liu, Raymond :
> > > I choose to move region manually. Any other approaching?
> > >
> > >>
> > >> 0.94.1
> > >>
> > >> Any cmd in shell? Or I need to change balance threshold to 0 an run
> > >> global balancer cmd in shell?
> > >>
> > >>
> > >>
> > >> Best Regards,
> > >> Raymond Liu
> > >>
> > >> > -Original Message-
> > >> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > >> > Sent: Wednesday, February 20, 2013 9:09 AM
> > >> > To: user@hbase.apache.org
> > >> > Subject: Re: Is there any way to balance one table?
> > >> >
> > >> > What version of HBase are you using ?
> > >> >
> > >> > 0.94 has per-table load balancing.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
> > >> > 
> > >> > wrote:
> > >> >
> > >> > > Hi
> > >> > >
> > >> > > Is there any way to balance just one table? I found one of my
> > >> > > table is not balanced, while all the other table is balanced. So
> > >> > > I want to fix this table.
> > >> > >
> > >> > > Best Regards,
> > >> > > Raymond Liu
> > >> > >
> > >> > >
> > >
>


Re: Is there any way to balance one table?

2013-02-19 Thread Marcos Ortiz

What is the size of your table?

On 02/19/2013 10:40 PM, Liu, Raymond wrote:

Hi

I do call balancer, while it seems it doesn't work. Might due to this table is 
small and overall region number difference is within threshold?


-Original Message-
From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
Sent: Wednesday, February 20, 2013 10:59 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

Hi Liu,

Why did not you simply called the balancer? If other tables are already
balanced, it should not touch them and will only balance the table which is not
balancer?

JM

2013/2/19, Liu, Raymond :

I choose to move region manually. Any other approaching?


0.94.1

Any cmd in shell? Or I need to change balance threshold to 0 an run
global balancer cmd in shell?



Best Regards,
Raymond Liu


-Original Message-
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, February 20, 2013 9:09 AM
To: user@hbase.apache.org
Subject: Re: Is there any way to balance one table?

What version of HBase are you using ?

0.94 has per-table load balancing.

Cheers

On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond

wrote:


Hi

Is there any way to balance just one table? I found one of my
table is not balanced, while all the other table is balanced. So
I want to fix this table.

Best Regards,
Raymond Liu




--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Hi

I do call the balancer, but it doesn't seem to work. Might that be because this
table is small and the overall region-count difference is within the threshold?

> -Original Message-
> From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org]
> Sent: Wednesday, February 20, 2013 10:59 AM
> To: user@hbase.apache.org
> Subject: Re: Is there any way to balance one table?
> 
> Hi Liu,
> 
> Why did not you simply called the balancer? If other tables are already
> balanced, it should not touch them and will only balance the table which is 
> not
> balancer?
> 
> JM
> 
> 2013/2/19, Liu, Raymond :
> > I choose to move region manually. Any other approaching?
> >
> >>
> >> 0.94.1
> >>
> >> Any cmd in shell? Or I need to change balance threshold to 0 an run
> >> global balancer cmd in shell?
> >>
> >>
> >>
> >> Best Regards,
> >> Raymond Liu
> >>
> >> > -Original Message-
> >> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> >> > Sent: Wednesday, February 20, 2013 9:09 AM
> >> > To: user@hbase.apache.org
> >> > Subject: Re: Is there any way to balance one table?
> >> >
> >> > What version of HBase are you using ?
> >> >
> >> > 0.94 has per-table load balancing.
> >> >
> >> > Cheers
> >> >
> >> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond
> >> > 
> >> > wrote:
> >> >
> >> > > Hi
> >> > >
> >> > > Is there any way to balance just one table? I found one of my
> >> > > table is not balanced, while all the other table is balanced. So
> >> > > I want to fix this table.
> >> > >
> >> > > Best Regards,
> >> > > Raymond Liu
> >> > >
> >> > >
> >


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
HBASE-3373 introduced "hbase.master.loadbalance.bytable" which defaults to
true.

This means that when you issue the 'balancer' command in the shell, the table
should be balanced for you.
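
For example, a quick shell session sketch (output omitted; return values are
illustrative):

hbase(main):001:0> balance_switch true    # make sure the balancer is enabled
hbase(main):002:0> balancer               # true means a balancing run was triggered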

Cheers

On Tue, Feb 19, 2013 at 5:16 PM, Liu, Raymond  wrote:

> 0.94.1
>
> Any cmd in shell? Or I need to change balance threshold to 0 an run global
> balancer cmd in shell?
>
>
>
> Best Regards,
> Raymond Liu
>
> > -Original Message-
> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > Sent: Wednesday, February 20, 2013 9:09 AM
> > To: user@hbase.apache.org
> > Subject: Re: Is there any way to balance one table?
> >
> > What version of HBase are you using ?
> >
> > 0.94 has per-table load balancing.
> >
> > Cheers
> >
> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond 
> > wrote:
> >
> > > Hi
> > >
> > > Is there any way to balance just one table? I found one of my table is
> > > not balanced, while all the other table is balanced. So I want to fix
> > > this table.
> > >
> > > Best Regards,
> > > Raymond Liu
> > >
> > >
>


Problem In Understanding Compaction Process

2013-02-19 Thread Anty
Hi guys,

  I have some trouble understanding the compaction process. Can someone
shed some light on it? Much appreciated. Here is the problem:

  After the region server successfully generates the final compacted file,
it goes through two steps:
   1. move the compacted file into the region's directory
   2. delete the replaced files.

   These two steps are not atomic. If the region server crashes after
step 1 and before step 2, then there are duplicate records! Is this handled
in the read path, or is there another mechanism to fix it?

-- 
Best Regards
Anty Rao


Re: Is there any way to balance one table?

2013-02-19 Thread Jean-Marc Spaggiari
Hi Liu,

Why did you not simply call the balancer? If the other tables are
already balanced, it should not touch them and will only balance the
table which is not balanced.

JM

2013/2/19, Liu, Raymond :
> I choose to move region manually. Any other approaching?
>
>>
>> 0.94.1
>>
>> Any cmd in shell? Or I need to change balance threshold to 0 an run
>> global
>> balancer cmd in shell?
>>
>>
>>
>> Best Regards,
>> Raymond Liu
>>
>> > -Original Message-
>> > From: Ted Yu [mailto:yuzhih...@gmail.com]
>> > Sent: Wednesday, February 20, 2013 9:09 AM
>> > To: user@hbase.apache.org
>> > Subject: Re: Is there any way to balance one table?
>> >
>> > What version of HBase are you using ?
>> >
>> > 0.94 has per-table load balancing.
>> >
>> > Cheers
>> >
>> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond 
>> > wrote:
>> >
>> > > Hi
>> > >
>> > > Is there any way to balance just one table? I found one of my table
>> > > is not balanced, while all the other table is balanced. So I want to
>> > > fix this table.
>> > >
>> > > Best Regards,
>> > > Raymond Liu
>> > >
>> > >
>


Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread Andrew Purtell
Same here, just tripped over this moments ago.


On Tue, Feb 19, 2013 at 5:30 PM, James Taylor wrote:

> Unless I'm doing something wrong, it looks like the Maven repository (
> http://mvnrepository.com/artifact/org.apache.hbase/hbase)
> only contains HBase up to 0.94.3. Is there a different repo I should use,
> or if not, any ETA on when it'll be updated?
>
> James
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
I chose to move regions manually. Are there any other approaches?

> 
> 0.94.1
> 
> Any cmd in shell? Or I need to change balance threshold to 0 an run global
> balancer cmd in shell?
> 
> 
> 
> Best Regards,
> Raymond Liu
> 
> > -Original Message-
> > From: Ted Yu [mailto:yuzhih...@gmail.com]
> > Sent: Wednesday, February 20, 2013 9:09 AM
> > To: user@hbase.apache.org
> > Subject: Re: Is there any way to balance one table?
> >
> > What version of HBase are you using ?
> >
> > 0.94 has per-table load balancing.
> >
> > Cheers
> >
> > On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond 
> > wrote:
> >
> > > Hi
> > >
> > > Is there any way to balance just one table? I found one of my table
> > > is not balanced, while all the other table is balanced. So I want to
> > > fix this table.
> > >
> > > Best Regards,
> > > Raymond Liu
> > >
> > >


Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread Joarder KAMAL
I also ran into the same issue a day ago while building the YCSB HBase
client for HBase 0.94.5. Later I used the 0.94.3 version to carry on my
work for the time being.

Regards,
Joarder Kamal


On 20 February 2013 12:32, Viral Bajaria  wrote:

> I have come across this too, I think someone with authorization needs to
> perform a maven release to the apache maven repository and/or maven
> central.
>
> For now, I just end up compiling the dot release from trunk and deploy it
> to my local repository for other projects to use.
>
> Thanks,
> Viral
>
> On Tue, Feb 19, 2013 at 5:30 PM, James Taylor  >wrote:
>
> > Unless I'm doing something wrong, it looks like the Maven repository (
> > http://mvnrepository.com/artifact/org.apache.hbase/hbase)
> > only contains HBase up to 0.94.3. Is there a different repo I should use,
> > or if not, any ETA on when it'll be updated?
> >
> > James
> >
> >
>


Re: availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread Viral Bajaria
I have come across this too, I think someone with authorization needs to
perform a maven release to the apache maven repository and/or maven central.

For now, I just end up compiling the dot release from trunk and deploy it
to my local repository for other projects to use.

Thanks,
Viral

On Tue, Feb 19, 2013 at 5:30 PM, James Taylor wrote:

> Unless I'm doing something wrong, it looks like the Maven repository (
> http://mvnrepository.com/artifact/org.apache.hbase/hbase)
> only contains HBase up to 0.94.3. Is there a different repo I should use,
> or if not, any ETA on when it'll be updated?
>
> James
>
>


availability of 0.94.4 and 0.94.5 in maven repo?

2013-02-19 Thread James Taylor
Unless I'm doing something wrong, it looks like the Maven repository 
(http://mvnrepository.com/artifact/org.apache.hbase/hbase) only contains 
HBase up to 0.94.3. Is there a different repo I should use, or if not, 
any ETA on when it'll be updated?


James



RE: Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
0.94.1

Is there a shell command for that? Or do I need to change the balance threshold
to 0 and run the global balancer command in the shell?



Best Regards,
Raymond Liu

> -Original Message-
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: Wednesday, February 20, 2013 9:09 AM
> To: user@hbase.apache.org
> Subject: Re: Is there any way to balance one table?
> 
> What version of HBase are you using ?
> 
> 0.94 has per-table load balancing.
> 
> Cheers
> 
> On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond 
> wrote:
> 
> > Hi
> >
> > Is there any way to balance just one table? I found one of my table is
> > not balanced, while all the other table is balanced. So I want to fix
> > this table.
> >
> > Best Regards,
> > Raymond Liu
> >
> >


Re: Is there any way to balance one table?

2013-02-19 Thread Ted Yu
What version of HBase are you using ?

0.94 has per-table load balancing.

Cheers

On Tue, Feb 19, 2013 at 5:01 PM, Liu, Raymond  wrote:

> Hi
>
> Is there any way to balance just one table? I found one of my table is not
> balanced, while all the other table is balanced. So I want to fix this
> table.
>
> Best Regards,
> Raymond Liu
>
>


Is there any way to balance one table?

2013-02-19 Thread Liu, Raymond
Hi

Is there any way to balance just one table? I found that one of my tables is
not balanced, while all the other tables are balanced. So I want to fix this
table.

Best Regards,
Raymond Liu



Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Asaf Mesika
1. Try batching your increment calls into a List and use batch() to execute
it (see the sketch after this list). Should reduce RPC calls by two orders
of magnitude.
2. Combine batching with processing more words at a time, aggregating the
count per word so you issue fewer Increment commands.
3. Enable Bloom filters. Should speed up Increment by a factor of 2 at
least.
4. Don't use keyValue.getValue(). It does a System.arraycopy behind the
scenes. Use getBuffer(), getValueOffset() and getValueLength() and iterate
over the existing array. Write your own split without using String
functions, which go through encoding (expensive). Just find your delimiter
by byte comparison.
5. Enable Bloom filters on the doc table. It should speed up the checkAndPut.
6. I wouldn't give up the WAL. It ain't your bottleneck IMO.
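
A minimal sketch of items 1 and 2 combined, reusing the poster's pool and
document string, and assuming (as item 1 suggests) that batch() accepts
Increments on the client version in use; 0.94-era classes Increment, Row,
HTableInterface and Bytes are assumed to be imported:

// Aggregate counts locally, then send one Increment per distinct word in a
// single batch() call instead of one increment() RPC per occurrence.
Map<String, Long> counts = new HashMap<String, Long>();
for (String word : docContent.split("\\s+")) {
    Long c = counts.get(word);
    counts.put(word, c == null ? 1L : c + 1L);
}

List<Row> actions = new ArrayList<Row>(counts.size());
for (Map.Entry<String, Long> e : counts.entrySet()) {
    Increment inc = new Increment(Bytes.toBytes(e.getKey()));
    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), e.getValue());
    actions.add(inc);
}

HTableInterface tableIdx = pool.getTable("doc_idx");
try {
    tableIdx.batch(actions);   // requests are grouped per region server
} catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
} finally {
    tableIdx.close();
}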

On Monday, February 18, 2013, prakash kadel wrote:

> Thank you guys for your replies,
> Michael,
>I think i didnt make it clear. Here is my use case,
>
> I have text documents to insert in the hbase. (With possible duplicates)
> Suppose i have a document as : " I am working. He is not working"
>
> I want to insert this document to a table in hbase, say table "doc"
>
> =doc table=
> -
> rowKey : doc_id
> cf: doc_content
> value: "I am working. He is not working"
>
> Now, i to create another table that stores the word count, say "doc_idx"
>
> doc_idx table
> ---
> rowKey : I, cf: count, value: 1
> rowKey : am, cf: count, value: 1
> rowKey : working, cf: count, value: 2
> rowKey : He, cf: count, value: 1
> rowKey : is, cf: count, value: 1
> rowKey : not, cf: count, value: 1
>
> My MR job code:
> ==
>
> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
> for(String word : doc_content.split("\\s+")) {
>Increment inc = new Increment(Bytes.toBytes(word));
>inc.addColumn("count", "", 1);
> }
> }
>
> Now, i wanted to do some experiments with coprocessors. So, i modified
> the code as follows.
>
> My MR job code:
> ===
>
> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>
> Coprocessor code:
> ===
>
> public void start(CoprocessorEnvironment env)  {
> pool = new HTablePool(conf, 100);
> }
>
> public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
> compareOp,   comparator,  put, result) {
>
> if(!result) return true; // check if the put succeeded
>
> HTableInterface table_idx = pool.getTable("doc_idx");
>
> try {
>
> for(KeyValue contentKV = put.get("doc_content",
> "")) {
> for(String word :
> contentKV.getValue().split("\\s+")) {
> Increment inc = new
> Increment(Bytes.toBytes(word));
> inc.addColumn("count", "", 1);
> table_idx.increment(inc);
> }
>}
> } finally {
> table_idx.close();
> }
> return true;
> }
>
> public void stop(env) {
> pool.close();
> }
>
> I am a newbee to HBASE. I am not sure this is the way to do.
> Given that, why is the cooprocessor enabled version much slower than
> the one without?
>
>
> Sincerely,
> Prakash Kadel
>
>
> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
> > wrote:
> >
> > The  issue I was talking about was the use of a check and put.
> > The OP wrote:
> > each map inserts to doc table.(checkAndPut)
> > regionobserver coprocessor does a postCheckAndPut and inserts some
> rows to
> > a index table.
> >
> > My question is why does the OP use a checkAndPut, and the
> RegionObserver's postChecAndPut?
> >
> >
> > Here's a good example...
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
> >
> > The OP doesn't really get in to the use case, so we don't know why the
> Check and Put in the M/R job.
> > He should just be using put() and then a postPut().
> >
> > Another issue... since he's writing to  a different HTable... how? Does
> he create an HTable instance in the start() method of his RO object and
> then reference it later? Or does he create the instance of the HTable on
> the fly in each postCheckAndPut() ?
> > Without seeing his code, we don't know.
> >
> > Note that this is synchronous set of writes. Your overall return from
> the M/R call to put will wait until the second row is inserted.
> >
> > Interestingly enough, you may want to consider disabling the WAL on the
> write to the index.  You can always run a M/R job that rebuilds the index
> should something occur to the system where you might lose the data.
>  Indexes *ARE* expendable. ;-)
> >
> > Does that explain it?
> >
> > -Mike
> >
> > On Feb 18, 2013, at 4:57 AM, yonghu  wrote:
> >
> >> Hi, Michael
> >>
> >> I don't quite understand what do you mean by "round trip back to the
> >> client". In my understa

Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Andrew Purtell
A coprocessor is some code running in a server process. The resources
available and rules of the road are different from client-side programming.
HTablePool (and HTable in general) is problematic for server-side
programming in my opinion: http://search-hadoop.com/m/XtAi5Fogw32 Since
this comes up now and again, it seems like a lightweight alternative for
server-side IPC could be useful.
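
For reference, a rough sketch of the pattern suggested further down this
thread (one table handle per observer, opened in start() and closed in stop(),
instead of an HTablePool), written against a 0.94-era coprocessor API. The
class name, the postPut indexing step, and the use of env.getTable() are
illustrative assumptions, not the original poster's code:

import java.io.IOException;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class DocIndexObserver extends BaseRegionObserver {

    private HTableInterface indexTable;   // one handle, reused across postPut calls

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        indexTable = env.getTable(Bytes.toBytes("doc_idx"));
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        if (indexTable != null) {
            indexTable.close();
        }
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx, Put put,
                        WALEdit edit, boolean writeToWAL) throws IOException {
        // Illustrative indexing step: bump a counter keyed by the row that was put.
        Increment inc = new Increment(put.getRow());
        inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1L);
        indexTable.increment(inc);
    }
}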


On Tue, Feb 19, 2013 at 7:15 AM, Wei Tan  wrote:

> A side question: if HTablePool is not encouraged to be used... how we
> handle the thread safeness in using HTable? Any replacement for
> HTablePool, in plan?
> Thanks,
>
>
> Best Regards,
> Wei
>
>
>
>
> From:   Michel Segel 
> To: "user@hbase.apache.org" ,
> Date:   02/18/2013 09:23 AM
> Subject:Re: coprocessor enabled put very slow, help please~~~
>
>
>
> Why are you using an HTable Pool?
> Why are you closing the table after each iteration through?
>
> Try using 1 HTable object. Turn off WAL
> Initiate in start()
> Close in Stop()
> Surround the use in a try / catch
> If exception caught, re instantiate new HTable connection.
>
> Maybe want to flush the connection after puts.
>
>
> Again not sure why you are using check and put on the base table. Your
> count could be off.
>
> As an example look at poem/rhyme 'Marry had a little lamb'.
> Then check your word count.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 18, 2013, at 7:21 AM, prakash kadel 
> wrote:
>
> > Thank you guys for your replies,
> > Michael,
> >   I think i didnt make it clear. Here is my use case,
> >
> > I have text documents to insert in the hbase. (With possible duplicates)
> > Suppose i have a document as : " I am working. He is not working"
> >
> > I want to insert this document to a table in hbase, say table "doc"
> >
> > =doc table=
> > -
> > rowKey : doc_id
> > cf: doc_content
> > value: "I am working. He is not working"
> >
> > Now, i to create another table that stores the word count, say "doc_idx"
> >
> > doc_idx table
> > ---
> > rowKey : I, cf: count, value: 1
> > rowKey : am, cf: count, value: 1
> > rowKey : working, cf: count, value: 2
> > rowKey : He, cf: count, value: 1
> > rowKey : is, cf: count, value: 1
> > rowKey : not, cf: count, value: 1
> >
> > My MR job code:
> > ==
> >
> > if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
> >for(String word : doc_content.split("\\s+")) {
> >   Increment inc = new Increment(Bytes.toBytes(word));
> >   inc.addColumn("count", "", 1);
> >}
> > }
> >
> > Now, i wanted to do some experiments with coprocessors. So, i modified
> > the code as follows.
> >
> > My MR job code:
> > ===
> >
> > doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
> >
> > Coprocessor code:
> > ===
> >
> >public void start(CoprocessorEnvironment env)  {
> >pool = new HTablePool(conf, 100);
> >}
> >
> >public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
> > compareOp, comparator,  put, result) {
> >
> >if(!result) return true; // check if the put succeeded
> >
> >HTableInterface table_idx = pool.getTable("doc_idx");
> >
> >try {
> >
> >for(KeyValue contentKV = put.get("doc_content", "")) {
> >for(String word :
> > contentKV.getValue().split("\\s+")) {
> >Increment inc = new
> > Increment(Bytes.toBytes(word));
> >inc.addColumn("count", "", 1);
> >table_idx.increment(inc);
> >}
> >   }
> >} finally {
> >table_idx.close();
> >}
> >return true;
> >}
> >
> >public void stop(env) {
> >pool.close();
> >}
> >
> > I am a newbee to HBASE. I am not sure this is the way to do.
> > Given that, why is the cooprocessor enabled version much slower than
> > the one without?
> >
> >
> > Sincerely,
> > Prakash Kadel
> >
> >
> > On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
> >  wrote:
> >>
> >> The  issue I was talking about was the use of a check and put.
> >> The OP wrote:
> >> each map inserts to doc table.(checkAndPut)
> >> regionobserver coprocessor does a postCheckAndPut and inserts some
> rows to
> >> a index table.
> >>
> >> My question is why does the OP use a checkAndPut, and the
> RegionObserver's postChecAndPut?
> >>
> >>
> >> Here's a good example...
>
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>
> >>
> >> The OP doesn't really get in to the use case, so we don't know why the
> Check and Put in the M/R job.
> >> He should just be using put() and then a postPut().
> >>
> >> Another issue... since he's writing to  a different HTable... how? Does
> he create an HTable instance in the start() method of his RO object and
> then reference it later? Or does he create the instance of the HTable on
> th

Re: Scanning a row for certain md5hash does not work

2013-02-19 Thread Paul van Hoven
Sorry, I had a mistake in my rowkey generation.

Thanks for reading!
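
For anyone comparing the two code paths below: the rows are written with
md5(dateFormatter.format(date)) as the leading 16 bytes, while the scan builds
its prefix from md5 of the string form of refDate.getTime(), so the prefixes
do not line up. A sketch of a scan prefix built the same way as the stored
keys (md5() and refDate are the poster's; the padding is just the remaining
32 bytes of md5(time) + md5(ip)):

SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd");
byte[] md5Key = md5(dateFormatter.format(refDate)); // same string form as the write path
byte[] startRow = Bytes.padTail(md5Key, 32);        // md5(time) + md5(ip) = 32 trailing bytes
byte[] endRow = Bytes.padTail(md5Key, 32);
endRow[md5Key.length - 1]++;                        // bump last byte of the 16-byte date prefix
Scan scan = new Scan(startRow, endRow);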

2013/2/19 Paul van Hoven :
> I'm currently reading a book about hbase (hbase in action by manning).
> In this book it is explained how to perform a scan if the rowkey is
> made out of a md5 hash (page 45 in the book). My rowkey design (and
> table filling method) looks like this:
>
> SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd");
> SimpleDateFormat timeFormatter = new SimpleDateFormat("HH:mm:ss");
> Date date = dateFormatter.parse("2013-01-01");
>
> for( int i = 0; i < 31; ++i ) {
> for( int k = 0; k < 24; ++k ) {
> for( int j = 0; j < 1; ++j ) {
> //md5() is a custom method that transforms a
> string into a md5 hash
> byte[] ts = md5( dateFormatter.format(date) );
> byte[] tm = md5( timeFormatter.format(date) );
> byte[] ip = md5( generateRandomIPAddress() /* toy 
> method that
> generates ip addresses */ );
> byte[] rowkey = new byte[ ts.length + tm.length + 
> ip.length ];
> System.arraycopy( ts, 0, rowkey, 0, ts.length );
> System.arraycopy( tm, 0, rowkey, ts.length, tm.length 
> );
> System.arraycopy( ip, 0, rowkey, ts.length+tm.length, 
> ip.length );
> Put p = new Put( rowkey );
>
> p.add( Bytes.toBytes("CF"), 
> Bytes.toBytes("SampleCol"),
> Bytes.toBytes( "Value_" + (i+1) + " = " + dateFormatter.format(date) +
> " " + timeFormatter.format(date) ) );
> toyDataTable.put( p );
> }
>
> //custom method that adds an hour to the current date object
> date = addHours( date, 1 );
> }
>
> }
>
> Now I'd like to do the following scan (I more or less took the same
> code from the example in the book):
>
> SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
> Date refDate = formatter.parse("2013-01-15");
>
> HTableInterface toyDataTable = pool.getTable("ToyDataTable");
>
> byte[] md5Key = md5( refDate.getTime() +"" );
> int md5Length = 16;
> int longLength = 8;
> byte[] startRow = Bytes.padTail( md5Key, longLength );
> byte[] endRow = Bytes.padTail( md5Key, longLength );
> endRow[md5Length-1]++;
>
> Scan scan = new Scan( startRow, endRow );
> ResultScanner rs = toyDataTable.getScanner( scan );
> for( Result r : rs ) {
> String value =  Bytes.toString( r.getValue( Bytes.toBytes("CF"),
> Bytes.toBytes("SampleCol")) );
> System.out.println( value );
> }
>
> The result is empty. How is that possible?


Scanning a row for certain md5hash does not work

2013-02-19 Thread Paul van Hoven
I'm currently reading a book about HBase (HBase in Action, from Manning).
The book explains how to perform a scan when the rowkey is
made out of an MD5 hash (page 45). My rowkey design (and
table-filling method) looks like this:

SimpleDateFormat dateFormatter = new SimpleDateFormat("yyyy-MM-dd");
SimpleDateFormat timeFormatter = new SimpleDateFormat("HH:mm:ss");
Date date = dateFormatter.parse("2013-01-01");

for( int i = 0; i < 31; ++i ) {
for( int k = 0; k < 24; ++k ) {
for( int j = 0; j < 1; ++j ) {
//md5() is a custom method that transforms a
string into a md5 hash
byte[] ts = md5( dateFormatter.format(date) );
byte[] tm = md5( timeFormatter.format(date) );
byte[] ip = md5( generateRandomIPAddress() /* toy 
method that
generates ip addresses */ );
byte[] rowkey = new byte[ ts.length + tm.length + 
ip.length ];
System.arraycopy( ts, 0, rowkey, 0, ts.length );
System.arraycopy( tm, 0, rowkey, ts.length, tm.length );
System.arraycopy( ip, 0, rowkey, ts.length+tm.length, 
ip.length );
Put p = new Put( rowkey );

p.add( Bytes.toBytes("CF"), Bytes.toBytes("SampleCol"),
Bytes.toBytes( "Value_" + (i+1) + " = " + dateFormatter.format(date) +
" " + timeFormatter.format(date) ) );
toyDataTable.put( p );
}

//custom method that adds an hour to the current date object
date = addHours( date, 1 );
}

}

Now I'd like to do the following scan (I more or less took the same
code from the example in the book):

SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
Date refDate = formatter.parse("2013-01-15");

HTableInterface toyDataTable = pool.getTable("ToyDataTable");

byte[] md5Key = md5( refDate.getTime() +"" );
int md5Length = 16;
int longLength = 8;
byte[] startRow = Bytes.padTail( md5Key, longLength );
byte[] endRow = Bytes.padTail( md5Key, longLength );
endRow[md5Length-1]++;

Scan scan = new Scan( startRow, endRow );
ResultScanner rs = toyDataTable.getScanner( scan );
for( Result r : rs ) {
String value =  Bytes.toString( r.getValue( Bytes.toBytes("CF"),
Bytes.toBytes("SampleCol")) );
System.out.println( value );
}

The result is empty. How is that possible?


Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Also, an advantage of going only to the servers needed is the famous
MTTR: there is less chance of hitting a dead server or a region that
just moved.


On Tue, Feb 19, 2013 at 7:42 PM, Nicolas Liochon  wrote:

> Interesting, in the client we're doing a group by location the multiget.
> So we could have the filter as HBase core code, and then we could use it
> in the client for the multiget: compared to my initial proposal, we don't
> have to change anything in the server code and we reuse the filtering
> framework. The filter can be also be used independently.
>
> Is there any issue with this? The reseek seems to be quite smart in the
> way it handles the bloom filters, I don't know if it behaves differently in
> this case vs. a simple get.
>
>
> On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl  wrote:
>
>> I was thinking along the same lines. Doing a skip scan via filter
>> hinting. The problem is as you say that the Filter is instantiated
>> everywhere and it might be of significant size (have to maintain all row
>> keys you are looking for).
>>
>>
>> RegionScanner now a reseek method, it is possible to do this via a
>> coprocessor. They are also loaded per region (but at least not for each
>> store), and one can use the shared coproc state I added to alleviate the
>> memory concern.
>>
>> Thinking about this in terms of multiple scan is interesting. One could
>> identify clusters of close row keys in the Gets and issue a Scan for each
>> cluster.
>>
>>
>> -- Lars
>>
>>
>>
>> 
>>  From: Nicolas Liochon 
>> To: user 
>> Sent: Tuesday, February 19, 2013 9:28 AM
>> Subject: Re: Optimizing Multi Gets in hbase
>>
>> Imho,  the easiest thing to do would be to write a filter.
>> You need to order the rows, then you can use hints to navigate to the next
>> row (SEEK_NEXT_USING_HINT).
>> The main drawback I see is that the filter will be invoked on all regions
>> servers, including the ones that don't need it. But this would also means
>> you have a very specific query pattern (which could be the case, I just
>> don't know), and you can still use the startRow / stopRow of the scan, and
>> create multiple scan if necessary. I'm also interested in Lars' opinion on
>> this.
>>
>> Nicolas
>>
>>
>>
>> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma 
>> wrote:
>>
>> > I have another question, if I am running a scan wrapped around multiple
>> > rows in the same region, in the following way:
>> >
>> > Scan scan = new scan(getWithMultipleRowsInSameRegion);
>> >
>> > Now, how does execution occur. Is it just a sequential scan across the
>> > entire region or does it seek to hfile blocks containing the actual
>> values.
>> > What I truly mean is, lets say the multi get is on following rows:
>> >
>> > Row1 : HFileBlock1
>> > Row2 : HFileBlock20
>> > Row3 : Does not exist
>> > Row4 : HFileBlock25
>> > Row5 : HFileBlock100
>> >
>> > The efficient way to do this would be to determine the correct blocks
>> using
>> > the index and then searching within the blocks for, say Row1. Then,
>> seek to
>> > HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
>> > seeking to + searching within HFileBlocks as needed.
>> >
>> > I am wondering if a scan wrapped around a Get with multiple rows would
>> do
>> > the same ?
>> >
>> > Thanks
>> > Varun
>> >
>> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon 
>> > wrote:
>> >
>> > > Looking at the code, it seems possible to do this server side within
>> the
>> > > multi invocation: we could group the get by region, and do a single
>> scan.
>> > > We could also add some heuristics if necessary...
>> > >
>> > >
>> > >
>> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl 
>> wrote:
>> > >
>> > > > I should qualify that statement, actually.
>> > > >
>> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
>> > > > returned.
>> > > >
>> > > > As James Taylor pointed out to me privately: A fairer comparison
>> would
>> > > > have been to run a scan with a filter that lets x% of the rows pass
>> > (i.e.
>> > > > the selectivity of the scan would be x%) and compare that to a multi
>> > Get
>> > > of
>> > > > the same x% of the row.
>> > > >
>> > > > There we found that a Scan+Filter is more efficient that issuing
>> multi
>> > > > Gets if x is >= 1-2%.
>> > > >
>> > > >
>> > > > Or in other words, translating many Gets into a Scan+Filter is
>> > beneficial
>> > > > if the Scan would return at least 1-2% of the rows to the client.
>> > > > For example:
>> > > > if you are looking for less than 10-20k rows in 1m rows, using muli
>> > Gets
>> > > > is likely more efficient.
>> > > > if you are looking for more than 10-20k rows in 1m rows, using a
>> > > > Scan+Filter is likely more efficient.
>> > > >
>> > > >
>> > > > Of course this is predicated on whether you have an efficient way to
>> > > > represent the rows you are looking for in a filter, so that would
>> > > probably
>> > > > shift this slightly more towards Gets (just im

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Interesting. In the client we're already grouping the multiget by location.
So we could have the filter as HBase core code, and then use it in the
client for the multiget: compared to my initial proposal, we don't have to
change anything in the server code and we reuse the filtering framework.
The filter can also be used independently.

Is there any issue with this? The reseek seems to be quite smart in the way
it handles the bloom filters, I don't know if it behaves differently in
this case vs. a simple get.


On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl  wrote:

> I was thinking along the same lines. Doing a skip scan via filter hinting.
> The problem is as you say that the Filter is instantiated everywhere and it
> might be of significant size (have to maintain all row keys you are looking
> for).
>
>
> RegionScanner now a reseek method, it is possible to do this via a
> coprocessor. They are also loaded per region (but at least not for each
> store), and one can use the shared coproc state I added to alleviate the
> memory concern.
>
> Thinking about this in terms of multiple scan is interesting. One could
> identify clusters of close row keys in the Gets and issue a Scan for each
> cluster.
>
>
> -- Lars
>
>
>
> 
>  From: Nicolas Liochon 
> To: user 
> Sent: Tuesday, February 19, 2013 9:28 AM
> Subject: Re: Optimizing Multi Gets in hbase
>
> Imho,  the easiest thing to do would be to write a filter.
> You need to order the rows, then you can use hints to navigate to the next
> row (SEEK_NEXT_USING_HINT).
> The main drawback I see is that the filter will be invoked on all regions
> servers, including the ones that don't need it. But this would also means
> you have a very specific query pattern (which could be the case, I just
> don't know), and you can still use the startRow / stopRow of the scan, and
> create multiple scan if necessary. I'm also interested in Lars' opinion on
> this.
>
> Nicolas
>
>
>
> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma  wrote:
>
> > I have another question, if I am running a scan wrapped around multiple
> > rows in the same region, in the following way:
> >
> > Scan scan = new scan(getWithMultipleRowsInSameRegion);
> >
> > Now, how does execution occur. Is it just a sequential scan across the
> > entire region or does it seek to hfile blocks containing the actual
> values.
> > What I truly mean is, lets say the multi get is on following rows:
> >
> > Row1 : HFileBlock1
> > Row2 : HFileBlock20
> > Row3 : Does not exist
> > Row4 : HFileBlock25
> > Row5 : HFileBlock100
> >
> > The efficient way to do this would be to determine the correct blocks
> using
> > the index and then searching within the blocks for, say Row1. Then, seek
> to
> > HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
> > seeking to + searching within HFileBlocks as needed.
> >
> > I am wondering if a scan wrapped around a Get with multiple rows would do
> > the same ?
> >
> > Thanks
> > Varun
> >
> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon 
> > wrote:
> >
> > > Looking at the code, it seems possible to do this server side within
> the
> > > multi invocation: we could group the get by region, and do a single
> scan.
> > > We could also add some heuristics if necessary...
> > >
> > >
> > >
> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl 
> wrote:
> > >
> > > > I should qualify that statement, actually.
> > > >
> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > > returned.
> > > >
> > > > As James Taylor pointed out to me privately: A fairer comparison
> would
> > > > have been to run a scan with a filter that lets x% of the rows pass
> > (i.e.
> > > > the selectivity of the scan would be x%) and compare that to a multi
> > Get
> > > of
> > > > the same x% of the row.
> > > >
> > > > There we found that a Scan+Filter is more efficient that issuing
> multi
> > > > Gets if x is >= 1-2%.
> > > >
> > > >
> > > > Or in other words, translating many Gets into a Scan+Filter is
> > beneficial
> > > > if the Scan would return at least 1-2% of the rows to the client.
> > > > For example:
> > > > if you are looking for less than 10-20k rows in 1m rows, using muli
> > Gets
> > > > is likely more efficient.
> > > > if you are looking for more than 10-20k rows in 1m rows, using a
> > > > Scan+Filter is likely more efficient.
> > > >
> > > >
> > > > Of course this is predicated on whether you have an efficient way to
> > > > represent the rows you are looking for in a filter, so that would
> > > probably
> > > > shift this slightly more towards Gets (just imaging a Filter that to
> > > encode
> > > > 100k random row keys to be matched; since Filters are instantiated
> > store
> > > > there is another natural limit there).
> > > >
> > > >
> > > > As I said below, the crux of the matter is having some histograms of
> > your
> > > > data, so that such a decision could be made automatically.
> > > >
> > > >
> > > > -- 

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread lars hofhansl
I was thinking along the same lines. Doing a skip scan via filter hinting. The 
problem is, as you say, that the Filter is instantiated everywhere and it might 
be of significant size (it has to maintain all the row keys you are looking for).


RegionScanner now has a reseek method, so it is possible to do this via a coprocessor. 
They are also loaded per region (but at least not for each store), and one can 
use the shared coproc state I added to alleviate the memory concern.

Thinking about this in terms of multiple scan is interesting. One could 
identify clusters of close row keys in the Gets and issue a Scan for each 
cluster.


-- Lars




 From: Nicolas Liochon 
To: user  
Sent: Tuesday, February 19, 2013 9:28 AM
Subject: Re: Optimizing Multi Gets in hbase
 
Imho,  the easiest thing to do would be to write a filter.
You need to order the rows, then you can use hints to navigate to the next
row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all regions
servers, including the ones that don't need it. But this would also means
you have a very specific query pattern (which could be the case, I just
don't know), and you can still use the startRow / stopRow of the scan, and
create multiple scan if necessary. I'm also interested in Lars' opinion on
this.

Nicolas



On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma  wrote:

> I have another question, if I am running a scan wrapped around multiple
> rows in the same region, in the following way:
>
> Scan scan = new scan(getWithMultipleRowsInSameRegion);
>
> Now, how does execution occur. Is it just a sequential scan across the
> entire region or does it seek to hfile blocks containing the actual values.
> What I truly mean is, lets say the multi get is on following rows:
>
> Row1 : HFileBlock1
> Row2 : HFileBlock20
> Row3 : Does not exist
> Row4 : HFileBlock25
> Row5 : HFileBlock100
>
> The efficient way to do this would be to determine the correct blocks using
> the index and then searching within the blocks for, say Row1. Then, seek to
> HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
> seeking to + searching within HFileBlocks as needed.
>
> I am wondering if a scan wrapped around a Get with multiple rows would do
> the same ?
>
> Thanks
> Varun
>
> On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon 
> wrote:
>
> > Looking at the code, it seems possible to do this server side within the
> > multi invocation: we could group the get by region, and do a single scan.
> > We could also add some heuristics if necessary...
> >
> >
> >
> > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl  wrote:
> >
> > > I should qualify that statement, actually.
> > >
> > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > returned.
> > >
> > > As James Taylor pointed out to me privately: A fairer comparison would
> > > have been to run a scan with a filter that lets x% of the rows pass
> (i.e.
> > > the selectivity of the scan would be x%) and compare that to a multi
> Get
> > of
> > > the same x% of the row.
> > >
> > > There we found that a Scan+Filter is more efficient that issuing multi
> > > Gets if x is >= 1-2%.
> > >
> > >
> > > Or in other words, translating many Gets into a Scan+Filter is
> beneficial
> > > if the Scan would return at least 1-2% of the rows to the client.
> > > For example:
> > > if you are looking for less than 10-20k rows in 1m rows, using muli
> Gets
> > > is likely more efficient.
> > > if you are looking for more than 10-20k rows in 1m rows, using a
> > > Scan+Filter is likely more efficient.
> > >
> > >
> > > Of course this is predicated on whether you have an efficient way to
> > > represent the rows you are looking for in a filter, so that would
> > probably
> > > shift this slightly more towards Gets (just imaging a Filter that to
> > encode
> > > 100k random row keys to be matched; since Filters are instantiated
> store
> > > there is another natural limit there).
> > >
> > >
> > > As I said below, the crux of the matter is having some histograms of
> your
> > > data, so that such a decision could be made automatically.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > 
> > >  From: lars hofhansl 
> > > To: "user@hbase.apache.org" 
> > > Sent: Monday, February 18, 2013 5:48 PM
> > > Subject: Re: Optimizing Multi Gets in hbase
> > >
> > > As it happens we did some tests around last week.
> > > Turns out doing Gets in batches instead of a scan still gives you 1/3
> of
> > > the performance.
> > >
> > > I.e. when you have a table with, say, 10m rows and scanning take N
> > > seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
> > pretty
> > > impressive.
> > >
> > > Now, this is with all data in the cache!
> > > When the data is not in the cache and the Gets are random it is many
> > > orders of magnitude slower, as the Gets are sprayed all over the disk.
> In
> > > that case sorting the Gets and issui

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Varun Sharma
The other suggestion sounds better to me: the multi call is modified to run
the Get(s) with this new filter, or to just initiate a scan with all the
Get(s). Since the client automatically groups the multi calls by region
server and only calls the respective region servers, that would eliminate
calling all region servers. This might be an interesting benchmark to run.

On Tue, Feb 19, 2013 at 9:28 AM, Nicolas Liochon  wrote:

> Imho,  the easiest thing to do would be to write a filter.
> You need to order the rows, then you can use hints to navigate to the next
> row (SEEK_NEXT_USING_HINT).
> The main drawback I see is that the filter will be invoked on all regions
> servers, including the ones that don't need it. But this would also means
> you have a very specific query pattern (which could be the case, I just
> don't know), and you can still use the startRow / stopRow of the scan, and
> create multiple scan if necessary. I'm also interested in Lars' opinion on
> this.
>
> Nicolas
>
>
>
> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma  wrote:
>
> > I have another question, if I am running a scan wrapped around multiple
> > rows in the same region, in the following way:
> >
> > Scan scan = new scan(getWithMultipleRowsInSameRegion);
> >
> > Now, how does execution occur. Is it just a sequential scan across the
> > entire region or does it seek to hfile blocks containing the actual
> values.
> > What I truly mean is, lets say the multi get is on following rows:
> >
> > Row1 : HFileBlock1
> > Row2 : HFileBlock20
> > Row3 : Does not exist
> > Row4 : HFileBlock25
> > Row5 : HFileBlock100
> >
> > The efficient way to do this would be to determine the correct blocks
> using
> > the index and then searching within the blocks for, say Row1. Then, seek
> to
> > HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
> > seeking to + searching within HFileBlocks as needed.
> >
> > I am wondering if a scan wrapped around a Get with multiple rows would do
> > the same ?
> >
> > Thanks
> > Varun
> >
> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon 
> > wrote:
> >
> > > Looking at the code, it seems possible to do this server side within
> the
> > > multi invocation: we could group the get by region, and do a single
> scan.
> > > We could also add some heuristics if necessary...
> > >
> > >
> > >
> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl 
> wrote:
> > >
> > > > I should qualify that statement, actually.
> > > >
> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > > returned.
> > > >
> > > > As James Taylor pointed out to me privately: A fairer comparison
> would
> > > > have been to run a scan with a filter that lets x% of the rows pass
> > (i.e.
> > > > the selectivity of the scan would be x%) and compare that to a multi
> > Get
> > > of
> > > > the same x% of the row.
> > > >
> > > > There we found that a Scan+Filter is more efficient that issuing
> multi
> > > > Gets if x is >= 1-2%.
> > > >
> > > >
> > > > Or in other words, translating many Gets into a Scan+Filter is
> > beneficial
> > > > if the Scan would return at least 1-2% of the rows to the client.
> > > > For example:
> > > > if you are looking for less than 10-20k rows in 1m rows, using muli
> > Gets
> > > > is likely more efficient.
> > > > if you are looking for more than 10-20k rows in 1m rows, using a
> > > > Scan+Filter is likely more efficient.
> > > >
> > > >
> > > > Of course this is predicated on whether you have an efficient way to
> > > > represent the rows you are looking for in a filter, so that would
> > > probably
> > > > shift this slightly more towards Gets (just imaging a Filter that to
> > > encode
> > > > 100k random row keys to be matched; since Filters are instantiated
> > store
> > > > there is another natural limit there).
> > > >
> > > >
> > > > As I said below, the crux of the matter is having some histograms of
> > your
> > > > data, so that such a decision could be made automatically.
> > > >
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > 
> > > >  From: lars hofhansl 
> > > > To: "user@hbase.apache.org" 
> > > > Sent: Monday, February 18, 2013 5:48 PM
> > > > Subject: Re: Optimizing Multi Gets in hbase
> > > >
> > > > As it happens we did some tests around last week.
> > > > Turns out doing Gets in batches instead of a scan still gives you 1/3
> > of
> > > > the performance.
> > > >
> > > > I.e. when you have a table with, say, 10m rows and scanning take N
> > > > seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
> > > pretty
> > > > impressive.
> > > >
> > > > Now, this is with all data in the cache!
> > > > When the data is not in the cache and the Gets are random it is many
> > > > orders of magnitude slower, as the Gets are sprayed all over the
> disk.
> > In
> > > > that case sorting the Gets and issuing scans would indeed be much
> more
> > > > efficient.
> > > >
> > > >
> > > > The Gets in a batch are alr

Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
You can use FuzzyRowFilter to do that.

Have a look at this link. You might find it helpful.
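
For instance, assuming a fixed-length key made of a 4-byte hash prefix followed
by an 8-byte long timestamp, here is an untested sketch of the mask mechanics
(it matches one exact timestamp under any prefix; 0 = fixed byte, 1 = don't care):

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

long ts = 1357279200000L;                                 // the timestamp to look for
byte[] key = Bytes.add(new byte[4], Bytes.toBytes(ts));   // prefix bytes are placeholders
byte[] mask = new byte[12];
Arrays.fill(mask, 0, 4, (byte) 1);                        // hash part: any value
Arrays.fill(mask, 4, 12, (byte) 0);                       // timestamp part: must match

Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
    Arrays.asList(new Pair<byte[], byte[]>(key, mask))));

Since the filter matches fixed byte positions, it fits best when the time part
of the key is bucketed (e.g. per hour or day) rather than queried as a raw
millisecond range.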

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 19, 2013 at 11:20 PM, Paul van Hoven <
paul.van.ho...@googlemail.com> wrote:

> Yeah it worked fine.
>
> But as I understand: If I prefix my row key with something like
>
> md5-hash + timestamp
>
> then the rowkeys are probably evenly distributed but how would I
> perform then a scan restricted to a special time range?
>
>
> 2013/2/19 Mohammad Tariq :
> > No. before the timestamp. All the row keys which are identical go to the
> > same region. This is the default Hbase behavior and is meant to make the
> > performance better. But sometimes the machine gets overloaded with reads
> > and writes because we get concentrated on that particular machine. For
> > example timeseries data. So it's better to hash the keys in order to make
> > them go to all the machines equally. HTH
> >
> > BTW, did that range query work??
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> > cloudfront.blogspot.com
> >
> >
> > On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven <
> > paul.van.ho...@googlemail.com> wrote:
> >
> >> Hey Tariq,
> >>
> >> thanks for your quick answer. I'm not sure if I got the idea in the
> >> seond part of your answer. You mean if I use a timestamp as a rowkey I
> >> should append a hash like this:
> >>
> >> 135727920+MD5HASH
> >>
> >> and then the data would be distributed more equally?
> >>
> >>
> >> 2013/2/19 Mohammad Tariq :
> >> > Hello Paul,
> >> >
> >> > Try this and see if it works :
> >> >scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
> >> >scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));
> >> >
> >> > Also try not to use TS as the rowkey, as it may lead to RS
> hotspotting.
> >> > Just add a hash to your rowkeys so that data is distributed evenly on
> all
> >> > the RSs.
> >> >
> >> > Warm Regards,
> >> > Tariq
> >> > https://mtariq.jux.com/
> >> > cloudfront.blogspot.com
> >> >
> >> >
> >> > On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven <
> >> > paul.van.ho...@googlemail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm currently playing with hbase. The design of the rowkey seems to
> be
> >> >> critical.
> >> >>
> >> >> The rowkey for a certain database table of mine is:
> >> >>
> >> >> timestamp+ipaddress
> >> >>
> >> >> It looks something like this when performing a scan on the table in
> the
> >> >> shell:
> >> >> hbase(main):012:0> scan 'ToyDataTable'
> >> >> ROW COLUMN+CELL
> >> >>  135702000+192.168.178.9column=CF:SampleCol,
> >> >> timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
> >> >>
> >> >> Since I got several rows for different timestamps I'd like to tell a
> >> >> scan to just a region of the table for example from 2013-01-07 to
> >> >> 2013-01-09. Previously I only had a timestamp as the rowkey and I
> >> >> could restrict the rowkey like that:
> >> >>
> >> >> SimpleDateFormat formatter = new SimpleDateFormat("-MM-dd
> >> HH:mm:ss");
> >> >> Date startDate = formatter.parse("2013-01-07
> >> >> 07:00:00");
> >> >> Date endDate = formatter.parse("2013-01-10
> >> >> 07:00:00");
> >> >>
> >> >> HTableInterface toyDataTable =
> >> >> pool.getTable("ToyDataTable");
> >> >> Scan scan = new Scan( Bytes.toBytes(
> >> >> startDate.getTime() ),
> >> >> Bytes.toBytes( endDate.getTime() ) );
> >> >>
> >> >> But this no longer works with my new design.
> >> >>
> >> >> Is there a way to tell the scan object to filter the rows with
> respect
> >> >> to the timestamp, or do I have to use a filter object?
> >> >>
> >>
>


Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Yeah, it worked fine.

But as I understand it: if I prefix my row key with something like

md5-hash + timestamp

then the rowkeys are probably evenly distributed, but how would I then
perform a scan restricted to a specific time range?


2013/2/19 Mohammad Tariq :
> No. before the timestamp. All the row keys which are identical go to the
> same region. This is the default Hbase behavior and is meant to make the
> performance better. But sometimes the machine gets overloaded with reads
> and writes because we get concentrated on that particular machine. For
> example timeseries data. So it's better to hash the keys in order to make
> them go to all the machines equally. HTH
>
> BTW, did that range query work??
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven <
> paul.van.ho...@googlemail.com> wrote:
>
>> Hey Tariq,
>>
>> thanks for your quick answer. I'm not sure if I got the idea in the
>> seond part of your answer. You mean if I use a timestamp as a rowkey I
>> should append a hash like this:
>>
>> 135727920+MD5HASH
>>
>> and then the data would be distributed more equally?
>>
>>
>> 2013/2/19 Mohammad Tariq :
>> > Hello Paul,
>> >
>> > Try this and see if it works :
>> >scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
>> >scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));
>> >
>> > Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
>> > Just add a hash to your rowkeys so that data is distributed evenly on all
>> > the RSs.
>> >
>> > Warm Regards,
>> > Tariq
>> > https://mtariq.jux.com/
>> > cloudfront.blogspot.com
>> >
>> >
>> > On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven <
>> > paul.van.ho...@googlemail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm currently playing with hbase. The design of the rowkey seems to be
>> >> critical.
>> >>
>> >> The rowkey for a certain database table of mine is:
>> >>
>> >> timestamp+ipaddress
>> >>
>> >> It looks something like this when performing a scan on the table in the
>> >> shell:
>> >> hbase(main):012:0> scan 'ToyDataTable'
>> >> ROW COLUMN+CELL
>> >>  135702000+192.168.178.9column=CF:SampleCol,
>> >> timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
>> >>
>> >> Since I got several rows for different timestamps I'd like to tell a
>> >> scan to just a region of the table for example from 2013-01-07 to
>> >> 2013-01-09. Previously I only had a timestamp as the rowkey and I
>> >> could restrict the rowkey like that:
>> >>
>> >> SimpleDateFormat formatter = new SimpleDateFormat("-MM-dd
>> HH:mm:ss");
>> >> Date startDate = formatter.parse("2013-01-07
>> >> 07:00:00");
>> >> Date endDate = formatter.parse("2013-01-10
>> >> 07:00:00");
>> >>
>> >> HTableInterface toyDataTable =
>> >> pool.getTable("ToyDataTable");
>> >> Scan scan = new Scan( Bytes.toBytes(
>> >> startDate.getTime() ),
>> >> Bytes.toBytes( endDate.getTime() ) );
>> >>
>> >> But this no longer works with my new design.
>> >>
>> >> Is there a way to tell the scan object to filter the rows with respect
>> >> to the timestamp, or do I have to use a filter object?
>> >>
>>


Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
No, before the timestamp. Row keys that are close to each other go to the
same region. This is the default HBase behavior and is meant to make the
performance better. But sometimes one machine gets overloaded with reads
and writes because the load gets concentrated on that particular machine, for
example with timeseries data. So it's better to hash the keys in order to
spread them across all the machines equally. HTH

BTW, did that range query work??

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 19, 2013 at 9:54 PM, Paul van Hoven <
paul.van.ho...@googlemail.com> wrote:

> Hey Tariq,
>
> thanks for your quick answer. I'm not sure if I got the idea in the
> seond part of your answer. You mean if I use a timestamp as a rowkey I
> should append a hash like this:
>
> 135727920+MD5HASH
>
> and then the data would be distributed more equally?
>
>
> 2013/2/19 Mohammad Tariq :
> > Hello Paul,
> >
> > Try this and see if it works :
> >scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
> >scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));
> >
> > Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
> > Just add a hash to your rowkeys so that data is distributed evenly on all
> > the RSs.
> >
> > Warm Regards,
> > Tariq
> > https://mtariq.jux.com/
> > cloudfront.blogspot.com
> >
> >
> > On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven <
> > paul.van.ho...@googlemail.com> wrote:
> >
> >> Hi,
> >>
> >> I'm currently playing with hbase. The design of the rowkey seems to be
> >> critical.
> >>
> >> The rowkey for a certain database table of mine is:
> >>
> >> timestamp+ipaddress
> >>
> >> It looks something like this when performing a scan on the table in the
> >> shell:
> >> hbase(main):012:0> scan 'ToyDataTable'
> >> ROW COLUMN+CELL
> >>  135702000+192.168.178.9column=CF:SampleCol,
> >> timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
> >>
> >> Since I got several rows for different timestamps I'd like to tell a
> >> scan to just a region of the table for example from 2013-01-07 to
> >> 2013-01-09. Previously I only had a timestamp as the rowkey and I
> >> could restrict the rowkey like that:
> >>
> >> SimpleDateFormat formatter = new SimpleDateFormat("-MM-dd
> HH:mm:ss");
> >> Date startDate = formatter.parse("2013-01-07
> >> 07:00:00");
> >> Date endDate = formatter.parse("2013-01-10
> >> 07:00:00");
> >>
> >> HTableInterface toyDataTable =
> >> pool.getTable("ToyDataTable");
> >> Scan scan = new Scan( Bytes.toBytes(
> >> startDate.getTime() ),
> >> Bytes.toBytes( endDate.getTime() ) );
> >>
> >> But this no longer works with my new design.
> >>
> >> Is there a way to tell the scan object to filter the rows with respect
> >> to the timestamp, or do I have to use a filter object?
> >>
>


Re: Co-Processor in scanning the HBase's Table

2013-02-19 Thread Farrokh Shahriari
Thank you guys

On Mon, Feb 18, 2013 at 12:00 PM, Amit Sela  wrote:

> Yes... that was emailing half asleep... :)
>
> On Mon, Feb 18, 2013 at 7:23 AM, Anoop Sam John 
> wrote:
>
> > We dont have any hook like postScan()..  In ur case you can try with
> > postScannerClose()..  This will be called once per region. When the scan
> on
> > that region is over the scanner opened on that region will get closed and
> > at that time this hook will get executed.
> >
> > -Anoop-
> > 
> > From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
> > Sent: Monday, February 18, 2013 10:27 AM
> > To: user@hbase.apache.org
> > Cc: cdh-u...@cloudera.org
> > Subject: Re: Co-Processor in scanning the HBase's Table
> >
> > Thanks you Amit,I will check that.
> > @Anoop: I wanna run that just after scanning a region or after scanning
> the
> > regions that to belong to one regionserver.
> >
> > On Mon, Feb 18, 2013 at 7:45 AM, Anoop Sam John 
> > wrote:
> >
> > > >I wanna use a custom code after scanning a large table and prefer to
> run
> > > the code after scanning each region
> > >
> > > Exactly at what point you want to run your custom code?  We have hooks
> at
> > > points like opening a scanner at a region, closing scanner at a region,
> > > calling next (pre/post) etc
> > >
> > > -Anoop-
> > > 
> > > From: Farrokh Shahriari [mohandes.zebeleh...@gmail.com]
> > > Sent: Monday, February 18, 2013 12:21 AM
> > > To: cdh-u...@cloudera.org; user@hbase.apache.org
> > > Subject: Co-Processor in scanning the HBase's Table
> > >
> > > Hi there
> > > I wanna use a custom code after scanning a large table and prefer to
> run
> > > the code after scanning each region.I know that I should use
> > > co-processor,but don't know which of Observer,Endpoint or both of them
> I
> > > should use ? Is there any simple example of them ?
> > >
> > > Tnx
> > >
> >
>


Re: Using HBase for Deduping

2013-02-19 Thread Rahul Ravindran
I could surround it with a try..catch, but then each time I insert a UUID for the 
first time (99% of the time) I would do a checkAndPut(), catch the resulting 
exception and perform a Put; so, 2 operations per reduce invocation, which is what 
I was looking to avoid.
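
For reference, the single-call idiom I was aiming for looks roughly like this
(untested sketch; table, family and qualifier names are invented). Passing null
as the expected value is documented to mean "apply the Put only if that cell
does not exist yet":

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// events: an HTable for the dedup table; uuid: the event id (both assumed to exist)
byte[] row = Bytes.toBytes(uuid);
Put put = new Put(row);
put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));

// expected value == null  =>  "put only if d:seen does not exist yet"
boolean firstTime = events.checkAndPut(row, Bytes.toBytes("d"),
    Bytes.toBytes("seen"), null, put);
if (firstTime) {
  // first time this UUID is seen: push the event downstream
}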



 From: Michael Segel 
To: user@hbase.apache.org; Rahul Ravindran  
Sent: Friday, February 15, 2013 9:24 AM
Subject: Re: Using HBase for Deduping
 

Interesting. 

Surround with a Try Catch? 

But it sounds like you're on the right path. 

Happy Coding!


On Feb 15, 2013, at 11:12 AM, Rahul Ravindran  wrote:

I had tried checkAndPut yesterday with a null passed as the value and it had 
thrown an exception when the row did not exist. Perhaps, I was doing something 
wrong. Will try that again, since, yes, I would prefer a checkAndPut().
>
>
>
>From: Michael Segel 
>To: user@hbase.apache.org 
>Cc: Rahul Ravindran  
>Sent: Friday, February 15, 2013 4:36 AM
>Subject: Re: Using HBase for Deduping
>
>
>On Feb 15, 2013, at 3:07 AM, Asaf Mesika  wrote:
>
>
>Michael, this means read for every write?
>>
>>Yes and no. 
>
>At the macro level, a read for every write would mean that your client would 
>read a record from HBase, and then based on some logic it would either write a 
>record, or not. 
>
>So that you have a lot of overhead in the initial get() and then put(). 
>
>At this macro level, with a Check and Put you have less overhead because of a 
>single message to HBase.
>
>Internal to HBase, you would still have to check the value in the row, if it 
>exists and then perform an insert or not. 
>
>WIth respect to your billion events an hour... 
>
>dividing by 3600 to get the number of events in a second. You would have less 
>than 300,000 events a second. 
>
>What exactly are you doing and how large are those events? 
>
>Since you are processing these events in a batch job, timing doesn't appear to 
>be that important and of course there is also async hbase which may improve 
>some of the performance. 
>
>YMMV but this is a good example of the checkAndPut()
>
>
>
>
>On Friday, February 15, 2013, Michael Segel wrote:
>>
>>
>>What constitutes a duplicate?
>>>
>>>An over simplification is to do a HTable.checkAndPut() where you do the
>>>put if the column doesn't exist.
>>>Then if the row is inserted (TRUE) return value, you push the event.
>>>
>>>That will do what you want.
>>>
>>>At least at first blush.
>>>
>>>
>>>
>>>On Feb 14, 2013, at 3:24 PM, Viral Bajaria 
>>>wrote:
>>>
>>>
>>>Given the size of the data (> 1B rows) and the frequency of job run (once
per hour), I don't think your most optimal solution is to lookup HBase
for
>>>
>>>every single event. You will benefit more by loading the HBase table
directly in your MR job.

In 1B rows, what's the cardinality ? Is it 100M UUID's ? 99% unique
UUID's ?
>>>
>>>
Also once you have done the unique, are you going to use the data again
in
>>>
>>>some other way i.e. online serving of traffic or some other analysis ? Or
this is just to compute some unique #'s ?

It will be more helpful if you describe your final use case of the
computed
>>>
>>>data too. Given the amount of back and forth, we can take it off list too
and summarize the conversation for the list.

On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran 
wrote:
>>>
>>>

We can't rely on the the assumption event dupes will not dupe outside an
>hour boundary. So, your take is that, doing a lookup per event within
>the
>>>
>>>MR job is going to be bad?
>
>
>
>From: Viral Bajaria 
>To: Rahul Ravindran 
>Cc: "user@hbase.apache.org" 
>Sent: Thursday, February 14, 2013 12:48 PM
>Subject: Re: Using HBase for Deduping
>
>You could do with a 2-pronged approach here i.e. some MR and some HBase
>lookups. I don't think this is the best solution either given the # of
>events you will get.
>
>FWIW, the solution below again relies on the assumption that if a event
>is
>>>
>>>duped in the same hour it won't have a dupe outside of that hour
>boundary.
>>>
>>>If it can have then you are better of with running a MR job with the
>current hour + another 3 hours of data or an MR job with the current
>hour +
>>>
>>>the HBase table as input to the job too (i.e. no HBase lookups, just
>read
>>>
>>>the HFile directly) ?
>
>- Run a MR job which de-dupes events for the current hour i.e. only
>runs on
>>>
>>>1 hour worth of data.
>- Mark records which you were not able to de-dupe in the current run
>- For the records that you were not able to de-dupe, check against HBase
>whether you saw that event in the past. If you did, you can drop the
>current event or update the event to the new value (based on your
>business
>>>
>>>logic)
>- Save all the de-duped events (via HBase bulk upload)
>
>>

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Imho,  the easiest thing to do would be to write a filter.
You need to order the rows, then you can use hints to navigate to the next
row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all region
servers, including the ones that don't need it. But this would also mean
you have a very specific query pattern (which could be the case, I just
don't know), and you can still use the startRow / stopRow of the scan, and
create multiple scans if necessary. I'm also interested in Lars' opinion on
this.
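
In code, such a hint filter could look roughly like the sketch below (untested,
0.94-style API; the class name is invented, and the Writable serialization needed
to ship the row-key set to the region servers is omitted):

import java.util.TreeSet;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class RowSetHintFilter extends FilterBase {
  // sorted set of the row keys we actually want
  private final TreeSet<byte[]> rows = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);

  public RowSetHintFilter(byte[][] wanted) {
    for (byte[] r : wanted) {
      rows.add(r);
    }
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    byte[] row = kv.getRow();
    if (rows.contains(row)) {
      return ReturnCode.INCLUDE;            // one of the requested rows
    }
    return rows.ceiling(row) == null
        ? ReturnCode.NEXT_ROW               // already past the last requested row
        : ReturnCode.SEEK_NEXT_USING_HINT;  // jump forward instead of scanning on
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue current) {
    byte[] next = rows.ceiling(current.getRow());
    return next == null ? null : KeyValue.createFirstOnRow(next);
  }
}

Setting the scan's startRow/stopRow to the first and last wanted key additionally
keeps regions entirely outside that range out of the picture.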

Nicolas



On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma  wrote:

> I have another question, if I am running a scan wrapped around multiple
> rows in the same region, in the following way:
>
> Scan scan = new scan(getWithMultipleRowsInSameRegion);
>
> Now, how does execution occur. Is it just a sequential scan across the
> entire region or does it seek to hfile blocks containing the actual values.
> What I truly mean is, lets say the multi get is on following rows:
>
> Row1 : HFileBlock1
> Row2 : HFileBlock20
> Row3 : Does not exist
> Row4 : HFileBlock25
> Row5 : HFileBlock100
>
> The efficient way to do this would be to determine the correct blocks using
> the index and then searching within the blocks for, say Row1. Then, seek to
> HFileBlock20 and then look for Row2. Elimininate Row3 and then keep on
> seeking to + searching within HFileBlocks as needed.
>
> I am wondering if a scan wrapped around a Get with multiple rows would do
> the same ?
>
> Thanks
> Varun
>
> On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon 
> wrote:
>
> > Looking at the code, it seems possible to do this server side within the
> > multi invocation: we could group the get by region, and do a single scan.
> > We could also add some heuristics if necessary...
> >
> >
> >
> > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl  wrote:
> >
> > > I should qualify that statement, actually.
> > >
> > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > returned.
> > >
> > > As James Taylor pointed out to me privately: A fairer comparison would
> > > have been to run a scan with a filter that lets x% of the rows pass
> (i.e.
> > > the selectivity of the scan would be x%) and compare that to a multi
> Get
> > of
> > > the same x% of the row.
> > >
> > > There we found that a Scan+Filter is more efficient that issuing multi
> > > Gets if x is >= 1-2%.
> > >
> > >
> > > Or in other words, translating many Gets into a Scan+Filter is
> beneficial
> > > if the Scan would return at least 1-2% of the rows to the client.
> > > For example:
> > > if you are looking for less than 10-20k rows in 1m rows, using muli
> Gets
> > > is likely more efficient.
> > > if you are looking for more than 10-20k rows in 1m rows, using a
> > > Scan+Filter is likely more efficient.
> > >
> > >
> > > Of course this is predicated on whether you have an efficient way to
> > > represent the rows you are looking for in a filter, so that would
> > probably
> > > shift this slightly more towards Gets (just imaging a Filter that to
> > encode
> > > 100k random row keys to be matched; since Filters are instantiated
> store
> > > there is another natural limit there).
> > >
> > >
> > > As I said below, the crux of the matter is having some histograms of
> your
> > > data, so that such a decision could be made automatically.
> > >
> > >
> > > -- Lars
> > >
> > >
> > >
> > > 
> > >  From: lars hofhansl 
> > > To: "user@hbase.apache.org" 
> > > Sent: Monday, February 18, 2013 5:48 PM
> > > Subject: Re: Optimizing Multi Gets in hbase
> > >
> > > As it happens we did some tests around last week.
> > > Turns out doing Gets in batches instead of a scan still gives you 1/3
> of
> > > the performance.
> > >
> > > I.e. when you have a table with, say, 10m rows and scanning take N
> > > seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
> > pretty
> > > impressive.
> > >
> > > Now, this is with all data in the cache!
> > > When the data is not in the cache and the Gets are random it is many
> > > orders of magnitude slower, as the Gets are sprayed all over the disk.
> In
> > > that case sorting the Gets and issuing scans would indeed be much more
> > > efficient.
> > >
> > >
> > > The Gets in a batch are already sorted on the client, but as N. says it
> > is
> > > hard to determine when to turn many Gets into a Scan with filters
> > > automatically. Without statistics/histograms I'd even wager a guess
> that
> > > would be impossible to do.
> > > Imagine you issue 1 random Gets, but your table has 10bn rows, in
> > that
> > > case it is almost certain that the Gets are faster than a scan.
> > > Now image the Gets only cover a small key range. With statistics we
> could
> > > tell whether it would beneficial to turn this into a scan.
> > >
> > > It's not that hard to add statistics to HBase. Would do it as part of
> the
> > > compactions, and record the histograms in some table.
> > >
> 

Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Michael Segel
I should follow up by saying that I was asking why he was using an HTablePool, not 
saying that it was wrong. 

Still, I think the writes to the index shouldn't have to go to the WAL. 
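
To make that concrete, here is a rough, untested sketch of the pattern being
suggested in this thread: one table instance opened in start(), closed in stop(),
and the index increments written with the WAL turned off (table and column names
follow the OP's example):

import java.io.IOException;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class WordCountObserver extends BaseRegionObserver {
  private HTableInterface indexTable;

  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    // one table instance for the lifetime of the observer, no pool
    // (note: HTableInterface is not thread-safe; concurrent handlers would
    // need their own instance or external synchronization)
    indexTable = env.getTable(Bytes.toBytes("doc_idx"));
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    if (indexTable != null) indexTable.close();
  }

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> c, Put put,
      WALEdit edit, boolean writeToWAL) throws IOException {
    for (KeyValue kv : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
      for (String word : Bytes.toString(kv.getValue()).split("\\s+")) {
        Increment inc = new Increment(Bytes.toBytes(word));
        inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
        inc.setWriteToWAL(false);   // the index can be rebuilt, so skip the WAL
        indexTable.increment(inc);
      }
    }
  }
}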


On Feb 19, 2013, at 10:01 AM, Michael Segel  wrote:

> Good question.. 
> 
> You create a class MyRO. 
> 
> How many instances of  MyRO exist per RS?
> 
> How many queries can access the instance MyRO at the same time? 
> 
> 
> 
> 
> On Feb 19, 2013, at 9:15 AM, Wei Tan  wrote:
> 
>> A side question: if HTablePool is not encouraged to be used... how we 
>> handle the thread safeness in using HTable? Any replacement for 
>> HTablePool, in plan?
>> Thanks,
>> 
>> 
>> Best Regards,
>> Wei
>> 
>> 
>> 
>> 
>> From:   Michel Segel 
>> To: "user@hbase.apache.org" , 
>> Date:   02/18/2013 09:23 AM
>> Subject:Re: coprocessor enabled put very slow, help please~~~
>> 
>> 
>> 
>> Why are you using an HTable Pool?
>> Why are you closing the table after each iteration through?
>> 
>> Try using 1 HTable object. Turn off WAL
>> Initiate in start()
>> Close in Stop()
>> Surround the use in a try / catch
>> If exception caught, re instantiate new HTable connection.
>> 
>> Maybe want to flush the connection after puts. 
>> 
>> 
>> Again not sure why you are using check and put on the base table. Your 
>> count could be off.
>> 
>> As an example look at poem/rhyme 'Marry had a little lamb'.
>> Then check your word count.
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On Feb 18, 2013, at 7:21 AM, prakash kadel  
>> wrote:
>> 
>>> Thank you guys for your replies,
>>> Michael,
>>> I think i didnt make it clear. Here is my use case,
>>> 
>>> I have text documents to insert in the hbase. (With possible duplicates)
>>> Suppose i have a document as : " I am working. He is not working"
>>> 
>>> I want to insert this document to a table in hbase, say table "doc"
>>> 
>>> =doc table=
>>> -
>>> rowKey : doc_id
>>> cf: doc_content
>>> value: "I am working. He is not working"
>>> 
>>> Now, i to create another table that stores the word count, say "doc_idx"
>>> 
>>> doc_idx table
>>> ---
>>> rowKey : I, cf: count, value: 1
>>> rowKey : am, cf: count, value: 1
>>> rowKey : working, cf: count, value: 2
>>> rowKey : He, cf: count, value: 1
>>> rowKey : is, cf: count, value: 1
>>> rowKey : not, cf: count, value: 1
>>> 
>>> My MR job code:
>>> ==
>>> 
>>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>>  for(String word : doc_content.split("\\s+")) {
>>> Increment inc = new Increment(Bytes.toBytes(word));
>>> inc.addColumn("count", "", 1);
>>>  }
>>> }
>>> 
>>> Now, i wanted to do some experiments with coprocessors. So, i modified
>>> the code as follows.
>>> 
>>> My MR job code:
>>> ===
>>> 
>>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>> 
>>> Coprocessor code:
>>> ===
>>> 
>>>  public void start(CoprocessorEnvironment env)  {
>>>  pool = new HTablePool(conf, 100);
>>>  }
>>> 
>>>  public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>>> compareOp, comparator,  put, result) {
>>> 
>>>  if(!result) return true; // check if the put succeeded
>>> 
>>>  HTableInterface table_idx = pool.getTable("doc_idx");
>>> 
>>>  try {
>>> 
>>>  for(KeyValue contentKV = put.get("doc_content", "")) {
>>>  for(String word :
>>> contentKV.getValue().split("\\s+")) {
>>>  Increment inc = new
>>> Increment(Bytes.toBytes(word));
>>>  inc.addColumn("count", "", 1);
>>>  table_idx.increment(inc);
>>>  }
>>> }
>>>  } finally {
>>>  table_idx.close();
>>>  }
>>>  return true;
>>>  }
>>> 
>>>  public void stop(env) {
>>>  pool.close();
>>>  }
>>> 
>>> I am a newbee to HBASE. I am not sure this is the way to do.
>>> Given that, why is the cooprocessor enabled version much slower than
>>> the one without?
>>> 
>>> 
>>> Sincerely,
>>> Prakash Kadel
>>> 
>>> 
>>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>>>  wrote:
 
 The  issue I was talking about was the use of a check and put.
 The OP wrote:
 each map inserts to doc table.(checkAndPut)
 regionobserver coprocessor does a postCheckAndPut and inserts some 
>> rows to
 a index table.
 
 My question is why does the OP use a checkAndPut, and the 
>> RegionObserver's postChecAndPut?
 
 
 Here's a good example... 
>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>> 
 
 The OP doesn't really get in to the use case, so we don't know why the 
>> Check and Put in the M/R job.
 He should just be using put() and then a postPut().
 
 Another issue... since he's writing to  a different HTable... how? Does 
>> he create an HTable instance in the start() method of his RO object and 

Re: Rowkey design question

2013-02-19 Thread Paul van Hoven
Hey Tariq,

thanks for your quick answer. I'm not sure if I got the idea in the
second part of your answer. You mean if I use a timestamp as a rowkey I
should append a hash like this:

135727920+MD5HASH

and then the data would be distributed more equally?


2013/2/19 Mohammad Tariq :
> Hello Paul,
>
> Try this and see if it works :
>scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
>scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));
>
> Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
> Just add a hash to your rowkeys so that data is distributed evenly on all
> the RSs.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven <
> paul.van.ho...@googlemail.com> wrote:
>
>> Hi,
>>
>> I'm currently playing with hbase. The design of the rowkey seems to be
>> critical.
>>
>> The rowkey for a certain database table of mine is:
>>
>> timestamp+ipaddress
>>
>> It looks something like this when performing a scan on the table in the
>> shell:
>> hbase(main):012:0> scan 'ToyDataTable'
>> ROW COLUMN+CELL
>>  135702000+192.168.178.9column=CF:SampleCol,
>> timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
>>
>> Since I got several rows for different timestamps I'd like to tell a
>> scan to just a region of the table for example from 2013-01-07 to
>> 2013-01-09. Previously I only had a timestamp as the rowkey and I
>> could restrict the rowkey like that:
>>
>> SimpleDateFormat formatter = new SimpleDateFormat("-MM-dd HH:mm:ss");
>> Date startDate = formatter.parse("2013-01-07
>> 07:00:00");
>> Date endDate = formatter.parse("2013-01-10
>> 07:00:00");
>>
>> HTableInterface toyDataTable =
>> pool.getTable("ToyDataTable");
>> Scan scan = new Scan( Bytes.toBytes(
>> startDate.getTime() ),
>> Bytes.toBytes( endDate.getTime() ) );
>>
>> But this no longer works with my new design.
>>
>> Is there a way to tell the scan object to filter the rows with respect
>> to the timestamp, or do I have to use a filter object?
>>


Re: Rowkey design question

2013-02-19 Thread Mohammad Tariq
Hello Paul,

Try this and see if it works :
   scan.setStartRow(Bytes.toBytes(startDate.getTime() + ""));
   scan.setStopRow(Bytes.toBytes(endDate.getTime() + 1 + ""));

Also try not to use TS as the rowkey, as it may lead to RS hotspotting.
Just add a hash to your rowkeys so that data is distributed evenly on all
the RSs.
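
For example (just a sketch; ipAddress and timestamp are assumed to be a String
and a long you already have):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

// row key = short hash prefix derived from the IP + the timestamp
String prefix = MD5Hash.getMD5AsHex(Bytes.toBytes(ipAddress)).substring(0, 4);
byte[] rowkey = Bytes.add(Bytes.toBytes(prefix), Bytes.toBytes(timestamp));

If you later need time-range scans, a small fixed set of salt buckets (say 00..0f)
instead of a full hash makes it practical to issue one bounded scan per bucket.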

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Tue, Feb 19, 2013 at 9:41 PM, Paul van Hoven <
paul.van.ho...@googlemail.com> wrote:

> Hi,
>
> I'm currently playing with hbase. The design of the rowkey seems to be
> critical.
>
> The rowkey for a certain database table of mine is:
>
> timestamp+ipaddress
>
> It looks something like this when performing a scan on the table in the
> shell:
> hbase(main):012:0> scan 'ToyDataTable'
> ROW COLUMN+CELL
>  135702000+192.168.178.9column=CF:SampleCol,
> timestamp=1361288601717, value=Entry_1 = 2013-01-01 07:00:00
>
> Since I got several rows for different timestamps I'd like to tell a
> scan to just a region of the table for example from 2013-01-07 to
> 2013-01-09. Previously I only had a timestamp as the rowkey and I
> could restrict the rowkey like that:
>
> SimpleDateFormat formatter = new SimpleDateFormat("-MM-dd HH:mm:ss");
> Date startDate = formatter.parse("2013-01-07
> 07:00:00");
> Date endDate = formatter.parse("2013-01-10
> 07:00:00");
>
> HTableInterface toyDataTable =
> pool.getTable("ToyDataTable");
> Scan scan = new Scan( Bytes.toBytes(
> startDate.getTime() ),
> Bytes.toBytes( endDate.getTime() ) );
>
> But this no longer works with my new design.
>
> Is there a way to tell the scan object to filter the rows with respect
> to the timestamp, or do I have to use a filter object?
>


Rowkey design question

2013-02-19 Thread Paul van Hoven
Hi,

I'm currently playing with hbase. The design of the rowkey seems to be
critical.

The rowkey for a certain database table of mine is:

timestamp+ipaddress

It looks something like this when performing a scan on the table in the shell:
hbase(main):012:0> scan 'ToyDataTable'
ROW                          COLUMN+CELL
 135702000+192.168.178.9     column=CF:SampleCol, timestamp=1361288601717,
                             value=Entry_1 = 2013-01-01 07:00:00

Since I've got several rows for different timestamps, I'd like to tell a
scan to cover just a part of the table, for example from 2013-01-07 to
2013-01-09. Previously I only had a timestamp as the rowkey and I
could restrict the scan like this:

SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date startDate = formatter.parse("2013-01-07 07:00:00");
Date endDate = formatter.parse("2013-01-10 07:00:00");

HTableInterface toyDataTable = pool.getTable("ToyDataTable");
Scan scan = new Scan( Bytes.toBytes( startDate.getTime() ),
                      Bytes.toBytes( endDate.getTime() ) );

But this no longer works with my new design.

Is there a way to tell the scan object to filter the rows with respect
to the timestamp, or do I have to use a filter object?


Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Michael Segel
Good question.. 

You create a class MyRO. 

How many instances of  MyRO exist per RS?

How many queries can access the instance MyRO at the same time? 




On Feb 19, 2013, at 9:15 AM, Wei Tan  wrote:

> A side question: if HTablePool is not encouraged to be used... how we 
> handle the thread safeness in using HTable? Any replacement for 
> HTablePool, in plan?
> Thanks,
> 
> 
> Best Regards,
> Wei
> 
> 
> 
> 
> From:   Michel Segel 
> To: "user@hbase.apache.org" , 
> Date:   02/18/2013 09:23 AM
> Subject:Re: coprocessor enabled put very slow, help please~~~
> 
> 
> 
> Why are you using an HTable Pool?
> Why are you closing the table after each iteration through?
> 
> Try using 1 HTable object. Turn off WAL
> Initiate in start()
> Close in Stop()
> Surround the use in a try / catch
> If exception caught, re instantiate new HTable connection.
> 
> Maybe want to flush the connection after puts. 
> 
> 
> Again not sure why you are using check and put on the base table. Your 
> count could be off.
> 
> As an example look at poem/rhyme 'Marry had a little lamb'.
> Then check your word count.
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Feb 18, 2013, at 7:21 AM, prakash kadel  
> wrote:
> 
>> Thank you guys for your replies,
>> Michael,
>>  I think i didnt make it clear. Here is my use case,
>> 
>> I have text documents to insert in the hbase. (With possible duplicates)
>> Suppose i have a document as : " I am working. He is not working"
>> 
>> I want to insert this document to a table in hbase, say table "doc"
>> 
>> =doc table=
>> -
>> rowKey : doc_id
>> cf: doc_content
>> value: "I am working. He is not working"
>> 
>> Now, i to create another table that stores the word count, say "doc_idx"
>> 
>> doc_idx table
>> ---
>> rowKey : I, cf: count, value: 1
>> rowKey : am, cf: count, value: 1
>> rowKey : working, cf: count, value: 2
>> rowKey : He, cf: count, value: 1
>> rowKey : is, cf: count, value: 1
>> rowKey : not, cf: count, value: 1
>> 
>> My MR job code:
>> ==
>> 
>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>   for(String word : doc_content.split("\\s+")) {
>>  Increment inc = new Increment(Bytes.toBytes(word));
>>  inc.addColumn("count", "", 1);
>>   }
>> }
>> 
>> Now, i wanted to do some experiments with coprocessors. So, i modified
>> the code as follows.
>> 
>> My MR job code:
>> ===
>> 
>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>> 
>> Coprocessor code:
>> ===
>> 
>>   public void start(CoprocessorEnvironment env)  {
>>   pool = new HTablePool(conf, 100);
>>   }
>> 
>>   public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>> compareOp, comparator,  put, result) {
>> 
>>   if(!result) return true; // check if the put succeeded
>> 
>>   HTableInterface table_idx = pool.getTable("doc_idx");
>> 
>>   try {
>> 
>>   for(KeyValue contentKV = put.get("doc_content", "")) {
>>   for(String word :
>> contentKV.getValue().split("\\s+")) {
>>   Increment inc = new
>> Increment(Bytes.toBytes(word));
>>   inc.addColumn("count", "", 1);
>>   table_idx.increment(inc);
>>   }
>>  }
>>   } finally {
>>   table_idx.close();
>>   }
>>   return true;
>>   }
>> 
>>   public void stop(env) {
>>   pool.close();
>>   }
>> 
>> I am a newbee to HBASE. I am not sure this is the way to do.
>> Given that, why is the cooprocessor enabled version much slower than
>> the one without?
>> 
>> 
>> Sincerely,
>> Prakash Kadel
>> 
>> 
>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>>  wrote:
>>> 
>>> The  issue I was talking about was the use of a check and put.
>>> The OP wrote:
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
> rows to
>>> a index table.
>>> 
>>> My question is why does the OP use a checkAndPut, and the 
> RegionObserver's postChecAndPut?
>>> 
>>> 
>>> Here's a good example... 
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
> 
>>> 
>>> The OP doesn't really get in to the use case, so we don't know why the 
> Check and Put in the M/R job.
>>> He should just be using put() and then a postPut().
>>> 
>>> Another issue... since he's writing to  a different HTable... how? Does 
> he create an HTable instance in the start() method of his RO object and 
> then reference it later? Or does he create the instance of the HTable on 
> the fly in each postCheckAndPut() ?
>>> Without seeing his code, we don't know.
>>> 
>>> Note that this is synchronous set of writes. Your overall return from 
> the M/R call to put will wait until the second row is inserted.
>>> 
>>> Interestingly enough, you may want to consider disabling the WAL on the 
> w

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Varun Sharma
I have another question, if I am running a scan wrapped around multiple
rows in the same region, in the following way:

Scan scan = new scan(getWithMultipleRowsInSameRegion);

Now, how does execution occur. Is it just a sequential scan across the
entire region or does it seek to hfile blocks containing the actual values.
What I truly mean is, lets say the multi get is on following rows:

Row1 : HFileBlock1
Row2 : HFileBlock20
Row3 : Does not exist
Row4 : HFileBlock25
Row5 : HFileBlock100

The efficient way to do this would be to determine the correct blocks using
the index and then searching within the blocks for, say Row1. Then, seek to
HFileBlock20 and then look for Row2. Eliminate Row3 and then keep on
seeking to + searching within HFileBlocks as needed.

I am wondering if a scan wrapped around a Get with multiple rows would do
the same ?

Thanks
Varun

On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon  wrote:

> Looking at the code, it seems possible to do this server side within the
> multi invocation: we could group the get by region, and do a single scan.
> We could also add some heuristics if necessary...
>
>
>
> On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl  wrote:
>
> > I should qualify that statement, actually.
> >
> > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > returned.
> >
> > As James Taylor pointed out to me privately: A fairer comparison would
> > have been to run a scan with a filter that lets x% of the rows pass (i.e.
> > the selectivity of the scan would be x%) and compare that to a multi Get
> of
> > the same x% of the row.
> >
> > There we found that a Scan+Filter is more efficient that issuing multi
> > Gets if x is >= 1-2%.
> >
> >
> > Or in other words, translating many Gets into a Scan+Filter is beneficial
> > if the Scan would return at least 1-2% of the rows to the client.
> > For example:
> > if you are looking for less than 10-20k rows in 1m rows, using muli Gets
> > is likely more efficient.
> > if you are looking for more than 10-20k rows in 1m rows, using a
> > Scan+Filter is likely more efficient.
> >
> >
> > Of course this is predicated on whether you have an efficient way to
> > represent the rows you are looking for in a filter, so that would
> probably
> > shift this slightly more towards Gets (just imaging a Filter that to
> encode
> > 100k random row keys to be matched; since Filters are instantiated store
> > there is another natural limit there).
> >
> >
> > As I said below, the crux of the matter is having some histograms of your
> > data, so that such a decision could be made automatically.
> >
> >
> > -- Lars
> >
> >
> >
> > 
> >  From: lars hofhansl 
> > To: "user@hbase.apache.org" 
> > Sent: Monday, February 18, 2013 5:48 PM
> > Subject: Re: Optimizing Multi Gets in hbase
> >
> > As it happens we did some tests around last week.
> > Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> > the performance.
> >
> > I.e. when you have a table with, say, 10m rows and scanning take N
> > seconds, then calling 10m Gets in batches of 1000 take ~3N, which is
> pretty
> > impressive.
> >
> > Now, this is with all data in the cache!
> > When the data is not in the cache and the Gets are random it is many
> > orders of magnitude slower, as the Gets are sprayed all over the disk. In
> > that case sorting the Gets and issuing scans would indeed be much more
> > efficient.
> >
> >
> > The Gets in a batch are already sorted on the client, but as N. says it
> is
> > hard to determine when to turn many Gets into a Scan with filters
> > automatically. Without statistics/histograms I'd even wager a guess that
> > would be impossible to do.
> > Imagine you issue 1 random Gets, but your table has 10bn rows, in
> that
> > case it is almost certain that the Gets are faster than a scan.
> > Now image the Gets only cover a small key range. With statistics we could
> > tell whether it would beneficial to turn this into a scan.
> >
> > It's not that hard to add statistics to HBase. Would do it as part of the
> > compactions, and record the histograms in some table.
> >
> >
> > You can always do that yourself. If you suspect you are touching most
> rows
> > in a table/region, just issue a scan with a appropriate filter (may have
> to
> > implement your own filter, though). Maybe we could a version of RowFilter
> > that match against multiple keys.
> >
> >
> > -- Lars
> >
> >
> >
> > 
> > From: Varun Sharma 
> > To: user@hbase.apache.org
> > Sent: Monday, February 18, 2013 1:57 AM
> > Subject: Optimizing Multi Gets in hbase
> >
> > Hi,
> >
> > I am trying to batched get(s) on a cluster. Here is the code:
> >
> > List gets = ...
> > // Prepare my gets with the rows i need
> > myHTable.get(gets);
> >
> > I have two questions about the above scenario:
> > i) Is this the most optimal way to do this ?
> > ii) I have a feeling that if there are multiple gets in this case, on the

Re: coprocessor enabled put very slow, help please~~~

2013-02-19 Thread Wei Tan
A side question: if HTablePool is not encouraged to be used... how do we 
handle thread safety when using HTable? Is any replacement for 
HTablePool planned?
Thanks,


Best Regards,
Wei




From:   Michel Segel 
To: "user@hbase.apache.org" , 
Date:   02/18/2013 09:23 AM
Subject:Re: coprocessor enabled put very slow, help please~~~



Why are you using an HTable Pool?
Why are you closing the table after each iteration through?

Try using 1 HTable object. Turn off WAL
Initiate in start()
Close in Stop()
Surround the use in a try / catch
If exception caught, re instantiate new HTable connection.

Maybe want to flush the connection after puts. 


Again not sure why you are using check and put on the base table. Your 
count could be off.

As an example look at poem/rhyme 'Marry had a little lamb'.
Then check your word count.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 18, 2013, at 7:21 AM, prakash kadel  
wrote:

> Thank you guys for your replies,
> Michael,
>   I think i didnt make it clear. Here is my use case,
> 
> I have text documents to insert in the hbase. (With possible duplicates)
> Suppose i have a document as : " I am working. He is not working"
> 
> I want to insert this document to a table in hbase, say table "doc"
> 
> =doc table=
> -
> rowKey : doc_id
> cf: doc_content
> value: "I am working. He is not working"
> 
> Now, i to create another table that stores the word count, say "doc_idx"
> 
> doc_idx table
> ---
> rowKey : I, cf: count, value: 1
> rowKey : am, cf: count, value: 1
> rowKey : working, cf: count, value: 2
> rowKey : He, cf: count, value: 1
> rowKey : is, cf: count, value: 1
> rowKey : not, cf: count, value: 1
> 
> My MR job code:
> ==
> 
> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>for(String word : doc_content.split("\\s+")) {
>   Increment inc = new Increment(Bytes.toBytes(word));
>   inc.addColumn("count", "", 1);
>}
> }
> 
> Now, i wanted to do some experiments with coprocessors. So, i modified
> the code as follows.
> 
> My MR job code:
> ===
> 
> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
> 
> Coprocessor code:
> ===
> 
>    public void start(CoprocessorEnvironment env) {
>        pool = new HTablePool(env.getConfiguration(), 100);
>    }
> 
>    public boolean postCheckAndPut(c, row, family, byte[] qualifier,
>            compareOp, comparator, put, result) {
> 
>        if (!result) return true; // check if the put succeeded
> 
>        HTableInterface table_idx = pool.getTable("doc_idx");
> 
>        try {
>            for (KeyValue contentKV : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
>                for (String word : Bytes.toString(contentKV.getValue()).split("\\s+")) {
>                    Increment inc = new Increment(Bytes.toBytes(word));
>                    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1);
>                    table_idx.increment(inc);
>                }
>            }
>        } finally {
>            table_idx.close();
>        }
>        return true;
>    }
> 
>    public void stop(env) {
>        pool.close();
>    }
> 
> I am a newbie to HBase. I am not sure this is the right way to do it.
> Given that, why is the coprocessor-enabled version much slower than
> the one without?
> 
> 
> Sincerely,
> Prakash Kadel
> 
> 
> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>  wrote:
>> 
>> The  issue I was talking about was the use of a check and put.
>> The OP wrote:
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some 
rows to
>> a index table.
>> 
>> My question is why does the OP use a checkAndPut, and the 
RegionObserver's postCheckAndPut?
>> 
>> 
>> Here's a good example... 
http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put

>> 
>> The OP doesn't really get in to the use case, so we don't know why the 
Check and Put in the M/R job.
>> He should just be using put() and then a postPut().
>> 
>> Another issue... since he's writing to  a different HTable... how? Does 
he create an HTable instance in the start() method of his RO object and 
then reference it later? Or does he create the instance of the HTable on 
the fly in each postCheckAndPut() ?
>> Without seeing his code, we don't know.
>> 
>> Note that this is a synchronous set of writes. Your overall return from 
the M/R call to put will wait until the second row is inserted.
>> 
>> Interestingly enough, you may want to consider disabling the WAL on the 
write to the index. You can always run an M/R job that rebuilds the index 
should something occur to the system where you might lose the data. 
Indexes *ARE* expendable. ;-)
>> 
>> Does that explain it?
>> 
>> -Mike
>> 
>> On Feb 18, 2013, at 4:57 AM, yonghu  wrote:
>> 
>>> Hi, Michael
>>> 
>>> I don't quite understand what do you mean by "round trip back to the
>>> client". In my understandin

Re: Table deleted after restart of computer

2013-02-19 Thread Ibrahim Yakti
Hello Paul,

The default location for HBase data is /tmp, so it can be wiped when you
restart your machine. You need to change it as described in
http://hbase.apache.org/book.html#quickstart
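
For a standalone setup that usually means pointing hbase.rootdir (and the
ZooKeeper data dir) somewhere outside /tmp in conf/hbase-site.xml, for
example (the paths are only placeholders):

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/youruser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/youruser/zookeeper</value>
  </property>
</configuration>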




--
Ibrahim


On Tue, Feb 19, 2013 at 5:54 PM, Ted Yu  wrote:

> Which HBase / hadoop version were you using ?
>
> Did you start the cluster in standalone mode ?
>
> Thanks
>
> On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven <
> paul.van.ho...@googlemail.com> wrote:
>
> > I just started with hbase. Therefore I created a table and filled this
> > table with some data. But after restarting my computer all the data
> > has gone. This even happens when stopping hbase with stop-hbase.sh.
> >
> > How can this happen?
> >
>


Re: Table deleted after restart of computer

2013-02-19 Thread Paul van Hoven
I installed hbase via brew.

brew install hadoop hbase pig hive

Then I started HBase via the start-hbase.sh command. Therefore I'm pretty
sure it is a standalone version.



2013/2/19 Ted Yu :
> Which HBase / hadoop version were you using ?
>
> Did you start the cluster in standalone mode ?
>
> Thanks
>
> On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven <
> paul.van.ho...@googlemail.com> wrote:
>
>> I just started with hbase. Therefore I created a table and filled this
>> table with some data. But after restarting my computer all the data
>> has gone. This even happens when stopping hbase with stop-hbase.sh.
>>
>> How can this happen?
>>


Re: Table deleted after restart of computer

2013-02-19 Thread Ted Yu
Which HBase / hadoop version were you using ?

Did you start the cluster in standalone mode ?

Thanks

On Tue, Feb 19, 2013 at 5:23 AM, Paul van Hoven <
paul.van.ho...@googlemail.com> wrote:

> I just started with hbase. Therefore I created a table and filled this
> table with some data. But after restarting my computer all the data
> has gone. This even happens when stopping hbase with stop-hbase.sh.
>
> How can this happen?
>


Table deleted after restart of computer

2013-02-19 Thread Paul van Hoven
I just started with hbase. Therefore I created a table and filled this
table with some data. But after restarting my computer all the data
has gone. This even happens when stopping hbase with stop-hbase.sh.

How can this happen?


Re: storing lists in columns

2013-02-19 Thread Jean-Marc Spaggiari
Hi Stas,

Don't forget that you should always try to keep the number of column
families lower than 3, else you might face some performance issues.

JM

2013/2/19, Stas Maksimov :
> Hi Jean-Marc,
>
> I've validated this, it works perfectly. Very easy to implement and it's
> very fast!
>
> Thankfully in this project there isn't a lot of lists in each table, so I
> won't have to create too many column families. In other scenarios it could
> be a problem.
>
> Many thanks,
> Stas
>
>
> On 16 February 2013 02:29, Jean-Marc Spaggiari
> wrote:
>
>> Hi Stas,
>>
>> Few options are coming into my mind.
>>
>> Quickly:
>> 1) Why not store the products in specific columns instead of in the
>> same one? Like:
>> table, rowid1, cf:list, c:aa, value:true
>> table, rowid1, cf:list, c:bb, value:true
>> table, rowid1, cf:list, c:cc, value:true
>> table, rowid2, cf:list, c:aabb, value:true
>> table, rowid2, cf:list, c:cc, value:true
>> That way when you do a search you query directly the right column for
>> the right row. And using an "exists" call will also reduce the size of the
>> data transferred.
>>
>> 2) You can store the data in the opposite way. Like:
>> table, aa, cf:products, c:rowid1, value:true
>> table, aabb, cf:products, c:rowid2, value:true
>> table, bb, cf:products, c:rowid1, value:true
>> table, cc, cf:products, c:rowid1, value:true
>> table, cc, cf:products, c:rowid2, value:true
>> Here, you query by your product ID, and you search the column based on
>> your previous rowid.
>>
>>
>> I will say the 2 solutions are equivalent, but it will really depend
>> on your data pattern and your query pattern.
>>
>> JM
>>
>> 2013/2/15, Stas Maksimov :
>> > Hi all,
>> >
>> > I have a requirement to store lists in HBase columns like this:
>> > "table", "rowid1", "f:list", "aa, bb, cc"
>> > "table", "rowid2", "f:list", "aabb, cc"
>> >
>> > There is a further requirement to be able to find rows where f:list
>> > contains a particular item, e.g. when I need to find rows having item
>> "aa"
>> > only "rowid1" should match, and for item "cc" both "rowid1" and
>> > "rowid2"
>> > should match.
>> >
>> > For now I decided to use SingleColumnValueFilter with substring
>> > matching.
>> > As using comma-separated list proved difficult to search through, I'm
>> using
>> > pipe symbols to separate items like this: "|aa|bb|cc|", so that I could
>> > pass the search item surrounded by pipes into the filter:
>> > SingleColumnValueFilter ('f', 'list', =, 'substring:|aa|')
>> >
>> > This proved to work effectively enough, however I would prefer to use
>> > something more standard for my list storage (e.g. serialised JSON), or
>> > perhaps something even more optimised for a search - performance really
>> > does matter here.
>> >
>> > Any opinions on this solution and possible enhancements are much
>> > appreciated.
>> >
>> > Many thanks,
>> > Stas
>> >
>>
>


Re: storing lists in columns

2013-02-19 Thread Stas Maksimov
Hi Jean-Marc,

I've validated this, it works perfectly. Very easy to implement and it's
very fast!

Thankfully in this project there isn't a lot of lists in each table, so I
won't have to create too many column families. In other scenarios it could
be a problem.
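
For reference, option 1 from Jean-Marc's mail below boils down to something
like this with the plain Java client (a rough, untested sketch; the table,
family and qualifier names are just the ones from his example):

Configuration conf = HBaseConfiguration.create();
HTable htable = new HTable(conf, "table");

// write: one column per list item, the stored value itself barely matters
Put put = new Put(Bytes.toBytes("rowid1"));
put.add(Bytes.toBytes("list"), Bytes.toBytes("aa"), Bytes.toBytes("true"));
put.add(Bytes.toBytes("list"), Bytes.toBytes("bb"), Bytes.toBytes("true"));
put.add(Bytes.toBytes("list"), Bytes.toBytes("cc"), Bytes.toBytes("true"));
htable.put(put);

// read: does rowid1 contain item "aa"?
Get get = new Get(Bytes.toBytes("rowid1"));
get.addColumn(Bytes.toBytes("list"), Bytes.toBytes("aa"));
boolean hasAa = htable.exists(get);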

Many thanks,
Stas


On 16 February 2013 02:29, Jean-Marc Spaggiari wrote:

> Hi Stas,
>
> Few options are coming into my mind.
>
> Quickly:
> 1) Why not store the products in specific columns instead of in the
> same one? Like:
> table, rowid1, cf:list, c:aa, value:true
> table, rowid1, cf:list, c:bb, value:true
> table, rowid1, cf:list, c:cc, value:true
> table, rowid2, cf:list, c:aabb, value:true
> table, rowid2, cf:list, c:cc, value:true
> That way when you do a search you query directly the right column for
> the right row. And using an "exists" call will also reduce the size of the
> data transferred.
>
> 2) You can store the data in the opposite way. Like:
> table, aa, cf:products, c:rowid1, value:true
> table, aabb, cf:products, c:rowid2, value:true
> table, bb, cf:products, c:rowid1, value:true
> table, cc, cf:products, c:rowid1, value:true
> table, cc, cf:products, c:rowid2, value:true
> Here, you query by your product ID, and you search the column based on
> your previous rowid.
>
>
> I will say the 2 solutions are equivalent, but it will really depend
> on your data pattern and your query pattern.
>
> JM
>
> 2013/2/15, Stas Maksimov :
> > Hi all,
> >
> > I have a requirement to store lists in HBase columns like this:
> > "table", "rowid1", "f:list", "aa, bb, cc"
> > "table", "rowid2", "f:list", "aabb, cc"
> >
> > There is a further requirement to be able to find rows where f:list
> > contains a particular item, e.g. when I need to find rows having item
> "aa"
> > only "rowid1" should match, and for item "cc" both "rowid1" and "rowid2"
> > should match.
> >
> > For now I decided to use SingleColumnValueFilter with substring matching.
> > As using comma-separated list proved difficult to search through, I'm
> using
> > pipe symbols to separate items like this: "|aa|bb|cc|", so that I could
> > pass the search item surrounded by pipes into the filter:
> > SingleColumnValueFilter ('f', 'list', =, 'substring:|aa|')
> >
> > This proved to work effectively enough, however I would prefer to use
> > something more standard for my list storage (e.g. serialised JSON), or
> > perhaps something even more optimised for a search - performance really
> > does matter here.
> >
> > Any opinions on this solution and possible enhancements are much
> > appreciated.
> >
> > Many thanks,
> > Stas
> >
>


Re: PreSplit the table with Long format

2013-02-19 Thread Farrokh Shahriari
Tnx for your help, but it doesn't work. Do you have any other idea, cause I
must run it from the shell.

Farrokh


On Tue, Feb 19, 2013 at 1:30 PM, Viral Bajaria wrote:

> HBase shell is a jruby shell and so you can invoke any java commands from
> it.
>
> For example:
> > import org.apache.hadoop.hbase.util.Bytes
> > Bytes.toLong(Bytes.toBytes(1000))
>
> Not sure if this works as expected since I don't have a terminal in front
> of me but you could try (assuming the SPLITS keyword takes byte array as
> input, never used SPLITS from the command line):
> create 'testTable', 'cf1' , { SPLITS => [ Bytes.toBytes(1000),
> Bytes.toBytes(2000), Bytes.toBytes(3000) ] }
>
> Thanks,
> Viral
>
> On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari <
> mohandes.zebeleh...@gmail.com> wrote:
>
> > Hi there
> > As I use rowkey in long format,I must presplit table in long format
> too.But
> > when I've run this command,it presplit the table with STRING format :
> > create 'testTable','cf1',{SPLITS => [ '1000','2000','3000']}
> >
> > How can I presplit the table with Long format ?
> >
> > Farrokh
> >
>


Re: PreSplit the table with Long format

2013-02-19 Thread Viral Bajaria
HBase shell is a jruby shell and so you can invoke any java commands from
it.

For example:
> import org.apache.hadoop.hbase.util.Bytes
> Bytes.toLong(Bytes.toBytes(1000))

Not sure if this works as expected since I don't have a terminal in front
of me but you could try (assuming the SPLITS keyword takes byte array as
input, never used SPLITS from the command line):
create 'testTable', 'cf1' , { SPLITS => [ Bytes.toBytes(1000),
Bytes.toBytes(2000), Bytes.toBytes(3000) ] }
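
If the JRuby dispatch doesn't give you an 8-byte long there, another (equally
untested) shell-only option is to write the big-endian long values out as
binary string literals, which the shell should pass through as raw bytes
(worth verifying the resulting region boundaries afterwards):

create 'testTable', 'cf1', { SPLITS => [
  "\x00\x00\x00\x00\x00\x00\x03\xE8",     # 1000 as an 8-byte big-endian long
  "\x00\x00\x00\x00\x00\x00\x07\xD0",     # 2000
  "\x00\x00\x00\x00\x00\x00\x0B\xB8" ] }  # 3000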

Thanks,
Viral

On Tue, Feb 19, 2013 at 1:52 AM, Farrokh Shahriari <
mohandes.zebeleh...@gmail.com> wrote:

> Hi there
> As I use rowkey in long format,I must presplit table in long format too.But
> when I've run this command,it presplit the table with STRING format :
> create 'testTable','cf1',{SPLITS => [ '1000','2000','3000']}
>
> How can I presplit the table with Long format ?
>
> Farrokh
>


Re: Optimizing Multi Gets in hbase

2013-02-19 Thread Nicolas Liochon
Looking at the code, it seems possible to do this server side within the
multi invocation: we could group the gets by region and do a single scan.
We could also add some heuristics if necessary...
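
Purely as a client-side illustration of that grouping step (not the
server-side change itself; untested 0.94-style code, where "gets" is the
List<Get> from the original mail and the table name is made up):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testTable");

// bucket the Gets by the region hosting each row; the server-side version
// would then turn a dense bucket into one scan instead of many point Gets
Map<String, List<Get>> byRegion = new HashMap<String, List<Get>>();
for (Get get : gets) {
  String region =
      table.getRegionLocation(get.getRow()).getRegionInfo().getEncodedName();
  List<Get> bucket = byRegion.get(region);
  if (bucket == null) {
    bucket = new ArrayList<Get>();
    byRegion.put(region, bucket);
  }
  bucket.add(get);
}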



On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl  wrote:

> I should qualify that statement, actually.
>
> I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> returned.
>
> As James Taylor pointed out to me privately: A fairer comparison would
> have been to run a scan with a filter that lets x% of the rows pass (i.e.
> the selectivity of the scan would be x%) and compare that to a multi Get of
> the same x% of the rows.
>
> There we found that a Scan+Filter is more efficient than issuing multi
> Gets if x is >= 1-2%.
>
>
> Or in other words, translating many Gets into a Scan+Filter is beneficial
> if the Scan would return at least 1-2% of the rows to the client.
> For example:
> if you are looking for less than 10-20k rows in 1m rows, using multi Gets
> is likely more efficient.
> if you are looking for more than 10-20k rows in 1m rows, using a
> Scan+Filter is likely more efficient.
>
>
> Of course this is predicated on whether you have an efficient way to
> represent the rows you are looking for in a filter, so that would probably
> shift this slightly more towards Gets (just imaging a Filter that to encode
> 100k random row keys to be matched; since Filters are instantiated store
> there is another natural limit there).
>
>
> As I said below, the crux of the matter is having some histograms of your
> data, so that such a decision could be made automatically.
>
>
> -- Lars
>
>
>
> 
>  From: lars hofhansl 
> To: "user@hbase.apache.org" 
> Sent: Monday, February 18, 2013 5:48 PM
> Subject: Re: Optimizing Multi Gets in hbase
>
> As it happens we did some tests around last week.
> Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> the performance.
>
> I.e. when you have a table with, say, 10m rows and scanning takes N
> seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is pretty
> impressive.
>
> Now, this is with all data in the cache!
> When the data is not in the cache and the Gets are random it is many
> orders of magnitude slower, as the Gets are sprayed all over the disk. In
> that case sorting the Gets and issuing scans would indeed be much more
> efficient.
>
>
> The Gets in a batch are already sorted on the client, but as N. says it is
> hard to determine when to turn many Gets into a Scan with filters
> automatically. Without statistics/histograms I'd even wager a guess that
> would be impossible to do.
> Imagine you issue 1 random Gets, but your table has 10bn rows, in that
> case it is almost certain that the Gets are faster than a scan.
> Now imagine the Gets only cover a small key range. With statistics we could
> tell whether it would be beneficial to turn this into a scan.
>
> It's not that hard to add statistics to HBase. Would do it as part of the
> compactions, and record the histograms in some table.
>
>
> You can always do that yourself. If you suspect you are touching most rows
> in a table/region, just issue a scan with an appropriate filter (may have to
> implement your own filter, though). Maybe we could add a version of RowFilter
> that matches against multiple keys.
>
>
> -- Lars
>
>
>
> 
> From: Varun Sharma 
> To: user@hbase.apache.org
> Sent: Monday, February 18, 2013 1:57 AM
> Subject: Optimizing Multi Gets in hbase
>
> Hi,
>
> I am trying to batched get(s) on a cluster. Here is the code:
>
> List gets = ...
> // Prepare my gets with the rows i need
> myHTable.get(gets);
>
> I have two questions about the above scenario:
> i) Is this the most optimal way to do this ?
> ii) I have a feeling that if there are multiple gets in this case, on the
> same region, then each one of those shall instantiate separate scan(s) over
> the region even though a single scan is sufficient. Am I mistaken here ?
>
> Thanks
> Varun
>


Re: HBase without compactions?

2013-02-19 Thread lars hofhansl
If you store data in LSM trees you need compactions.
The advantage is that your data files are immutable.
MapR has a mutable file system and they probably store their data in something 
more akin to B-Trees...?
Or maybe they somehow avoid the expensive merge sorting of many small files. It 
seems that it has to be one or the other.

(Maybe somebody from MapR reads this and can explain how it actually works.)

Compactions let you trade random IO for sequential IO (just to state the 
obvious). It seems that you can't have it both ways.

-- Lars




 From: Otis Gospodnetic 
To: user@hbase.apache.org 
Sent: Monday, February 18, 2013 7:30 PM
Subject: HBase without compactions?
 
Hello,

It's kind of funny, we run SPM, which includes SPM for HBase (performance
monitoring service/tool for HBase essentially) and we currently store all
performance metrics in HBase.

I see a ton of HBase development activity, which is great, but it just
occurred to me that I don't think I recall seeing anything about getting
rid of compactions. Yet, compactions are the one thing that I know hurts us the
most and the one thing that MapR somehow got rid of in their implementation.

Have there been any discussions, attempts, or thoughts about finding a way
to avoid compactions?

Thanks,
Otis
--
HBASE Performance Monitoring - http://sematext.com/spm/index.html

Re: Optimizing Multi Gets in hbase

2013-02-19 Thread lars hofhansl
I should qualify that statement, actually.

I was comparing scanning 1m KVs to getting 1m KVs when all KVs are returned.

As James Taylor pointed out to me privately: A fairer comparison would have 
been to run a scan with a filter that lets x% of the rows pass (i.e. the 
selectivity of the scan would be x%) and compare that to a multi Get of the 
same x% of the rows.

There we found that a Scan+Filter is more efficient than issuing multi Gets if 
x is >= 1-2%.


Or in other words, translating many Gets into a Scan+Filter is beneficial if 
the Scan would return at least 1-2% of the rows to the client.
For example:
if you are looking for less than 10-20k rows in 1m rows, using multi Gets is 
likely more efficient.
if you are looking for more than 10-20k rows in 1m rows, using a Scan+Filter is 
likely more efficient.


Of course this is predicated on whether you have an efficient way to represent 
the rows you are looking for in a filter, so that would probably shift this 
slightly more towards Gets (just imagine a Filter that has to encode 100k random 
row keys to be matched; since Filters are instantiated per store there is another 
natural limit there).


As I said below, the crux of the matter is having some histograms of your data, 
so that such a decision could be made automatically.
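
For completeness, the two access patterns being compared look roughly like
this on the client (0.94-style API; "table" is an open HTable and "keys" a
hypothetical sorted List<byte[]> of the wanted row keys):

// 1) multi Get: one point lookup per wanted row
List<Get> gets = new ArrayList<Get>();
for (byte[] key : keys) {
  gets.add(new Get(key));
}
Result[] viaGets = table.get(gets);

// 2) one Scan over the covering key range; the unwanted rows are dropped on
// the client here, a (custom) server-side filter would avoid shipping them
NavigableSet<byte[]> wanted = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
wanted.addAll(keys);
byte[] startRow = keys.get(0);
byte[] stopRow = Bytes.add(keys.get(keys.size() - 1), new byte[] { 0 }); // stop row is exclusive
Scan scan = new Scan(startRow, stopRow);
ResultScanner scanner = table.getScanner(scan);
List<Result> viaScan = new ArrayList<Result>();
for (Result r : scanner) {
  if (wanted.contains(r.getRow())) {
    viaScan.add(r);
  }
}
scanner.close();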


-- Lars




 From: lars hofhansl 
To: "user@hbase.apache.org"  
Sent: Monday, February 18, 2013 5:48 PM
Subject: Re: Optimizing Multi Gets in hbase
 
As it happens we did some tests around last week.
Turns out doing Gets in batches instead of a scan still gives you 1/3 of the 
performance.

I.e. when you have a table with, say, 10m rows and scanning takes N seconds, 
then calling 10m Gets in batches of 1000 takes ~3N, which is pretty impressive.

Now, this is with all data in the cache!
When the data is not in the cache and the Gets are random it is many orders of 
magnitude slower, as the Gets are sprayed all over the disk. In that case 
sorting the Gets and issuing scans would indeed be much more efficient.


The Gets in a batch are already sorted on the client, but as N. says it is hard 
to determine when to turn many Gets into a Scan with filters automatically. 
Without statistics/histograms I'd even wager a guess that would be impossible 
to do.
Imagine you issue 1 random Gets, but your table has 10bn rows, in that case 
it is almost certain that the Gets are faster than a scan.
Now imagine the Gets only cover a small key range. With statistics we could tell 
whether it would be beneficial to turn this into a scan.

It's not that hard to add statistics to HBase. Would do it as part of the 
compactions, and record the histograms in some table.


You can always do that yourself. If you suspect you are touching most rows in a 
table/region, just issue a scan with an appropriate filter (may have to 
implement your own filter, though). Maybe we could add a version of RowFilter that 
matches against multiple keys.


-- Lars




From: Varun Sharma 
To: user@hbase.apache.org 
Sent: Monday, February 18, 2013 1:57 AM
Subject: Optimizing Multi Gets in hbase

Hi,

I am trying to batched get(s) on a cluster. Here is the code:

List gets = ...
// Prepare my gets with the rows i need
myHTable.get(gets);

I have two questions about the above scenario:
i) Is this the most optimal way to do this ?
ii) I have a feeling that if there are multiple gets in this case, on the
same region, then each one of those shall instantiate separate scan(s) over
the region even though a single scan is sufficient. Am I mistaken here ?

Thanks
Varun