Re: HBase Thrift - HTTP - Kerberos & SPNEGO

2018-01-11 Thread Kevin Risden
"HBase Thrift2 "implementation" makes more sense to me"

I agree with that statement, since Thrift2 follows the current HBase API more
closely while Thrift 1 follows the old HBase API. I don't think using Thrift2
with Hue is an option right now, though: Hue still talks to Thrift 1, and I'm
not really looking to rewrite the Hue HBase Thrift module. There didn't look
to be much code shared between the Thrift 1 and Thrift 2 server
implementations. The Thrift 1 server looks very much like HiveServer2, and the
early 401 bail-out might also apply there.
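
For context, the "401 bail out early" idea looks roughly like the sketch below
(illustrative only; the class name is made up and this is not the actual
patch). It layers the check on top of Thrift's stock TServlet:

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.thrift.TProcessor;
import org.apache.thrift.protocol.TProtocolFactory;
import org.apache.thrift.server.TServlet;

// Illustrative only: if the client sent no Authorization header, challenge it
// with SPNEGO right away instead of letting the Kerberos code fail later.
public class SpnegoAwareThriftServlet extends TServlet {

  public SpnegoAwareThriftServlet(TProcessor processor, TProtocolFactory protocolFactory) {
    super(processor, protocolFactory);
  }

  @Override
  protected void doPost(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    String authHeader = request.getHeader("Authorization");
    if (authHeader == null || authHeader.isEmpty()) {
      // No credentials at all: answer 401 with a Negotiate challenge and stop here.
      response.setHeader("WWW-Authenticate", "Negotiate");
      response.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
      return;
    }
    // Credentials present: continue with the normal Kerberos-aware handling.
    super.doPost(request, response);
  }
}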

I'll open a JIRA and throw up a patch hopefully this week.

Kevin Risden

On Thu, Jan 11, 2018 at 9:50 AM, Josh Elser  wrote:

> Hey Kevin!
>
> Looks like you got some good changes in here.
>
> IMO, the HBase Thrift2 "implementation" makes more sense to me (I'm sure
> there was a reason for having HTTP be involved at one point, but Thrift
> today has the ability to do all of this RPC work for us). I'm not sure what
> the HBase API implementations look like between the two versions.
>
> If you'd like to open up a JIRA and throw up a patch, you'd definitely
> have my attention if no one else's :)
>
>
> On 1/11/18 9:31 AM, Kevin Risden wrote:
>
>> I'm not 100% sure this should be posted to user list, but starting here
>> before dev list/JIRA.
>>
>> I've been working on setting up the Hue HBase app, which requires the HBase
>> Thrift v1 server. To support impersonation/proxyuser, the documentation
>> states that this must be done over HTTP rather than in binary mode. The
>> cluster has Kerberos enabled, so the final setup ends up being HBase Thrift
>> in HTTP mode with Kerberos.
>>
>> While setting up the HBase Thrift server with HTTP, there were a significant
>> number of 401 errors where the HBase Thrift server wasn't able to handle the
>> incoming Kerberos request. Documentation online is sparse when it comes to
>> setting up the principal/keytab for HTTP Kerberos.
>>
>> I noticed that the HBase Thrift HTTP implementation was missing a SPNEGO
>> principal/keytab like other Thrift-based servers (HiveServer2) have. It looks
>> like the HiveServer2 Thrift implementation and the HBase Thrift v1
>> implementation were very close to the same at one point. I made the following
>> changes to the HBase Thrift v1 server implementation to make it work:
>> * add a SPNEGO principal/keytab if in HTTP mode
>> * return 401 immediately if there is no Authorization header, instead of
>> waiting for a try/catch further down in the program flow
>>
>> The code changes are available here:
>> https://github.com/risdenk/hortonworks-hbase-release/compare/HDP-2.5.3.126-base...fix_hbase_thrift_spnego
>>
>> Does this seem like the right approach?
>>
>> The same types of changes should apply to master as well. If this looks
>> reasonable, I can create a JIRA and generate a patch against Apache HBase
>> master.
>>
>> Side note: I saw notes that HBase Thrift v1 was meant to go away at some
>> point, but it looks like it is still being depended on.
>>
>> Kevin Risden
>>
>>


Re: HBase Thrift - HTTP - Kerberos & SPNEGO

2018-01-11 Thread Josh Elser

Hey Kevin!

Looks like you got some good changes in here.

IMO, the HBase Thrift2 "implementation" makes more sense to me (I'm sure 
there was a reason for having HTTP be involved at one point, but Thrift 
today has the ability to do all of this RPC work for us). I'm not sure 
what the HBase API implementations look like between the two versions.


If you'd like to open up a JIRA and throw up a patch, you'd definitely 
have my attention if no one else's :)


On 1/11/18 9:31 AM, Kevin Risden wrote:

I'm not 100% sure this should be posted to user list, but starting here
before dev list/JIRA.

I've been working on setting up the Hue HBase app, which requires the HBase
Thrift v1 server. To support impersonation/proxyuser, the documentation states
that this must be done over HTTP rather than in binary mode. The cluster has
Kerberos enabled, so the final setup ends up being HBase Thrift in HTTP mode
with Kerberos.

While setting up the HBase Thrift server with HTTP, there were a significant
number of 401 errors where the HBase Thrift server wasn't able to handle the
incoming Kerberos request. Documentation online is sparse when it comes to
setting up the principal/keytab for HTTP Kerberos.

I noticed that the HBase Thrift HTTP implementation was missing a SPNEGO
principal/keytab like other Thrift-based servers (HiveServer2) have. It looks
like the HiveServer2 Thrift implementation and the HBase Thrift v1
implementation were very close to the same at one point. I made the following
changes to the HBase Thrift v1 server implementation to make it work:
* add a SPNEGO principal/keytab if in HTTP mode
* return 401 immediately if there is no Authorization header, instead of
waiting for a try/catch further down in the program flow

The code changes are available here:
https://github.com/risdenk/hortonworks-hbase-release/compare/HDP-2.5.3.126-base...fix_hbase_thrift_spnego

Does this seem like the right approach?

The same types of changes should apply to master as well. If this looks
reasonable, I can create a JIRA and generate a patch against Apache HBase
master.

Side note: I saw notes that HBase Thrift v1 was meant to go away at some
point, but it looks like it is still being depended on.

Kevin Risden



Re: Avoiding duplicate writes

2018-01-11 Thread Ted Yu
Peter:
Normally java.lang.System.nanoTime() is used for measuring elapsed time
(durations), not as a wall-clock timestamp.

See also
https://www.javacodegeeks.com/2012/02/what-is-behind-systemnanotime.html
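
To illustrate the distinction, a minimal example (not tied to any HBase API):
nanoTime() has an arbitrary origin and is only meaningful for durations inside
one JVM, while HBase cell timestamps are epoch milliseconds.

public class NanoTimeDemo {
  public static void main(String[] args) throws InterruptedException {
    long startNanos = System.nanoTime();           // arbitrary origin, for durations
    long wallClockMs = System.currentTimeMillis(); // epoch-based, what HBase cells use

    Thread.sleep(5);

    long elapsedNanos = System.nanoTime() - startNanos; // valid use: elapsed time
    System.out.println("Elapsed: " + elapsedNanos + " ns");
    System.out.println("Wall clock (epoch ms): " + wallClockMs);
    // Feeding nanoTime() straight into a cell timestamp would mix an arbitrary
    // origin with epoch-based timestamps, which is one source of the "other
    // problems" mentioned below.
  }
}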

bq. the prePut co-processor is executed inside a record lock

The prePut hook is called with a read lock held on the underlying region.


Have you heard of HLC (Hybrid Logical Clocks)? See HBASE-14070.

The work hasn't been active recently.

FYI

On Thu, Jan 11, 2018 at 2:16 AM, Peter Marron 
wrote:

> Hi,
>
> We have a problem when we are writing lots of records to HBase.
> We are not specifying timestamps explicitly, so the situation arises where
> multiple records are written in the same millisecond.
> Unfortunately, when records are written with the same timestamp, the later
> writes are treated as updates of the previous records rather than as the
> separate records we want.
> So we want to be able to guarantee that records are not treated as
> overwrites (unless we explicitly make them so).
>
> As I understand it there are (at least) two different ways to proceed.
>
> The first approach is to increase the resolution of the timestamp.
> So we could use something like java.lang.System.nanoTime().
> However, although this seems to ameliorate the problem, it seems to
> introduce other problems.
> Also, ideally we would like something that guarantees that we don't lose
> writes, rather than just making lost writes less likely.
>
> The second approach is to write a prePut co-processor.
> In the prePut I can do a read using the same rowkey, column family and
> column qualifier and omit the timestamp.
> As I understand it this will return me the latest timestamp.
> Then I can update the timestamp that I am going to write, if necessary, to
> make sure that the timestamp is always unique.
> In this way I can guarantee that none of my writes are accidentally turned
> into updates.
>
> However this approach seems to be expensive.
> I have to do a read before each write, and although (I believe) it will be
> on the same region server, it's still going to slow things down a lot.
> Also I am assuming that the prePut co-processor is executed inside a
> record lock so that I don't have to worry about synchronization.
> Is this true?
>
> Is there a better way?
>
> Maybe there is some implementation of this already that I can pick up?
>
> Maybe there is some way that I can implement this more efficiently?
>
> It seems to me that this might be better handled at compaction.
> Shouldn't there be some way that I can mark writes with some sort of
> special value of timestamp that means that this write should never be
> considered as an update but always as a separate write?
>
> Any advice gratefully received.
>
> Peter Marron
>


HBase Thrift - HTTP - Kerberos & SPNEGO

2018-01-11 Thread Kevin Risden
I'm not 100% sure this should be posted to user list, but starting here
before dev list/JIRA.

I've been working on setting up the Hue HBase app, which requires the HBase
Thrift v1 server. To support impersonation/proxyuser, the documentation states
that this must be done over HTTP rather than in binary mode. The cluster has
Kerberos enabled, so the final setup ends up being HBase Thrift in HTTP mode
with Kerberos.

While setting up the HBase Thrift server with HTTP, there were a significant
number of 401 errors where the HBase Thrift server wasn't able to handle the
incoming Kerberos request. Documentation online is sparse when it comes to
setting up the principal/keytab for HTTP Kerberos.

I noticed that the HBase Thrift HTTP implementation was missing a SPNEGO
principal/keytab like other Thrift-based servers (HiveServer2) have. It looks
like the HiveServer2 Thrift implementation and the HBase Thrift v1
implementation were very close to the same at one point. I made the following
changes to the HBase Thrift v1 server implementation to make it work:
* add a SPNEGO principal/keytab if in HTTP mode
* return 401 immediately if there is no Authorization header, instead of
waiting for a try/catch further down in the program flow
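
To make the first bullet concrete, a rough sketch of logging in a separate
SPNEGO identity from a keytab when HTTP mode is enabled (illustrative only;
the configuration property names below are placeholders, not confirmed HBase
keys, and this is not the code from the linked branch):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.util.Strings;
import org.apache.hadoop.net.DNS;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.UserGroupInformation;

public class SpnegoLoginSketch {

  // Placeholder property names, not necessarily the real HBase keys.
  static final String SPNEGO_PRINCIPAL_KEY = "hbase.thrift.spnego.principal";
  static final String SPNEGO_KEYTAB_KEY = "hbase.thrift.spnego.keytab.file";

  static UserGroupInformation loginSpnego(Configuration conf) throws IOException {
    // Resolve _HOST in the configured principal to this machine's hostname.
    String hostname = Strings.domainNamePointerToHostName(
        DNS.getDefaultHost("default", "default"));
    String principal =
        SecurityUtil.getServerPrincipal(conf.get(SPNEGO_PRINCIPAL_KEY), hostname);
    String keytab = conf.get(SPNEGO_KEYTAB_KEY);
    // Kerberos login from the keytab; the resulting UGI backs the SPNEGO
    // handshake performed by the HTTP layer.
    return UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab);
  }
}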

The code changes are available here:
https://github.com/risdenk/hortonworks-hbase-release/compare/HDP-2.5.3.126-base...fix_hbase_thrift_spnego

Does this seem like the right approach?

The same types of changes should apply to master as well. If this looks
reasonable, I can create a JIRA and generate a patch against Apache HBase
master.

Side note: I saw notes that HBase Thrift v1 was meant to go away at some
point, but it looks like it is still being depended on.

Kevin Risden


Re: Avoiding duplicate writes

2018-01-11 Thread Lalit Jadhav
Hello Peter,

You can add a random number to the row key to avoid overwriting rows.
Even if two writes land in the same millisecond and get the same timestamp,
the random component keeps the row keys (and hence the records) distinct.
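
A rough sketch of that idea (the class name, separator, and key format are
just for illustration): append a random suffix to the logical row key so two
records written in the same millisecond can never collapse into one row. Note
this changes the row key layout, so gets and scans must account for the suffix.

import java.util.concurrent.ThreadLocalRandom;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomSuffixRowKey {

  // e.g. "order-123" -> "order-123#48213"; the '#' separator and suffix width
  // are arbitrary choices for this sketch.
  static byte[] rowKeyFor(String logicalKey) {
    int suffix = ThreadLocalRandom.current().nextInt(100_000);
    return Bytes.toBytes(logicalKey + "#" + String.format("%05d", suffix));
  }

  static Put buildPut(String logicalKey, byte[] family, byte[] qualifier, byte[] value) {
    Put put = new Put(rowKeyFor(logicalKey));
    put.addColumn(family, qualifier, value);
    return put;
  }
}

A random suffix still has a small collision probability; a UUID or a
per-writer counter removes it, at the cost of longer keys.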

On Thu, Jan 11, 2018 at 3:46 PM, Peter Marron 
wrote:

> Hi,
>
> We have a problem when we are writing lots of records to HBase.
> We are not specifying timestamps explicitly, so the situation arises where
> multiple records are written in the same millisecond.
> Unfortunately, when records are written with the same timestamp, the later
> writes are treated as updates of the previous records rather than as the
> separate records we want.
> So we want to be able to guarantee that records are not treated as
> overwrites (unless we explicitly make them so).
>
> As I understand it there are (at least) two different ways to proceed.
>
> The first approach is to increase the resolution of the timestamp.
> So we could use something like java.lang.System.nanoTime().
> However, although this seems to ameliorate the problem, it seems to
> introduce other problems.
> Also, ideally we would like something that guarantees that we don't lose
> writes, rather than just making lost writes less likely.
>
> The second approach is to write a prePut co-processor.
> In the prePut I can do a read using the same rowkey, column family and
> column qualifier and omit the timestamp.
> As I understand it this will return me the latest timestamp.
> Then I can update the timestamp that I am going to write, if necessary, to
> make sure that the timestamp is always unique.
> In this way I can guarantee that none of my writes are accidentally turned
> into updates.
>
> However this approach seems to be expensive.
> I have to do a read before each write, and although (I believe) it will be
> on the same region server, it's still going to slow things down a lot.
> Also I am assuming that the prePut co-processor is executed inside a
> record lock so that I don't have to worry about synchronization.
> Is this true?
>
> Is there a better way?
>
> Maybe there is some implementation of this already that I can pick up?
>
> Maybe there is some way that I can implement this more efficiently?
>
> It seems to me that this might be better handled at compaction.
> Shouldn't there be some way that I can mark writes with some sort of
> special value of timestamp that means that this write should never be
> considered as an update but always as a separate write?
>
> Any advice gratefully received.
>
> Peter Marron
>



-- 
Regards,
Lalit Jadhav
Network Component Private Limited.


Avoiding duplicate writes

2018-01-11 Thread Peter Marron
Hi,

We have a problem when we are writing lots of records to HBase.
We are not specifying timestamps explicitly, so the situation arises where
multiple records are written in the same millisecond.
Unfortunately, when records are written with the same timestamp, the later
writes are treated as updates of the previous records rather than as the
separate records we want.
So we want to be able to guarantee that records are not treated as overwrites
(unless we explicitly make them so).

As I understand it there are (at least) two different ways to proceed.

The first approach is to increase the resolution of the timestamp.
So we could use something like java.lang.System.nanoTime().
However, although this seems to ameliorate the problem, it seems to introduce
other problems.
Also, ideally we would like something that guarantees that we don't lose
writes, rather than just making lost writes less likely.

The second approach is to write a prePut co-processor.
In the prePut I can do a read using the same rowkey, column family and column 
qualifier and omit the timestamp.
As I understand it this will return me the latest timestamp.
Then I can update the timestamp that I am going to write, if necessary, to make 
sure that the timestamp is always unique.
In this way I can guarantee that none of my writes are accidentally turned into 
updates.
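
A rough sketch of that prePut idea (illustrative only; it assumes the incoming
cells have settable timestamps, as KeyValue cells do, and the API details may
vary by HBase version):

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class UniqueTimestampObserver extends BaseRegionObserver {

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, Durability durability) throws IOException {
    for (Map.Entry<byte[], List<Cell>> entry : put.getFamilyCellMap().entrySet()) {
      for (Cell cell : entry.getValue()) {
        // Read the newest stored version for this row/family/qualifier.
        Get get = new Get(CellUtil.cloneRow(cell));
        get.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell));
        get.setMaxVersions(1);
        Result latest = ctx.getEnvironment().getRegion().get(get);
        if (latest.isEmpty()) {
          continue; // nothing stored yet, no collision possible
        }
        long latestTs = latest.rawCells()[0].getTimestamp();
        long incoming = cell.getTimestamp();
        // Cells with no explicit timestamp carry LATEST_TIMESTAMP and would be
        // stamped with currentTimeMillis later, so treat them as "now".
        long candidate = (incoming == HConstants.LATEST_TIMESTAMP)
            ? System.currentTimeMillis() : incoming;
        if (candidate <= latestTs) {
          candidate = latestTs + 1; // bump past the stored cell so nothing is overwritten
        }
        if (candidate != incoming) {
          CellUtil.setTimestamp(cell, candidate);
        }
      }
    }
  }
}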

However this approach seems to be expensive.
I have to do a read before each write, and although (I believe) it will be on 
the same region server, it's still going to slow things down a lot.
Also I am assuming that the prePut co-processor is executed inside a record 
lock so that I don't have to worry about synchronization.
Is this true?

Is there a better way?

Maybe there is some implementation of this already that I can pick up?

Maybe there is some way that I can implement this more efficiently?

It seems to me that this might be better handled at compaction.
Shouldn't there be some way that I can mark writes with some sort of special 
value of timestamp that means that this write should never be considered as an 
update but always as a separate write?

Any advice gratefully received.

Peter Marron


RE: CompleteBulkLoad Error

2018-01-11 Thread ashish singhi
As per my understanding, the online HBase book refers to the master version.

For a version-specific document, we can refer to the book that is part of the
release tarball, e.g. hbase-1.2.6-bin.tar.gz\hbase-1.2.6\docs\book.pdf

Regards,
Ashish

-Original Message-
From: Yung-An He [mailto:mathst...@gmail.com] 
Sent: Thursday, January 11, 2018 2:21 PM
To: user@hbase.apache.org
Subject: Re: CompleteBulkLoad Error

Ankit and Ashish, thanks for the reply.

I saw the command
`org.apache.hadoop.hbase.tool.LoadIncrementalHFiles`
in the ImportTsv documentation of the HBase book on the website and ran it as
the official documents describe. But that class name is for HBase 2.0.

Perhaps someone else has run into the same situation.
If there were official reference guides for each individual version, the
information would be clearer.


Regards,
Yung-An

2018-01-11 15:06 GMT+08:00 ashish singhi :

> Hi,
>
> The class name you are passing is wrong; it is
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.
> So the command will be:
> hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
> hdfs://hbase-master:9000/tmp/bktableoutput bktable
>
> Regards,
> Ashish
>
> -Original Message-
> From: Yung-An He [mailto:mathst...@gmail.com]
> Sent: Thursday, January 11, 2018 12:19 PM
> To: user@hbase.apache.org
> Subject: CompleteBulkLoad Error
>
> Hi,
>
> I import data from files into an HBase table via the ImportTsv command as
> below:
>
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1,cf:c2
> -Dimporttsv.skip.bad.lines=false
> '-Dimporttsv.separator=,'
> -Dimporttsv.bulk.output=hdfs://hbase-master:9000/tmp/bktableoutput
> bktable hdfs://hbase-master:9000/tmp/importsv
>
> and the MR job runs successfully. When I execute the completebulkload 
> command as below:
>
> hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
> hdfs://hbase-master:9000/tmp/bktableoutput bktable
>
> and it throws the exception:
> Error: Could not find or load main class
> org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
>
> I tried another command:
> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` 
> ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-server-1.2.6.jar
> completebulkload hdfs://hbase-master:9000/tmp/bktableoutput bktable
>
> and it succeeds.
>
> Does anyone have any idea?
>
>
> Here is the information about the HBase cluster:
>
> * HBase version 1.2.6
> * Hadoop version 2.7.5
> * 5 worker nodes
>


Re: CompleteBulkLoad Error

2018-01-11 Thread Yung-An He
Ankit and Ashish, thanks for the reply.

I saw the command
`org.apache.hadoop.hbase.tool.LoadIncrementalHFiles`
in the ImportTsv documentation of the HBase book on the website and ran it as
the official documents describe. But that class name is for HBase 2.0.

Perhaps someone else has run into the same situation.
If there were official reference guides for each individual version, the
information would be clearer.


Regards,
Yung-An

2018-01-11 15:06 GMT+08:00 ashish singhi :

> Hi,
>
> The class name you are passing is wrong; it is
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.
> So the command will be:
> hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
> hdfs://hbase-master:9000/tmp/bktableoutput bktable
>
> Regards,
> Ashish
>
> -Original Message-
> From: Yung-An He [mailto:mathst...@gmail.com]
> Sent: Thursday, January 11, 2018 12:19 PM
> To: user@hbase.apache.org
> Subject: CompleteBulkLoad Error
>
> Hi,
>
> I import data from files into an HBase table via the ImportTsv command as
> below:
>
> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
> -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1,cf:c2
> -Dimporttsv.skip.bad.lines=false
> '-Dimporttsv.separator=,'
> -Dimporttsv.bulk.output=hdfs://hbase-master:9000/tmp/bktableoutput
> bktable hdfs://hbase-master:9000/tmp/importsv
>
> and the MR job runs successfully. When I execute the completebulkload
> command as below:
>
> hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
> hdfs://hbase-master:9000/tmp/bktableoutput bktable
>
> and it throws the exception:
> Error: Could not find or load main class
> org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
>
> I tried another command:
> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath`
> ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-server-1.2.6.jar
> completebulkload hdfs://hbase-master:9000/tmp/bktableoutput bktable
>
> and it succeeds.
>
> Does anyone have any idea?
>
>
> Here is the information about the HBase cluster:
>
> * HBase version 1.2.6
> * Hadoop version 2.7.5
> * 5 worker nodes
>