Re: HBase Replication Between Two Secure Clusters With Different Kerberos KDC's

2018-05-23 Thread Saad Mufti
Thanks, here it is:

1. On the source cluster; it will be identical on the target cluster since both
use the same Kerberos realm name, though each has its own cluster-specific
KDC:

$ more /etc/zookeeper/conf/server-jaas.conf
>
> /**
>
>  * Licensed to the Apache Software Foundation (ASF) under one
>
>  * or more contributor license agreements.  See the NOTICE file
>
>  * distributed with this work for additional information
>
>  * regarding copyright ownership.  The ASF licenses this file
>
>  * to you under the Apache License, Version 2.0 (the
>
>  * "License"); you may not use this file except in compliance
>
>  * with the License.  You may obtain a copy of the License at
>
>  * 
>
>  * http://www.apache.org/licenses/LICENSE-2.0
>
>  * 
>
>  * Unless required by applicable law or agreed to in writing, software
>
>  * distributed under the License is distributed on an "AS IS" BASIS,
>
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> implied.
>
>  * See the License for the specific language governing permissions and
>
>  * limitations under the License.
>
>  */
>
> Server {
>
>   com.sun.security.auth.module.Krb5LoginModule required
>
>   useKeyTab=true
>
>   keyTab="/etc/zookeeper.keytab"
>
>   storeKey=true
>
>   useTicketCache=false
>
>   principal="zookeeper/@PGS.dev";
>
> };
>
>
2. I ran zkCli.sh after authenticating as the Kerberos principal
zookeeper/@PGS.dev and got the following:

getAcl /hbase
>
> 'world,'anyone
>
> : r
>
> 'sasl,'hbase
>
> : cdrwa
>
>
3. I was logged in as the Kerberos principal hbase/@PGS.dev when I ran
the add_peer command.
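
For reference, an add_peer call of the kind discussed here usually follows the
hbase shell form sketched below; the peer id and ZooKeeper quorum are placeholders
rather than real values, and the exact syntax should be checked against the 1.4
shell help:

    # hbase shell on the source cluster (placeholder peer id and quorum)
    add_peer '1', CLUSTER_KEY => "zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase"
    list_peers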

Thanks for taking the time to help me in any way you can.


Saad


On Wed, May 23, 2018 at 7:24 AM, Reid Chan <reidddc...@outlook.com> wrote:

> Three places to check,
>
>
>   1.  Would you mind showing your "/etc/zookeeper/conf/server-jaas.conf",
>
> 2. and using zkCli.sh to getAcl /hbase.
> 3. BTW, what was your login principal when executing "add_peer" in
> hbase shell.
> 
> From: Saad Mufti <saad.mu...@gmail.com>
> Sent: 23 May 2018 01:48:17
> To: user@hbase.apache.org
> Subject: HBase Replication Between Two Secure Clusters With Different
> Kerberos KDC's
>
> Hi,
>
> Here is my scenario, I have two secure/authenticated EMR based HBase
> clusters, both have their own cluster dedicated KDC (using EMR support for
> this which means we get Kerberos support by just turning on a config flag).
>
> Now we want to get replication going between them. For other application
> reasons, we want both clusters to have the same Kerberos realm, let's say
> APP.COM, so Kerberos principals are like a...@app.com .
>
> I looked around the web and found the instructions at
> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/
> bk_hadoop-high-availability/content/hbase-cluster-
> replication-among-kerberos-secured-clusters.html
> so I tried to follow these directions. Of course the instructions are for
> replication between clusters with different realms, so I adapted this by
> adding only one principal "krbtgt/app@app.com" and gave it some
> arbitrary password. Followed the rest of the directions as well to pass a
> rule property to Zookeeper and the requisite Hadoop property in
> core-site.xml .
>
> After all this, when I set up replication from cluster1 to cluster2 using
> add_peer, I see error messages in the region servers for cluster1 of the
> following form:
>
>
>
> > 2018-05-22 17:27:45,763 INFO  [main-SendThread(xxx.net:2181)]
> > zookeeper.ClientCnxn: Opening socket connection to server i
> >
> > p-10-194-247-88.aolp-ds-dev.us-east-1.ec2.aolcloud.net/xxx.yyy.zzz:2181.
> > Will attempt to SASL-authenticate using Login Context section 'Client'
> >
> > 2018-05-22 17:27:45,764 INFO  [main-SendThread(xxx.net:2181)]
> > zookeeper.ClientCnxn: Socket connection established to ip-1
> >
> > 0-194-247-88.aolp-ds-dev.us-east-1.ec2.aolcloud.net/xxx.yyy.zzz:2181,
> > initiating session
> >
> > 2018-05-22 17:27:45,777 INFO  [main-SendThread(xxx.net:2181)]
> > zookeeper.ClientCnxn: Session establishment complete on ser
> >
> > ver xxx.net/xxx.yyy.zzz:2181, sessionid = 0x16388599b300215, negotiated
> > timeout = 4
> > 2018-05-22 17:27:45,779 ERROR [main-SendThread(xxx.net:2181)]
> > client.ZooKeeperSaslClient: An error: (java.security.Privil
> >
> > egedActionException: javax.security.sasl.SaslException: GSS initiate
> > failed [Caused by GSSException: No valid credentials p

Re: Re: Got Duplicate Records for the Same Row Key from a Snapshot

2018-05-22 Thread Saad Mufti
I am not clear how your snapshot even succeeds if this is the case. The
snapshot-taking procedure includes a consistency check at the end and
throws an exception on problems like this. I would run an hbck command on
your table to check whether there are any consistency errors. It also has
repair options, but you have to be careful with those; running it in
check-only mode doesn't change anything and will give you useful feedback.
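
For example, a check-only run looks roughly like the following (the table name is
a placeholder); without any repair flags, hbck only reports inconsistencies:

    $ hbase hbck -details my_table    # read-only: reports problems, changes nothing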

Hope this helps.


Saad


On Fri, May 18, 2018 at 3:56 AM, shanghaihyj  wrote:

> We find that the metadata of offline regions are included in the snapshot.
>
>
> When we query a table, offline regions are not considered.
> When we query a snapshot of this table, offline regions are included.
> These offline regions refer to the same data in HDFS.  That is why
> duplicate records are returned from the snapshot.
>
>
> Any suggestion how to handle this gracefully ?
>
>
>
> At 2018-05-17 19:04:17, "shanghaihyj"  wrote:
> >We are loading data from the HBase table or its snapshot by hbase-rdd (
> https://github.com/unicredit/hbase-rdd). It uses TableInputFormat /
> TableSnapshotInputFormat as the underlying input format.
> >The scanner has max versions set to 1.
> >
> >
> >
> >At 2018-05-17 15:35:08, "shanghaihyj"  wrote:
> >
> >When we query a table by a particular row key, there is only one row
> returned by HBase, which is expected.
> >However, when we query a snapshot for that same table, by the same
> particular row key, five duplicate rows are returned.  Why ?
> >
> >
> >
> >
> >In the log of the master server, we see some snapshot-related error:
> >= ERROR START =
> >ERROR [master:sh-bs-3-b8-namenode-17-208:6.archivedHFileCleaner]
> snapshot.SnapshotHFileCleaner: Exception while checking if files were
> valid, keeping them just in case.
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7:org.
> apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read
> snapshot info from:hdfs://master1.hh:8020/hbase/.hbase-snapshot/.tmp/hb_
> anchor_original_total_7days_stat_1526423587063/.snapshotinfo
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.
> readSnapshotInfo(SnapshotDescriptionUtils.java:325)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(
> SnapshotReferenceUtil.java:328)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.
> filesUnderSnapshot(SnapshotHFileCleaner.java:85)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.
> getSnapshotsInProgress(SnapshotFileCache.java:303)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.
> getUnreferencedFiles(SnapshotFileCache.java:194)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.
> getDeletableFiles(SnapshotHFileCleaner.java:62)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(
> CleanerChore.java:233)
> >./hbase-root-master-sh-bs-3-b8-namenode-17-208.log.7-   at
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(
> CleanerChore.java:157)
> >...
> >= ERROR END =
> >And we find a related issue for this error: https://issues.apache.org/
> jira/browse/HBASE-16464?attachmentSortBy=fileName
> >
> >
> >However, there is no proof that the error in the log is related to our
> problem of having duplicate records from a snapshot.
> >Our HBase version is 0.98.18-hadoop2.
> >
> >
> >Could you help give some hint why we are having duplicate records from
> the snapshot ?
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>


Re: Spark UNEVENLY distributing data

2018-05-22 Thread Saad Mufti
I think TableInputFormat will try to maintain as much locality as possible,
assigning one Spark partition per region and trying to assign that
partition to a YARN container/executor on the same node (assuming you're
using Spark over YARN). So the reason for the uneven distribution could be
that your HBase is not balanced to begin with and has too many regions on
the same region server, corresponding to your largest bar. It all depends on
which HBase balancer you have configured and how it is tuned. Assuming that is
properly configured, try to balance your HBase cluster before running the
Spark job. There are commands in the hbase shell to do it manually if required.
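
For example, a rough sketch of the relevant hbase shell commands (the region and
server names are placeholders):

    balance_switch true      # make sure the balancer is enabled
    balancer                 # ask the master to run a balancing pass now
    status 'simple'          # check region counts per region server
    move 'ENCODED_REGION_NAME', 'host,port,startcode'   # move a single region by hand if needed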

Hope this helps.


Saad


On Sat, May 19, 2018 at 6:40 PM, Alchemist 
wrote:

> I am trying to parallelize a simple Spark program that processes HBase data in
> parallel.
>
> // Get HBase RDD
> JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
> .newAPIHadoopRDD(conf, TableInputFormat.class,
> ImmutableBytesWritable.class, Result.class);
> long count = hBaseRDD.count();
>
> Only two lines I see in the logs.  Zookeeper starts and Zookeeper stops
>
>
> The problem is my program is as SLOW as the largest bar. I found that ZK is
> taking a long time before shutting down.
> 18/05/19 17:26:55 INFO zookeeper.ClientCnxn: Session establishment complete 
> on server :2181, sessionid = 0x163662b64eb046d, negotiated timeout = 4 
> 18/05/19
> 17:38:00 INFO zookeeper.ZooKeeper: Session: 0x163662b64eb046d closed
>
>
>
>


HBase Replication Between Two Secure Clusters With Different Kerberos KDC's

2018-05-22 Thread Saad Mufti
Hi,

Here is my scenario, I have two secure/authenticated EMR based HBase
clusters, both have their own cluster dedicated KDC (using EMR support for
this which means we get Kerberos support by just turning on a config flag).

Now we want to get replication going between them. For other application
reasons, we want both clusters to have the same Kerberos realm, let's say
APP.COM, so Kerberos principals are like a...@app.com .

I looked around the web and found the instructions at
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_hadoop-high-availability/content/hbase-cluster-replication-among-kerberos-secured-clusters.html
so I tried to follow these directions. Of course the instructions are for
replication between clusters with different realms, so I adapted this by
adding only one principal "krbtgt/app@app.com" and gave it some
arbitrary password. Followed the rest of the directions as well to pass a
rule property to Zookeeper and the requisite Hadoop property in
core-site.xml .

After all this, when I set up replication from cluster1 to cluster2 using
add_peer, I see error messages in the region servers for cluster1 of the
following form:



> 2018-05-22 17:27:45,763 INFO  [main-SendThread(xxx.net:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-10-194-247-88.aolp-ds-dev.us-east-1.ec2.aolcloud.net/xxx.yyy.zzz:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
>
> 2018-05-22 17:27:45,764 INFO  [main-SendThread(xxx.net:2181)] zookeeper.ClientCnxn: Socket connection established to ip-10-194-247-88.aolp-ds-dev.us-east-1.ec2.aolcloud.net/xxx.yyy.zzz:2181, initiating session
>
> 2018-05-22 17:27:45,777 INFO  [main-SendThread(xxx.net:2181)] zookeeper.ClientCnxn: Session establishment complete on server xxx.net/xxx.yyy.zzz:2181, sessionid = 0x16388599b300215, negotiated timeout = 4
> 2018-05-22 17:27:45,779 ERROR [main-SendThread(xxx.net:2181)] client.ZooKeeperSaslClient: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED state.
>
> 2018-05-22 17:27:45,779 ERROR [main-SendThread(xxx.net:2181)] zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's  received SASL token. Zookeeper Client will go to AUTH_FAILED state.
>
> 2018-05-22 17:28:12,574 WARN  [main-EventThread] zookeeper.ZKUtil: hconnection-0x4dcc1d33-0x16388599b300215, quorum=xyz.net:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
>
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
>
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:123)
>
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1102)
>
> at
>
>
The Zookeeper start command looks like the following:

/usr/lib/jvm/java-openjdk/bin/java -Dzookeeper.log.dir=/var/log/zookeeper
> -Dzookeeper.root.logger=INFO,ROLLINGFILE -cp /usr/lib/zookeeper/bin/../build/classes:/usr/lib/zookeeper/bin/../build/lib/*.jar:/usr/lib/zookeeper/bin/../lib/slf4j-log4j12-1.6.1.jar:/usr/lib/zookeeper/bin/../lib/slf4j-api-1.6.1.jar:/usr/lib/zookeeper/bin/../lib/netty-3.10.5.Final.jar:/usr/lib/zookeeper/bin/../lib/log4j-1.2.16.jar:/usr/lib/zookeeper/bin/../lib/jline-2.11.jar:/usr/lib/zookeeper/bin/../zookeeper-3.4.10.jar:/usr/lib/zookeeper/bin/../src/java/lib/*.jar:/etc/zookeeper/conf::/etc/zookeeper/conf:/usr/lib/zookeeper/*:/usr/lib/zookeeper/lib/*
> -Djava.security.auth.login.config=/etc/zookeeper/conf/server-jaas.conf
> -Dzookeeper.security.auth_to_local=RULE:[2:\$1@\$0](.*@\QAPP.COM\E$)s/@\APP.COM\E$//DEFAULT -zookeeper.log.threshold=INFO
> -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false
> org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg
>
>
The property in core-site.xml looks like the following:

> <property>
>   <name>hadoop.security.auth_to_local</name>
>   <value>RULE:[2:\$1@\$0](.*@\\QPGS.dev\\E$)s/@\\QPGS.dev\\E$//DEFAULT</value>
> </property>
>
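
As an aside, a mapping rule like the one above can be sanity-checked with Hadoop's
built-in rule resolver; the principal below is a made-up example:

    $ hadoop org.apache.hadoop.security.HadoopKerberosName zookeeper/zk1.example.com@PGS.dev
    # prints the short name the configured auth_to_local rules map this principal to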



At this point I am not clear how I can get the added Kerberos principal  "
krbtgt/app@app.com" (in both clusters' Kerberos KDC's) to be
authenticated against and for 

Re: Anyone Have A Workaround For HBASE-19681?

2018-03-26 Thread Saad Mufti
Restarting the region server worked for us to recover from this error.


Saad


On Fri, Mar 23, 2018 at 7:19 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> We are facing the exact same symptoms in HBase 1.4.0 running on AWS EMR
> based cluster, and desperately need to take a snapshot to feed a downstream
> job. So far we have tried using the "assign" command on all regions
> involved to move them around but the snapshot still fails. Also saw the
> same error earlier in a compaction thread on the same missing file.
>
> Is there any way we can recover this db? We ran hbck -details and it
> reported no errors.
>
> Thanks.
>
> 
> Saad
>
>


Re: Should Taking A Snapshot Work Even If Balancer Is Moving A Few Regions Around?

2018-03-23 Thread Saad Mufti
Thanks.


Saad


On Wed, Mar 21, 2018 at 3:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Looking at
> hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java in
> branch-1.4 :
>
>   boolean[] setSplitOrMergeEnabled(final boolean enabled, final boolean
> synchronous,
>final MasterSwitchType... switchTypes)
> throws IOException;
>
>   boolean isSplitOrMergeEnabled(final MasterSwitchType switchType) throws
> IOException;
>
> Please also see the following script:
>
> hbase-shell/src/main/ruby/shell/commands/splitormerge_switch.rb
>
> FYI
>
> On Wed, Mar 21, 2018 at 11:33 AM, Vladimir Rodionov <
> vladrodio...@gmail.com>
> wrote:
>
> > >>So my question is whether taking a snapshot is supposed to work even
> with
> > >>regions being moved around. In our case it is usually only a couple
> here
> > >>and there.
> >
> > No, if region was moved, split or merged during snapshot operation -
> > snapshot will fail.
> > This is why taking snapshots on a large table is a 50/50 game.
> >
> > Disabling balancer,region merging and split before snapshot should help.
> > This works in 2.0
> >
> > Not sure if merge/split switch is available in 1.4
> >
> > -Vlad
> >
> > On Tue, Mar 20, 2018 at 8:00 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > We are using HBase 1.4.0 on AWS EMR based Hbase. Since snapshots are in
> > S3,
> > > they take much longer than when using local disk. We have a cron script
> > to
> > > take regular snapshots as backup, and they fail quite often on our
> > largest
> > > table which takes close to an hour to complete the snapshot.
> > >
> > > The only thing I have noticed in the errors usually is a message about
> > the
> > > region moving or closing.
> > >
> > > So my question is whether taking a snapshot is supposed to work even
> with
> > > regions being moved around. In our case it is usually only a couple
> here
> > > and there.
> > >
> > > Thanks.
> > >
> > > 
> > > Saad
> > >
> >
>
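
A minimal hbase shell sketch of the sequence Vlad describes, assuming the
splitormerge_switch command is present in your build (table and snapshot names are
placeholders):

    balance_switch false                  # keep the balancer from moving regions
    splitormerge_switch 'SPLIT', false    # disable region splits
    splitormerge_switch 'MERGE', false    # disable region merges
    snapshot 'my_table', 'my_table-backup'
    splitormerge_switch 'SPLIT', true     # restore normal operation afterwards
    splitormerge_switch 'MERGE', true
    balance_switch true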


Re: Should Taking A Snapshot Work Even If Balancer Is Moving A Few Regions Around?

2018-03-23 Thread Saad Mufti
Thanks.


Saad

On Wed, Mar 21, 2018 at 2:33 PM, Vladimir Rodionov <vladrodio...@gmail.com>
wrote:

> >>So my question is whether taking a snapshot is supposed to work even with
> >>regions being moved around. In our case it is usually only a couple here
> >>and there.
>
> No, if region was moved, split or merged during snapshot operation -
> snapshot will fail.
> This is why taking snapshots on a large table is a 50/50 game.
>
> Disabling balancer,region merging and split before snapshot should help.
> This works in 2.0
>
> Not sure if merge/split switch is available in 1.4
>
> -Vlad
>
> On Tue, Mar 20, 2018 at 8:00 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > We are using HBase 1.4.0 on AWS EMR based Hbase. Since snapshots are in
> S3,
> > they take much longer than when using local disk. We have a cron script
> to
> > take regular snapshots as backup, and they fail quite often on our
> largest
> > table which takes close to an hour to complete the snapshot.
> >
> > The only thing I have noticed in the errors usually is a message about
> the
> > region moving or closing.
> >
> > So my question is whether taking a snapshot is supposed to work even with
> > regions being moved around. In our case it is usually only a couple here
> > and there.
> >
> > Thanks.
> >
> > 
> > Saad
> >
>


Anyone Have A Workaround For HBASE-19681?

2018-03-23 Thread Saad Mufti
We are facing the exact same symptoms in HBase 1.4.0 running on an AWS
EMR-based cluster, and desperately need to take a snapshot to feed a downstream
job. So far we have tried using the "assign" command on all the regions
involved to move them around, but the snapshot still fails. We also saw the
same error earlier in a compaction thread, on the same missing file.

Is there any way we can recover this db? We ran hbck -details and it
reported no errors.
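
For context, the manual steps referred to above look roughly like this; the
encoded region name is a placeholder taken from the master UI:

    # in the hbase shell: force re-assignment of a region by its encoded name
    assign 'a1b2c3d4e5f60718293a4b5c6d7e8f90'

    # from the OS shell: read-only consistency check
    $ hbase hbck -details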

Thanks.


Saad


Re: Balance Regions Faster

2018-03-20 Thread Saad Mufti
Thanks, will take a look. We're using HBase 1.4.0 on AWS EMR.

Cheers.


Saad


On Tue, Mar 20, 2018 at 9:32 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Saad:
> You didn't mention the version of hbase you are using.
> Please check the version to see if the following were included:
>
> HBASE-18164 Fast locality computation in balancer
> HBASE-16570 Compute region locality in parallel at startup
> HBASE-15515 Improve LocalityBasedCandidateGenerator in Balancer
>
> Cheers
>
> On Tue, Mar 20, 2018 at 6:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Please consider tuning the following parameters of stochastic load
> > balancer :
> >
> > "hbase.master.balancer.stochastic.maxRunningTime"
> >
> > default value is 30 seconds. It controls the duration of runtime for
> each balanceCluster()
> > call.
> >
> > "hbase.balancer.period"
> >
> > default is 300 seconds. It controls the maximum time master runs balancer
> > for.
> >
> > You can turn on DEBUG logging and observe the following output in master
> > log:
> >
> > balancer.StochasticLoadBalancer: Finished computing new load balance
> plan.  Computation took 1200227ms to try 2254 different iterations.  Found
> a solution that moves 550 regions; Going from a computed cost of
> 77.52829271038965 to a new cost of 74.32764924425548
> >
> > If you have a dev cluster, you can try different combinations of the
> above
> > two parameters and get best performance by checking the above log.
> >
> > Cheers
> >
> > On Tue, Mar 20, 2018 at 5:18 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> We are using the stochastic load balancer, and have tuned it to do a
> >> maximum of 1% of regions in any calculation. But it is way too
> >> conservative
> >> after that, it moves one region at a time. Is there a way to tell it to
> go
> >> faster with whatever number of regions it decided to do? I have been
> >> looking at the settings in the code but so far a little confused about
> >> which exact setting will achieve this. Is it one of the steps settings?
> >>
> >> Also our cluster was in bad shape for a while as a bunch of region
> servers
> >> aborted for different reasons. When we stopped everything and brought it
> >> back up and enabled the tables, most of the region servers were assigned
> >> no
> >> regions. Is it for locality reasons that HBase is trying to assign
> regions
> >> where they were assigned before? Is there a way to tell HBase to ignore
> >> that on startup?
> >>
> >> Thanks.
> >>
> >> 
> >> Saad
> >>
> >
> >
>
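
For reference, the two knobs Ted mentions would appear in hbase-site.xml roughly as
below; both values are in milliseconds and are only illustrative, not recommendations:

    <property>
      <name>hbase.master.balancer.stochastic.maxRunningTime</name>
      <value>60000</value>   <!-- per balanceCluster() call; default 30 seconds -->
    </property>
    <property>
      <name>hbase.balancer.period</name>
      <value>300000</value>  <!-- default 300 seconds; see Ted's description above -->
    </property>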


Should Taking A Snapshot Work Even If Balancer Is Moving A Few Regions Around?

2018-03-20 Thread Saad Mufti
Hi,

We are using HBase 1.4.0 on an AWS EMR-based HBase cluster. Since snapshots are in
S3, they take much longer than when using local disk. We have a cron script to
take regular snapshots as backup, and they fail quite often on our largest
table which takes close to an hour to complete the snapshot.

The only thing I have noticed in the errors usually is a message about the
region moving or closing.

So my question is whether taking a snapshot is supposed to work even with
regions being moved around. In our case it is usually only a couple here
and there.

Thanks.


Saad


Balance Regions Faster

2018-03-20 Thread Saad Mufti
Hi,

We are using the stochastic load balancer, and have tuned it to move a
maximum of 1% of regions in any calculation. But it is way too conservative
after that: it moves one region at a time. Is there a way to tell it to go
faster with whatever number of regions it decided to move? I have been
looking at the settings in the code but am so far a little confused about
which exact setting will achieve this. Is it one of the steps settings?
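
If it helps, the knobs usually meant by the "steps" settings are, to the best of my
recollection, the ones below; they control how many candidate moves the stochastic
balancer evaluates when computing a plan (not how fast the resulting moves are
executed), and both the names and the illustrative values should be verified
against the 1.4.0 StochasticLoadBalancer source:

    <property>
      <name>hbase.master.balancer.stochastic.maxSteps</name>
      <value>1000000</value>   <!-- upper bound on candidate moves evaluated per run -->
    </property>
    <property>
      <name>hbase.master.balancer.stochastic.stepsPerRegion</name>
      <value>800</value>       <!-- scales the number of candidate moves with region count -->
    </property>
    <property>
      <name>hbase.master.balancer.stochastic.maxMovePercent</name>
      <value>0.01</value>      <!-- the "1% of regions" cap mentioned above -->
    </property>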

Also our cluster was in bad shape for a while as a bunch of region servers
aborted for different reasons. When we stopped everything and brought it
back up and enabled the tables, most of the region servers were assigned no
regions. Is it for locality reasons that HBase is trying to assign regions
where they were assigned before? Is there a way to tell HBase to ignore
that on startup?

Thanks.


Saad


Re: CorruptedSnapshotException Taking Snapshot Of Table With Large Number Of Files

2018-03-19 Thread Saad Mufti
Thanks, I tried briefly but maybe I didn't do quite the right search. In
any case, thanks for the help.


Saad


On Mon, Mar 19, 2018 at 2:50 PM, Huaxiang Sun <h...@cloudera.com> wrote:

> You can google search the exception stack and mostly it will find the JIRA.
>
> Regards,
>
> Huaxiang
>
> > On Mar 19, 2018, at 10:52 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > Thanks!!! Wish that was documented somewhere in the manual.
> >
> > Cheers.
> >
> > 
> > Saad
> >
> >
> > On Mon, Mar 19, 2018 at 1:38 PM, Huaxiang Sun <h...@cloudera.com> wrote:
> >
> >> Mostly it is due to HBASE-15430 <https://issues.apache.org/
> >> jira/browse/HBASE-15430>, “snapshot.manifest.size.limit” needs to be
> >> configured as 64MB or 128MB.
> >>
> >> Regards,
> >>
> >> Huaxiang Sun
> >>
> >>
> >>> On Mar 19, 2018, at 10:16 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We are running on HBase 1.4.0 on an AWS EMR/HBase cluster.
> >>>
> >>> We have started seeing the following stacktrace when trying to take a
> >>> snapshot of a table with a very large number of files (12000 regions
> and
> >>> roughly 36 - 40 files). The number of files should go down as
> we
> >>> haven't been compacting for a while for other operational reasons and
> are
> >>> now running it. But I'd to understand why our snapshots are failing
> with
> >>> the following:
> >>>
> >>> 2018-03-19 16:05:56,948 ERROR
> >>>> [MASTER_TABLE_OPERATIONS-ip-10-194-208-6:16000-0]
> >>>> snapshot.TakeSnapshotHandler: Failed taking snapshot {
> >>>> ss=pgs-device.03-19-2018-15 table=pgs-device type=SKIPFLUSH } due to
> >>>> exception:unable to parse data manifest Protocol message was too
> >> large.  May
> >>>> be malicious.  Use CodedInputStream.setSizeLimit() to increase the
> size
> >>>> limit.
> >>>>
> >>>> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: unable
> to
> >>>> parse data manifest Protocol message was too large.  May be malicious.
> >> Use
> >>>> CodedInputStream.setSizeLimit() to increase the size limit.
> >>>>
> >>>>   at
> >>>> org.apache.hadoop.hbase.snapshot.SnapshotManifest.readDataManifest(
> >> SnapshotManifest.java:468)
> >>>>
> >>>>   at
> >>>> org.apache.hadoop.hbase.snapshot.SnapshotManifest.
> >> load(SnapshotManifest.java:297)
> >>>>
> >>>>   at
> >>>> org.apache.hadoop.hbase.snapshot.SnapshotManifest.
> >> open(SnapshotManifest.java:129)
> >>>>
> >>>>   at
> >>>> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.
> >> verifySnapshot(MasterSnapshotVerifier.java:108)
> >>>>
> >>>>   at
> >>>> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(
> >> TakeSnapshotHandler.java:203)
> >>>>
> >>>>   at
> >>>> org.apache.hadoop.hbase.executor.EventHandler.run(
> >> EventHandler.java:129)
> >>>>
> >>>>   at
> >>>> java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1149)
> >>>>
> >>>>   at
> >>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:624)
> >>>>
> >>>>   at java.lang.Thread.run(Thread.java:748)
> >>>>
> >>>> Caused by: com.google.protobuf.InvalidProtocolBufferException:
> Protocol
> >>>> message was too large.  May be malicious.  Use
> >>>> CodedInputStream.setSizeLimit() to increase the size limit.
> >>>>
> >>>>   at
> >>>> com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(
> >> InvalidProtocolBufferException.java:110)
> >>>>
> >>>>   at
> >>>> com.google.protobuf.CodedInputStream.refillBuffer(
> >> CodedInputStream.java:755)
> >>>>
> >>>>   at
> >>>> com.google.protobuf.CodedInputStream.readRawBytes(
> >> CodedInputStream.java:811)
> >>>>
> >>>>   at
> >>>

Re: CorruptedSnapshotException Taking Snapshot Of Table With Large Number Of Files

2018-03-19 Thread Saad Mufti
Thanks!!! Wish that was documented somewhere in the manual.

Cheers.


Saad
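
For anyone finding this thread later: a sketch of how the setting Huaxiang points to
below would look in hbase-site.xml; the value is 64 MB expressed in bytes and is
only illustrative:

    <property>
      <name>snapshot.manifest.size.limit</name>
      <value>67108864</value>  <!-- 64 MB; raises the protobuf size cap used when reading the snapshot data manifest -->
    </property>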


On Mon, Mar 19, 2018 at 1:38 PM, Huaxiang Sun <h...@cloudera.com> wrote:

> Mostly it is due to HBASE-15430 <https://issues.apache.org/
> jira/browse/HBASE-15430>, “snapshot.manifest.size.limit” needs to be
> configured as 64MB or 128MB.
>
> Regards,
>
> Huaxiang Sun
>
>
> > On Mar 19, 2018, at 10:16 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > Hi,
> >
> > We are running on HBase 1.4.0 on an AWS EMR/HBase cluster.
> >
> > We have started seeing the following stacktrace when trying to take a
> > snapshot of a table with a very large number of files (12000 regions and
> > roughly 36 - 40 files). The number of files should go down as we
> > haven't been compacting for a while for other operational reasons and are
> > now running it. But I'd to understand why our snapshots are failing with
> > the following:
> >
> > 2018-03-19 16:05:56,948 ERROR
> >> [MASTER_TABLE_OPERATIONS-ip-10-194-208-6:16000-0]
> >> snapshot.TakeSnapshotHandler: Failed taking snapshot {
> >> ss=pgs-device.03-19-2018-15 table=pgs-device type=SKIPFLUSH } due to
> >> exception:unable to parse data manifest Protocol message was too
> large.  May
> >> be malicious.  Use CodedInputStream.setSizeLimit() to increase the size
> >> limit.
> >>
> >> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: unable to
> >> parse data manifest Protocol message was too large.  May be malicious.
> Use
> >> CodedInputStream.setSizeLimit() to increase the size limit.
> >>
> >>at
> >> org.apache.hadoop.hbase.snapshot.SnapshotManifest.readDataManifest(
> SnapshotManifest.java:468)
> >>
> >>at
> >> org.apache.hadoop.hbase.snapshot.SnapshotManifest.
> load(SnapshotManifest.java:297)
> >>
> >>at
> >> org.apache.hadoop.hbase.snapshot.SnapshotManifest.
> open(SnapshotManifest.java:129)
> >>
> >>at
> >> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.
> verifySnapshot(MasterSnapshotVerifier.java:108)
> >>
> >>at
> >> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(
> TakeSnapshotHandler.java:203)
> >>
> >>at
> >> org.apache.hadoop.hbase.executor.EventHandler.run(
> EventHandler.java:129)
> >>
> >>at
> >> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1149)
> >>
> >>at
> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:624)
> >>
> >>at java.lang.Thread.run(Thread.java:748)
> >>
> >> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
> >> message was too large.  May be malicious.  Use
> >> CodedInputStream.setSizeLimit() to increase the size limit.
> >>
> >>at
> >> com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(
> InvalidProtocolBufferException.java:110)
> >>
> >>at
> >> com.google.protobuf.CodedInputStream.refillBuffer(
> CodedInputStream.java:755)
> >>
> >>at
> >> com.google.protobuf.CodedInputStream.readRawBytes(
> CodedInputStream.java:811)
> >>
> >>at
> >> com.google.protobuf.CodedInputStream.readBytes(
> CodedInputStream.java:329)
> >>
> >>at
> >> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$
> SnapshotRegionManifest$StoreFile.<init>(SnapshotProtos.java:1313)
> >>
> >>at
> >> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$
> SnapshotRegionManifest$StoreFile.<init>(SnapshotProtos.java:1263)
> >>
> >>at
> >> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$
> SnapshotRegionManifest$StoreFile$1.parsePartialFrom(
> SnapshotProtos.java:1364)
> >>
> >>at
> >> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$
> SnapshotRegionManifest$StoreFile$1.parsePartialFrom(
> SnapshotProtos.java:1359)
> >>
> >>at
> >> com.google.protobuf.CodedInputStream.readMessage(
> CodedInputStream.java:309)
> >>
> >>at
> >> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$
> SnapshotRegionManifest$FamilyFiles.<init>(SnapshotProtos.java:2161)
> >>
> >>at
> >> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$
> Sna

CorruptedSnapshotException Taking Snapshot Of Table With Large Number Of Files

2018-03-19 Thread Saad Mufti
Hi,

We are running on HBase 1.4.0 on an AWS EMR/HBase cluster.

We have started seeing the following stacktrace when trying to take a
snapshot of a table with a very large number of files (12000 regions and
roughly 36 - 40 files). The number of files should go down as we
haven't been compacting for a while for other operational reasons and are
now running it. But I'd like to understand why our snapshots are failing with
the following:

2018-03-19 16:05:56,948 ERROR
> [MASTER_TABLE_OPERATIONS-ip-10-194-208-6:16000-0]
> snapshot.TakeSnapshotHandler: Failed taking snapshot {
> ss=pgs-device.03-19-2018-15 table=pgs-device type=SKIPFLUSH } due to
> exception:unable to parse data manifest Protocol message was too large.  May
> be malicious.  Use CodedInputStream.setSizeLimit() to increase the size
> limit.
>
> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: unable to
> parse data manifest Protocol message was too large.  May be malicious.  Use
> CodedInputStream.setSizeLimit() to increase the size limit.
>
> at
> org.apache.hadoop.hbase.snapshot.SnapshotManifest.readDataManifest(SnapshotManifest.java:468)
>
> at
> org.apache.hadoop.hbase.snapshot.SnapshotManifest.load(SnapshotManifest.java:297)
>
> at
> org.apache.hadoop.hbase.snapshot.SnapshotManifest.open(SnapshotManifest.java:129)
>
> at
> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:108)
>
> at
> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:203)
>
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
> at java.lang.Thread.run(Thread.java:748)
>
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
> message was too large.  May be malicious.  Use
> CodedInputStream.setSizeLimit() to increase the size limit.
>
> at
> com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
>
> at
> com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
>
> at
> com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
>
> at
> com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$StoreFile.<init>(SnapshotProtos.java:1313)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$StoreFile.<init>(SnapshotProtos.java:1263)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$StoreFile$1.parsePartialFrom(SnapshotProtos.java:1364)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$StoreFile$1.parsePartialFrom(SnapshotProtos.java:1359)
>
> at
> com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$FamilyFiles.<init>(SnapshotProtos.java:2161)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$FamilyFiles.<init>(SnapshotProtos.java:2103)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$FamilyFiles$1.parsePartialFrom(SnapshotProtos.java:2197)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$FamilyFiles$1.parsePartialFrom(SnapshotProtos.java:2192)
>
> at
> com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest.<init>(SnapshotProtos.java:1165)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest.<init>(SnapshotProtos.java:1094)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$1.parsePartialFrom(SnapshotProtos.java:1201)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotRegionManifest$1.parsePartialFrom(SnapshotProtos.java:1196)
>
> at
> com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotDataManifest.<init>(SnapshotProtos.java:3858)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotDataManifest.<init>(SnapshotProtos.java:3792)
>
> at
> org.apache.hadoop.hbase.protobuf.generated.SnapshotProtos$SnapshotDataManifest$1.parsePartialFrom(SnapshotProtos.java:3894)
>
> at
> 

Re: Scan problem

2018-03-19 Thread Saad Mufti
Another option, if you have enough disk space/off-heap memory, is to
enable the bucket cache to cache even more of your data, and set the
PREFETCH_ON_OPEN => true option on the column families you want to always
cache. That way HBase will prefetch your data into the bucket cache and
your scan won't have that initial slowdown. Or, if you want to do it
globally for all column families, set the configuration flag
"hbase.rs.prefetchblocksonopen" to "true". Keep in mind though that if you
do this, you should have enough bucket cache space for all your
data; otherwise there will be a lot of useless eviction activity at HBase
startup and even later.
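
A short sketch of both options; the table and family names are placeholders, and
the column-family attribute key is, if I recall correctly, spelled
PREFETCH_BLOCKS_ON_OPEN in HColumnDescriptor, so verify against your version:

    # hbase shell: enable prefetch-on-open for a single column family
    alter 'my_table', { NAME => 'cf1', PREFETCH_BLOCKS_ON_OPEN => 'true' }

    # hbase-site.xml: enable it globally for all files as they are opened
    <property>
      <name>hbase.rs.prefetchblocksonopen</name>
      <value>true</value>
    </property>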

Where a region is located will also be heavily impacted by which
region balancer you have chosen and how you have tuned it, in terms of how
often it runs and other parameters. A split region will initially stay at
least on the same region server, but your balancer, if and when it runs, can move
it (and indeed any region) elsewhere to satisfy its criteria.

Cheers.


Saad


On Mon, Mar 19, 2018 at 1:14 AM, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> Hi
>
> First regarding the scans,
>
> Generally the data resides in the store files which is in HDFS. So probably
> the first scan that you are doing is reading from HDFS which involves disk
> reads. Once the blocks are read, they are cached in the Block cache of
> HBase. So your further reads go through that and hence you see further
> speed up in the scans.
>
> >> And another question about region splits: I want to know which RegionServer
> >> will load the new region after the split.
> >> Will it be the same one that hosted the old region?
> Yes . Generally same region server hosts it.
>
> In master the code is here,
> https://github.com/apache/hbase/blob/master/hbase-
> server/src/main/java/org/apache/hadoop/hbase/master/assignment/
> SplitTableRegionProcedure.java
>
> You may need to understand the entire flow to know how the regions are
> opened after a split.
>
> Regards
> Ram
>
> On Sat, Mar 17, 2018 at 9:02 PM, Yang Zhang 
> wrote:
>
> > Hello everyone
> >
> > I try to do many Scans using RegionScanner in a coprocessor, and every
> > time, the first Scan costs about 10 times more than the others.
> > I don't know why this happens
> >
> > OneBucket Scan cost is : 8794 ms Num is : 710
> > OneBucket Scan cost is : 91 ms Num is : 776
> > OneBucket Scan cost is : 87 ms Num is : 808
> > OneBucket Scan cost is : 105 ms Num is : 748
> > OneBucket Scan cost is : 68 ms Num is : 200
> >
> >
> > And another question about region splits: I want to know which RegionServer
> > will load the new region after the split.
> > Will it be the same one that hosted the old region?  Does anyone know where I can
> > find the code to learn about that?
> >
> >
> > Thanks for your help
> >
>


Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

2018-03-12 Thread Saad Mufti
Thanks, will do that.


Saad


On Mon, Mar 12, 2018 at 12:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Saad:
> I encourage you to open an HBase JIRA outlining your use case and the
> config knobs you added through a patch.
>
> We can see the details for each config and make recommendation accordingly.
>
> Thanks
>
> On Mon, Mar 12, 2018 at 8:43 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > I have create a company specific branch and added 4 new flags to control
> > this behavior, these gave us a huge performance boost when running Spark
> > jobs on snapshots of very large tables in S3. I tried to do everything
> > cleanly but
> >
> > a) not being familiar with the whole test strategies I haven't had time
> to
> > add any useful tests, though of course I left the default behavior the
> > same, and a lot of the behavior I control wit these flags only affect
> > performance, not the final result, so I would need some pointers on how
> to
> > add useful tests
> > b) I added a new flag to be an overall override for prefetch behavior
> that
> > overrides any setting even in the column family descriptor, not sure if
> > what I did was entirely in the spirit of what HBase does
> >
> > Again these if used properly would only impact jobs using
> > TableSnapshotInputFormat in their Spark or M-R jobs. Would someone from
> the
> > core team be willing to look at my patch? I have never done this before,
> so
> > would appreciate a quick pointer on how to send a patch and get some
> quick
> > feedback.
> >
> > Cheers.
> >
> > 
> > Saad
> >
> >
> >
> > On Sat, Mar 10, 2018 at 9:56 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >
> > > The question remain though of why it is even accessing a column
> family's
> > > files that should be excluded based on the Scan. And that column family
> > > does NOT specify prefetch on open in its schema. Only the one we want
> to
> > > read specifies prefetch on open, which we want to override if possible
> > for
> > > the Spark job.
> > >
> > > 
> > > Saad
> > >
> > >
> > > On Sat, Mar 10, 2018 at 9:51 PM, Saad Mufti <saad.mu...@gmail.com>
> > wrote:
> > >
> > >> See below more I found on item 3.
> > >>
> > >> Cheers.
> > >>
> > >> 
> > >> Saad
> > >>
> > >> On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com>
> > wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS.
> There
> > >>> is no Hbase installed on the cluster, only HBase libs linked to my
> > Spark
> > >>> app. We are reading the snapshot info from a HBase folder in S3 using
> > >>> TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job
> > read
> > >>> snapshot info directly from the S3 based filesystem instead of going
> > >>> through any region server.
> > >>>
> > >>> I have observed a few behaviors while debugging performance that are
> > >>> concerning, some we could mitigate and other I am looking for clarity
> > on:
> > >>>
> > >>> 1)  the TableSnapshotInputFormatImpl code is trying to get locality
> > >>> information for the region splits, for a snapshots with a large
> number
> > of
> > >>> files (over 35 in our case) this causing single threaded scan of
> > all
> > >>> the file listings in a single thread in the driver. And it was
> useless
> > >>> because there is really no useful locality information to glean since
> > all
> > >>> the files are in S3 and not HDFS. So I was forced to make a copy of
> > >>> TableSnapshotInputFormatImpl.java in our code and control this with
> a
> > >>> config setting I made up. That got rid of the hours long scan, so I
> am
> > good
> > >>> with this part for now.
> > >>>
> > >>> 2) I have set a single column family in the Scan that I set on the
> > hbase
> > >>> configuration via
> > >>>
> > >>> scan.addFamily(str.getBytes()))
> > >>>
> > >>> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
> > >>>
> > >>>
> > >>> But when this code is executing u

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

2018-03-12 Thread Saad Mufti
I have created a company-specific branch and added 4 new flags to control
this behavior; these gave us a huge performance boost when running Spark
jobs on snapshots of very large tables in S3. I tried to do everything
cleanly, but

a) not being familiar with the whole test strategy, I haven't had time to
add any useful tests, though of course I left the default behavior the
same, and a lot of the behavior I control with these flags only affects
performance, not the final result, so I would need some pointers on how to
add useful tests
b) I added a new flag to be an overall override for prefetch behavior that
overrides any setting, even the one in the column family descriptor; I am not
sure whether what I did was entirely in the spirit of what HBase does

Again, these flags, if used properly, would only impact Spark or M-R jobs that use
TableSnapshotInputFormat. Would someone from the
core team be willing to look at my patch? I have never done this before, so I
would appreciate a quick pointer on how to send a patch and get some quick
feedback.

Cheers.


Saad



On Sat, Mar 10, 2018 at 9:56 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> The question remain though of why it is even accessing a column family's
> files that should be excluded based on the Scan. And that column family
> does NOT specify prefetch on open in its schema. Only the one we want to
> read specifies prefetch on open, which we want to override if possible for
> the Spark job.
>
> 
> Saad
>
>
> On Sat, Mar 10, 2018 at 9:51 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> See below more I found on item 3.
>>
>> Cheers.
>>
>> 
>> Saad
>>
>> On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There
>>> is no Hbase installed on the cluster, only HBase libs linked to my Spark
>>> app. We are reading the snapshot info from a HBase folder in S3 using
>>> TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
>>> snapshot info directly from the S3 based filesystem instead of going
>>> through any region server.
>>>
>>> I have observed a few behaviors while debugging performance that are
>>> concerning, some we could mitigate and other I am looking for clarity on:
>>>
>>> 1)  the TableSnapshotInputFormatImpl code is trying to get locality
>>> information for the region splits, for a snapshots with a large number of
>>> files (over 35 in our case) this causing single threaded scan of all
>>> the file listings in a single thread in the driver. And it was useless
>>> because there is really no useful locality information to glean since all
>>> the files are in S3 and not HDFS. So I was forced to make a copy of
>>> TableSnapshotInputFormatImpl.java in our code and control this with a
>>> config setting I made up. That got rid of the hours long scan, so I am good
>>> with this part for now.
>>>
>>> 2) I have set a single column family in the Scan that I set on the hbase
>>> configuration via
>>>
>>> scan.addFamily(str.getBytes()))
>>>
>>> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
>>>
>>>
>>> But when this code is executing under Spark and I observe the threads
> >>> and logs on Spark executors, I see it is reading from S3 files for a column
>>> family that was not included in the scan. This column family was
>>> intentionally excluded because it is much larger than the others and so we
>>> wanted to avoid the cost.
>>>
>>> Any advice on what I am doing wrong would be appreciated.
>>>
>>> 3) We also explicitly set caching of blocks to false on the scan,
>>> although I see that in TableSnapshotInputFormatImpl.java it is again
>>> set to false internally also. But when running the Spark job, some
>>> executors were taking much longer than others, and when I observe their
>>> threads, I see periodic messages about a few hundred megs of RAM used by
>>> the block cache, and the thread is sitting there reading data from S3, and
>>> is occasionally blocked a couple of other threads that have the
>>> "hfile-prefetcher" name in them. Going back to 2) above, they seem to be
>>> reading the wrong column family, but in this item I am more concerned about
>>> why they appear to be prefetching blocks and caching them, when the Scan
>>> object has a setting to not cache blocks at all?
>>>
>>
>> I think I figured out item 3, the column family descriptor for the table
>> in question has prefetch on open set in its schema. Now for the Spark job,
>> I don't think this serves any useful purpose does it? But I can't see any
>> way to override it. If these is, I'd appreciate some advice.
>>
>
>> Thanks.
>>
>>
>>>
>>> Thanks in advance for any insights anyone can provide.
>>>
>>> 
>>> Saad
>>>
>>>
>>
>>
>


Re: How Long Will HBase Hold A Row Write Lock?

2018-03-11 Thread Saad Mufti
Thanks. I left a comment on that ticket.


Saad


On Sat, Mar 10, 2018 at 11:57 PM, Anoop John <anoop.hb...@gmail.com> wrote:

> Hi Saad
>In your initial mail you mentioned that there are lots
> of checkAndPut ops but on different rows. The failure in obtaining
> locks (write lock as it is checkAndPut) means there is contention on
> the same row key.  If that is the case , ya that is the 1st step
> before BC reads and it make sense.
>
> On the Q on why not caching the compacted file content, yes it is this
> way. Even if cache on write is true. This is because some times the
> compacted result file could be so large (what is major compaction) and
> that will exhaust the BC if written. Also it might contain some data
> which are very old.  There is a recently raised JIRA which
> discusses this.  Pls see HBASE-20045
>
>
> -Anoop-
>
> On Sun, Mar 11, 2018 at 7:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> > Although now that I think about this a bit more, all the failures we saw
> > were failure to obtain a row lock, and in the thread stack traces we
> always
> > saw it somewhere inside getRowLockInternal and similar. Never saw any
> > contention on bucket cache lock that I could see.
> >
> > Cheers.
> >
> > 
> > Saad
> >
> >
> > On Sat, Mar 10, 2018 at 8:04 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >
> >> Also, for now we have mitigated this problem by using the new setting in
> >> HBase 1.4.0 that prevents one slow region server from blocking all
> client
> >> requests. Of course it causes some timeouts but our overall ecosystem
> >> contains Kafka queues for retries, so we can live with that. From what I
> >> can see, it looks like this setting also has the good effect of
> preventing
> >> clients from hammering a region server that is slow because its IPC
> queues
> >> are backed up, allowing it to recover faster.
> >>
> >> Does that make sense?
> >>
> >> Cheers.
> >>
> >> 
> >> Saad
> >>
> >>
> >> On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >>
> >>> So if I understand correctly, we would mitigate the problem by not
> >>> evicting blocks for archived files immediately? Wouldn't this
> potentially
> >>> lead to problems later if the LRU algo chooses to evict blocks for
> active
> >>> files and leave blocks for archived files in there?
> >>>
> >>> I would definitely love to test this!!! Unfortunately we are running on
> >>> EMR and the details of how to patch HBase under EMR are not clear to
> me :-(
> >>>
> >>> What we would really love would be a setting for actually immediately
> >>> caching blocks for a new compacted file. I have seen in the code that
> even
> >>> is we have the cache on write setting set to true, it will refuse to
> cache
> >>> blocks for a file that is a newly compacted one. In our case we have
> sized
> >>> the bucket cache to be big enough to hold all our data, and really
> want to
> >>> avoid having to go to S3 until the last possible moment. A config
> setting
> >>> to test this would be great.
> >>>
> >>> But thanks everyone for your feedback. Any more would also be welcome
> on
> >>> the idea to let a user cache all newly compacted files.
> >>>
> >>> 
> >>> Saad
> >>>
> >>>
> >>> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com>
> >>> wrote:
> >>>
> >>>> >>a) it was indeed one of the regions that was being compacted, major
> >>>> compaction in one case, minor compaction in another, the issue started
> >>>> just
> >>>> after compaction completed blowing away bucket cached blocks for the
> >>>> older
> >>>> HFile's
> >>>>
> >>>> About this part.Ya after the compaction, there is a step where the
> >>>> compacted away HFile's blocks getting removed from cache. This op
> takes a
> >>>> write lock for this region (In Bucket Cache layer)..  Every read op
> which
> >>>> is part of checkAndPut will try read from BC and that in turn need a
> read
> >>>> lock for this region.  So there is chances that the read locks starve
> >>>> because of so many frequent write locks .  Each block evict will
> attain
> 

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

2018-03-10 Thread Saad Mufti
The question remains, though, of why it is even accessing a column family's
files that should be excluded based on the Scan. And that column family
does NOT specify prefetch on open in its schema. Only the one we want to
read specifies prefetch on open, which we want to override if possible for
the Spark job.


Saad

On Sat, Mar 10, 2018 at 9:51 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> See below more I found on item 3.
>
> Cheers.
>
> 
> Saad
>
> On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> Hi,
>>
>> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
>> no Hbase installed on the cluster, only HBase libs linked to my Spark app.
>> We are reading the snapshot info from a HBase folder in S3 using
>> TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
>> snapshot info directly from the S3 based filesystem instead of going
>> through any region server.
>>
>> I have observed a few behaviors while debugging performance that are
>> concerning, some we could mitigate and other I am looking for clarity on:
>>
>> 1)  the TableSnapshotInputFormatImpl code is trying to get locality
>> information for the region splits, for a snapshots with a large number of
>> files (over 35 in our case) this causing single threaded scan of all
>> the file listings in a single thread in the driver. And it was useless
>> because there is really no useful locality information to glean since all
>> the files are in S3 and not HDFS. So I was forced to make a copy of
>> TableSnapshotInputFormatImpl.java in our code and control this with a
>> config setting I made up. That got rid of the hours long scan, so I am good
>> with this part for now.
>>
>> 2) I have set a single column family in the Scan that I set on the hbase
>> configuration via
>>
>> scan.addFamily(str.getBytes()))
>>
>> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
>>
>>
>> But when this code is executing under Spark and I observe the threads and
>> logs on Spark executors, I see it is reading from S3 files for a column family
>> that was not included in the scan. This column family was intentionally
>> excluded because it is much larger than the others and so we wanted to
>> avoid the cost.
>>
>> Any advice on what I am doing wrong would be appreciated.
>>
>> 3) We also explicitly set caching of blocks to false on the scan,
>> although I see that in TableSnapshotInputFormatImpl.java it is again set
>> to false internally also. But when running the Spark job, some executors
>> were taking much longer than others, and when I observe their threads, I
>> see periodic messages about a few hundred megs of RAM used by the block
>> cache, and the thread is sitting there reading data from S3, and is
>> occasionally blocked a couple of other threads that have the
>> "hfile-prefetcher" name in them. Going back to 2) above, they seem to be
>> reading the wrong column family, but in this item I am more concerned about
>> why they appear to be prefetching blocks and caching them, when the Scan
>> object has a setting to not cache blocks at all?
>>
>
> I think I figured out item 3, the column family descriptor for the table
> in question has prefetch on open set in its schema. Now for the Spark job,
> I don't think this serves any useful purpose does it? But I can't see any
> way to override it. If these is, I'd appreciate some advice.
>

> Thanks.
>
>
>>
>> Thanks in advance for any insights anyone can provide.
>>
>> 
>> Saad
>>
>>
>
>


Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

2018-03-10 Thread Saad Mufti
See below more I found on item 3.

Cheers.


Saad

On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Hi,
>
> I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
> no Hbase installed on the cluster, only HBase libs linked to my Spark app.
> We are reading the snapshot info from a HBase folder in S3 using
> TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
> snapshot info directly from the S3 based filesystem instead of going
> through any region server.
>
> I have observed a few behaviors while debugging performance that are
> concerning; some we could mitigate, and for others I am looking for clarity:
>
> 1) The TableSnapshotInputFormatImpl code tries to get locality
> information for the region splits. For a snapshot with a large number of
> files (over 35 in our case) this causes a single-threaded scan of all
> the file listings in the driver. And it was useless, because there is
> really no useful locality information to glean since all the files are
> in S3 and not HDFS. So I was forced to make a copy of
> TableSnapshotInputFormatImpl.java in our code and control this with a
> config setting I made up. That got rid of the hours-long scan, so I am good
> with this part for now.
>
> 2) I have set a single column family in the Scan that I set on the hbase
> configuration via
>
> scan.addFamily(str.getBytes()))
>
> hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))
>
>
> But when this code is executing under Spark and I observe the threads and
> logs on the Spark executors, I see it reading S3 files for a column family
> that was not included in the scan. This column family was intentionally
> excluded because it is much larger than the others and so we wanted to
> avoid the cost.
>
> Any advice on what I am doing wrong would be appreciated.
>
> 3) We also explicitly set caching of blocks to false on the scan, although
> I see that in TableSnapshotInputFormatImpl.java it is again set to false
> internally as well. But when running the Spark job, some executors were
> taking much longer than others, and when I observe their threads, I see
> periodic messages about a few hundred megs of RAM used by the block cache,
> and the thread is sitting there reading data from S3, and is occasionally
> blocked by a couple of other threads that have the "hfile-prefetcher" name
> in them. Going back to 2) above, they seem to be reading the wrong column
> family, but in this item I am more concerned about why they appear to be
> prefetching blocks and caching them when the Scan object is set not to
> cache blocks at all.
>

I think I figured out item 3: the column family descriptor for the table in
question has prefetch-on-open set in its schema. Now for the Spark job, I
don't think this serves any useful purpose, does it? But I can't see any way
to override it. If there is, I'd appreciate some advice.
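(For illustration, a minimal sketch of checking and clearing the per-family
prefetch-on-open flag with the HBase 1.x admin API; the table and family
names are placeholders, and for a snapshot-based job the descriptor travels
with the snapshot, so altering the live table would only help snapshots taken
afterwards.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefetchOnOpenCheck {
  public static void main(String[] args) throws Exception {
    TableName table = TableName.valueOf("my_table");   // placeholder
    byte[] family = Bytes.toBytes("small_cf");         // placeholder
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HColumnDescriptor cf = admin.getTableDescriptor(table).getFamily(family);
      System.out.println("PREFETCH_BLOCKS_ON_OPEN = " + cf.isPrefetchBlocksOnOpen());
      if (cf.isPrefetchBlocksOnOpen()) {
        // Per-family schema flag; a job-level config setting cannot turn it off.
        cf.setPrefetchBlocksOnOpen(false);
        admin.modifyColumn(table, cf);  // only affects snapshots taken after this
      }
    }
  }
}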

Thanks.


>
> Thanks in advance for any insights anyone can provide.
>
> 
> Saad
>
>


Re: How Long Will HBase Hold A Row Write Lock?

2018-03-10 Thread Saad Mufti
Although now that I think about this a bit more, all the failures we saw
were failures to obtain a row lock, and in the thread stack traces we always
saw it somewhere inside getRowLockInternal and similar. I never saw any
contention on the bucket cache lock.

Cheers.


Saad


On Sat, Mar 10, 2018 at 8:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Also, for now we have mitigated this problem by using the new setting in
> HBase 1.4.0 that prevents one slow region server from blocking all client
> requests. Of course it causes some timeouts but our overall ecosystem
> contains Kafka queues for retries, so we can live with that. From what I
> can see, it looks like this setting also has the good effect of preventing
> clients from hammering a region server that is slow because its IPC queues
> are backed up, allowing it to recover faster.
>
> Does that make sense?
>
> Cheers.
>
> 
> Saad
>
>
> On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> So if I understand correctly, we would mitigate the problem by not
>> evicting blocks for archived files immediately? Wouldn't this potentially
>> lead to problems later if the LRU algo chooses to evict blocks for active
>> files and leave blocks for archived files in there?
>>
>> I would definitely love to test this!!! Unfortunately we are running on
>> EMR and the details of how to patch HBase under EMR are not clear to me :-(
>>
>> What we would really love would be a setting for actually immediately
>> caching blocks for a new compacted file. I have seen in the code that even
>> if we have the cache-on-write setting set to true, it will refuse to cache
>> blocks for a file that is a newly compacted one. In our case we have sized
>> the bucket cache to be big enough to hold all our data, and really want to
>> avoid having to go to S3 until the last possible moment. A config setting
>> to test this would be great.
>>
>> But thanks everyone for your feedback. Any more would also be welcome on
>> the idea to let a user cache all newly compacted files.
>>
>> 
>> Saad
>>
>>
>> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com>
>> wrote:
>>
>>> >>a) it was indeed one of the regions that was being compacted, major
>>> compaction in one case, minor compaction in another, the issue started
>>> just
>>> after compaction completed blowing away bucket cached blocks for the
>>> older
>>> HFile's
>>>
>>> About this part: yes, after the compaction there is a step where the
>>> compacted-away HFile's blocks get removed from the cache. This op takes a
>>> write lock for this region (in the bucket cache layer). Every read op that
>>> is part of a checkAndPut will try to read from the bucket cache, and that
>>> in turn needs a read lock for this region. So there is a chance that the
>>> read locks starve because of so many frequent write locks; each block
>>> evict acquires the write lock one after the other. Would it be possible
>>> for you to patch this evict and test once? We can avoid the immediate
>>> evict from the bucket cache after compaction. I can help you with a patch
>>> if you wish.
>>>
>>> Anoop
>>>
>>>
>>>
>>> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <
>>> ramkrishna.s.vasude...@gmail.com> wrote:
>>> > Hi Saad
>>> >
>>> > Your argument here
>>> >>> The
>>> >>>theory is that since prefetch is an async operation, a lot of the
>>> reads
>>> in
>>> >>>the checkAndPut for the region in question start reading from S3
>>> which is
>>> >>>slow. So the write lock obtained for the checkAndPut is held for a
>>> longer
>>> >>>duration than normal. This has cascading upstream effects. Does that
>>> sound
>>> >>>plausible?
>>> >
>>> > Seems very much plausible. So before even the prefetch happens, say for
>>> > 'block 1' - and you have already issued N checkAndPut calls for the rows
>>> > in that 'block 1' - all those checkAndPut will have to read that block
>>> > from S3 to perform the get() and then apply the mutation.
>>> >
>>> > This may happen for multiple threads at the same time because we are not
>>> > sure when the prefetch would have actually been completed. I don't know
>>> > what are the general read characteri

Re: HBase failed on local exception and failed servers list.

2018-03-10 Thread Saad Mufti
Are you using the AuthUtil class to reauthenticate? This class is in HBase,
and it uses the Hadoop class UserGroupInformation to do the actual login and
re-login. But if your UserGroupInformation class is from Hadoop 2.5.1 or
earlier, it has a bug if you are using Java 8, as most of us are. The
relogin code uses a test to decide whether the login is Kerberos/keytab
based; that test used to pass on Java 7 but fails on Java 8, because it
checks for a specific class in the underlying list of Kerberos objects
assigned to your principal, and that class has disappeared in the Java 8
implementation. We fixed this by explicitly upgrading our Hadoop dependency
to a newer version, 2.6.1 in our case, where the problem is fixed.

If this is the condition affecting your application, it is an easy enough
fix.
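
(For illustration, a minimal sketch of the keytab login and periodic
re-login pattern described above, using UserGroupInformation calls available
in Hadoop 2.6+; the principal and keytab path are placeholders.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabRelogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation.loginUserFromKeytab(
        "app/host@EXAMPLE.COM",                 // placeholder principal
        "/etc/security/keytabs/app.keytab");    // placeholder keytab path

    // Call this periodically (e.g. from a scheduled thread) before issuing
    // HBase RPCs; it re-logs in from the keytab only when the TGT is close
    // to expiry.
    UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
  }
}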

Hope this helps.

Cheers.


Saad



On Tue, Feb 27, 2018 at 1:16 PM, apratim sharma 
wrote:

> Hi Guys,
>
> I am using HBase 1.2.0 on a Kerberos-secured Cloudera CDH 5.8 cluster.
> I have a persistent application that authenticates using a keytab and
> creates an HBase connection. Our code also takes care of reauthentication
> and recreating broken connections.
> The code worked fine in previous versions of HBase. However, what we see
> with HBase 1.2 is that after 24 hours the HBase connection stops working,
> giving the following error:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Tue Feb 13 12:57:51 PST 2018,
> RpcRetryingCaller{globalStartTime=1518555467140, pause=100, retries=2},
> org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to
> pdmcdh01.xyz.com/192.168.145.62:60020 failed on local exception:
> org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection
> to pdmcdh01.xyz.com/192.168.145.62:60020 is closing. Call id=137,
> waitTime=11
> Tue Feb 13 12:58:01 PST 2018,
> RpcRetryingCaller{globalStartTime=1518555467140, pause=100, retries=2},
> org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to
> pdmcdh01.xyz.com/192.168.145.62:60020 failed on local exception:
> org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection
> to pdmcdh01.xyz.com/192.168.145.62:60020 is closing. Call id=139,
> waitTime=13
>
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(
> RpcRetryingCaller.java:147)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:901)
> Our code reauthenticates and creates the connection again, but it still
> keeps failing:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=2, exceptions:
> Wed Feb 21 14:30:31 PST 2018,
> RpcRetryingCaller{globalStartTime=1519252219159, pause=100, retries=2},
> java.io.IOException: Couldn't setup connection for p...@hadoop.xyz.com to
> hbase/pdmcdh01.xyz@hadoop.xyz.com
> Wed Feb 21 14:30:31 PST 2018,
> RpcRetryingCaller{globalStartTime=1519252219159, pause=100, retries=2},
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the
> failed servers list: pdmcdh01.xyz.com/192.168.145.62:60020
>
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(
> RpcRetryingCaller.java:147)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:935)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:901)
> I know that the client keeps a server in the failed-servers list for a few
> seconds in order to avoid too many connection attempts. So I waited and
> tried again after some time, but got the same error.
> Once we restart our application everything starts working fine again for
> next 24 hours.
>
> This 24-hour gap suggests it could be something related to the Kerberos
> ticket expiry time; however, there is no log indicating a Kerberos
> authentication issue.
> Moreover, we are handling the exception and trying to authenticate and
> create the connection again, but nothing works until we restart the JVM.
> This is very strange.
>
> I would really appreciate any help or pointers on this issue.
>
> Thanks a lot
> Apratim
>


Re: How Long Will HBase Hold A Row Write Lock?

2018-03-10 Thread Saad Mufti
Also, for now we have mitigated this problem by using the new setting in
HBase 1.4.0 that prevents one slow region server from blocking all client
requests. Of course it causes some timeouts but our overall ecosystem
contains Kafka queues for retries, so we can live with that. From what I
can see, it looks like this setting also has the good effect of preventing
clients from hammering a region server that is slow because its IPC queues
are backed up, allowing it to recover faster.

Does that make sense?

Cheers.


Saad


On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> So if I understand correctly, we would mitigate the problem by not
> evicting blocks for archived files immediately? Wouldn't this potentially
> lead to problems later if the LRU algo chooses to evict blocks for active
> files and leave blocks for archived files in there?
>
> I would definitely love to test this!!! Unfortunately we are running on
> EMR and the details of how to patch HBase under EMR are not clear to me :-(
>
> What we would really love would be a setting for actually immediately
> caching blocks for a new compacted file. I have seen in the code that even
> if we have the cache-on-write setting set to true, it will refuse to cache
> blocks for a file that is a newly compacted one. In our case we have sized
> the bucket cache to be big enough to hold all our data, and really want to
> avoid having to go to S3 until the last possible moment. A config setting
> to test this would be great.
>
> But thanks everyone for your feedback. Any more would also be welcome on
> the idea to let a user cache all newly compacted files.
>
> 
> Saad
>
>
> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com> wrote:
>
>> >>a) it was indeed one of the regions that was being compacted, major
>> compaction in one case, minor compaction in another, the issue started
>> just
>> after compaction completed blowing away bucket cached blocks for the older
>> HFile's
>>
>> About this part: yes, after the compaction there is a step where the
>> compacted-away HFile's blocks get removed from the cache. This op takes a
>> write lock for this region (in the bucket cache layer). Every read op that
>> is part of a checkAndPut will try to read from the bucket cache, and that
>> in turn needs a read lock for this region. So there is a chance that the
>> read locks starve because of so many frequent write locks; each block evict
>> acquires the write lock one after the other. Would it be possible for you
>> to patch this evict and test once? We can avoid the immediate evict from
>> the bucket cache after compaction. I can help you with a patch if you wish.
>>
>> Anoop
>>
>>
>>
>> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <
>> ramkrishna.s.vasude...@gmail.com> wrote:
>> > Hi Saad
>> >
>> > Your argument here
>> >>> The
>> >>>theory is that since prefetch is an async operation, a lot of the reads
>> in
>> >>>the checkAndPut for the region in question start reading from S3 which
>> is
>> >>>slow. So the write lock obtained for the checkAndPut is held for a
>> longer
>> >>>duration than normal. This has cascading upstream effects. Does that
>> sound
>> >>>plausible?
>> >
>> > Seems very much plausible. So before even the prefetch happens, say for
>> > 'block 1' - and you have already issued N checkAndPut calls for the rows
>> > in that 'block 1' - all those checkAndPut will have to read that block
>> > from S3 to perform the get() and then apply the mutation.
>> >
>> > This may happen for multiple threads at the same time because we are not
>> > sure when the prefetch would have actually been completed. I don't know
>> > what the general read characteristics are when a read happens from S3,
>> > but you could try to see how things work when a read happens from S3,
>> > and after the prefetch completes ensure the same checkAndPut() is done
>> > (from cache this time), to really know what difference S3 makes there.
>> >
>> > Regards
>> > Ram
>> >
>> > On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>> >
>> >> So after much investigation I can confirm:
>> >>
>> >> a) it was indeed one of the regions that was being compacted, major
>> >> compaction in one case, minor compaction in another, the issue started
>> just
>> >> after compaction completed blowing away bu

TableSnapshotInputFormat Behavior In HBase 1.4.0

2018-03-10 Thread Saad Mufti
Hi,

I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
no Hbase installed on the cluster, only HBase libs linked to my Spark app.
We are reading the snapshot info from a HBase folder in S3 using
TableSnapshotInputFormat class from HBase 1.4.0 to have the Spark job read
snapshot info directly from the S3 based filesystem instead of going
through any region server.

I have observed a few behaviors while debugging performance that are
concerning; some we could mitigate, and for others I am looking for clarity:

1) The TableSnapshotInputFormatImpl code tries to get locality
information for the region splits. For a snapshot with a large number of
files (over 35 in our case) this causes a single-threaded scan of all
the file listings in the driver. And it was useless, because there is
really no useful locality information to glean since all the files are
in S3 and not HDFS. So I was forced to make a copy of
TableSnapshotInputFormatImpl.java in our code and control this with a
config setting I made up. That got rid of the hours-long scan, so I am good
with this part for now.
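
(Purely illustrative: the kind of switch described above inside a forked
copy of the input format; the config key here is hypothetical, mirroring the
home-grown setting, and is not an upstream HBase option.)

import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

public class LocalityGuard {
  // Returns the "preferred hosts" for a split; empty when locality is disabled.
  static List<String> hostsForSplit(Configuration conf) {
    // Hypothetical key, matching the home-grown switch described above.
    if (!conf.getBoolean("custom.snapshot.locality.enabled", false)) {
      return Collections.emptyList();  // S3-backed snapshots have no useful locality
    }
    // ...otherwise walk the HDFS block locations as the stock implementation does.
    return Collections.emptyList();
  }
}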

2) I have set a single column family in the Scan that I set on the hbase
configuration via

scan.addFamily(str.getBytes()))

hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))


But when this code is executing under Spark and I observe the threads and
logs on the Spark executors, I see it reading S3 files for a column family
that was not included in the scan. This column family was intentionally
excluded because it is much larger than the others and so we wanted to
avoid the cost.

Any advice on what I am doing wrong would be appreciated.
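
(For reference, a minimal sketch of the scan/snapshot setup described above,
assuming HBase 1.4 APIs; the snapshot name, paths and family name are
placeholders, and convertScanToString() mirrors the helper referenced above,
doing what TableMapReduceUtil does internally.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotScanSetup {
  // Serialize the Scan into the config string the input format expects.
  static String convertScanToString(Scan scan) throws IOException {
    return Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray());
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.rootdir", "s3://my-bucket/hbase");  // assumption: S3-backed root dir

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("small_cf"));  // only the small column family
    scan.setCacheBlocks(false);                 // also forced off internally by the input format
    conf.set(TableInputFormat.SCAN, convertScanToString(scan));

    // The restore dir is scratch space the input format uses for reference files.
    Job job = Job.getInstance(conf);
    TableSnapshotInputFormat.setInput(job, "my_snapshot",
        new Path("s3://my-bucket/tmp/restore-dir"));
    // job.getConfiguration() can then be handed to Spark's newAPIHadoopRDD(...).
  }
}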

3) We also explicitly set caching of blocks to false on the scan, although
I see that in TableSnapshotInputFormatImpl.java it is again set to false
internally as well. But when running the Spark job, some executors were taking
much longer than others, and when I observe their threads, I see periodic
messages about a few hundred megs of RAM used by the block cache, and the
thread is sitting there reading data from S3, and is occasionally blocked by a
couple of other threads that have the "hfile-prefetcher" name in them.
Going back to 2) above, they seem to be reading the wrong column family,
but in this item I am more concerned about why they appear to be
prefetching blocks and caching them when the Scan object is set not to
cache blocks at all.

Thanks in advance for any insights anyone can provide.


Saad


Re: How Long Will HBase Hold A Row Write Lock?

2018-03-10 Thread Saad Mufti
So if I understand correctly, we would mitigate the problem by not evicting
blocks for archived files immediately? Wouldn't this potentially lead to
problems later if the LRU algo chooses to evict blocks for active files and
leave blocks for archived files in there?

I would definitely love to test this!!! Unfortunately we are running on EMR
and the details of how to patch HBase under EMR are not clear to me :-(

What we would really love would be a setting for immediately caching blocks
for a newly compacted file. I have seen in the code that even if we have the
cache-on-write setting set to true, it will refuse to cache blocks for a file
that is a newly compacted one. In our case we have sized the bucket cache to
be big enough to hold all our data, and we really want to avoid having to go
to S3 until the last possible moment. A config setting to test this would be
great.
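
(A hedged sketch of the region-server cache settings being discussed; the
key names are as I recall them from CacheConfig, so verify them against your
version, and note that, as observed above, cache-on-write in this version
still skips blocks written by compaction.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CacheTuning {
  public static Configuration regionServerConf() {
    Configuration conf = HBaseConfiguration.create();
    conf.setBoolean("hbase.rs.cacheblocksonwrite", true);    // cache data blocks as written (flushes only here)
    conf.setBoolean("hbase.rs.evictblocksonclose", false);   // keep blocks cached when a store closes
    conf.setBoolean("hbase.rs.prefetchblocksonopen", false); // global default; a CF-level true still wins
    return conf;
  }
}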

But thanks everyone for your feedback. Any more would also be welcome on
the idea to let a user cache all newly compacted files.


Saad


On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com> wrote:

> >>a) it was indeed one of the regions that was being compacted, major
> compaction in one case, minor compaction in another, the issue started just
> after compaction completed blowing away bucket cached blocks for the older
> HFile's
>
> About this part: yes, after the compaction there is a step where the
> compacted-away HFile's blocks get removed from the cache. This op takes a
> write lock for this region (in the bucket cache layer). Every read op that
> is part of a checkAndPut will try to read from the bucket cache, and that
> in turn needs a read lock for this region. So there is a chance that the
> read locks starve because of so many frequent write locks; each block evict
> acquires the write lock one after the other. Would it be possible for you
> to patch this evict and test once? We can avoid the immediate evict from
> the bucket cache after compaction. I can help you with a patch if you wish.
>
> Anoop
>
>
>
> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com> wrote:
> > Hi Saad
> >
> > Your argument here
> >>> The
> >>>theory is that since prefetch is an async operation, a lot of the reads
> in
> >>>the checkAndPut for the region in question start reading from S3 which
> is
> >>>slow. So the write lock obtained for the checkAndPut is held for a
> longer
> >>>duration than normal. This has cascading upstream effects. Does that
> sound
> >>>plausible?
> >
> > Seems very much plausible. So before even the prefetch happens, say for
> > 'block 1' - and you have already issued N checkAndPut calls for the rows
> > in that 'block 1' - all those checkAndPut will have to read that block
> > from S3 to perform the get() and then apply the mutation.
> >
> > This may happen for multiple threads at the same time because we are not
> > sure when the prefetch would have actually been completed. I don't know
> > what the general read characteristics are when a read happens from S3,
> > but you could try to see how things work when a read happens from S3,
> > and after the prefetch completes ensure the same checkAndPut() is done
> > (from cache this time), to really know what difference S3 makes there.
> >
> > Regards
> > Ram
> >
> > On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> >> So after much investigation I can confirm:
> >>
> >> a) it was indeed one of the regions that was being compacted, major
> >> compaction in one case, minor compaction in another, the issue started
> just
> >> after compaction completed blowing away bucket cached blocks for the
> older
> >> HFile's
> >> b) in another case there was no compaction just a newly opened region in
> a
> >> region server that hadn't finished perfetching its pages from S3
> >>
> >> We have prefetch on open set to true. Our load is heavy on checkAndPut.
> >> The
> >> theory is that since prefetch is an async operation, a lot of the reads
> in
> >> the checkAndPut for the region in question start reading from S3 which
> is
> >> slow. So the write lock obtained for the checkAndPut is held for a
> longer
> >> duration than normal. This has cascading upstream effects. Does that
> sound
> >> plausible?
> >>
> >> The part I don't understand still is all the locks held are for the same
> >> region but are all for different rows. So once the prefetch is
> completed,
> >> shouldn't the problem clear up quickly? Or does the slow region slow
&

Re: How Long Will HBase Hold A Row Write Lock?

2018-03-01 Thread Saad Mufti
So after much investigation I can confirm:

a) it was indeed one of the regions that was being compacted, major
compaction in one case, minor compaction in another; the issue started just
after compaction completed, blowing away bucket-cached blocks for the older
HFiles
b) in another case there was no compaction, just a newly opened region in a
region server that hadn't finished prefetching its pages from S3

We have prefetch on open set to true. Our load is heavy on checkAndPut. The
theory is that since prefetch is an async operation, a lot of the reads in
the checkAndPut for the region in question start reading from S3, which is
slow. So the write lock obtained for the checkAndPut is held for a longer
duration than normal. This has cascading upstream effects. Does that sound
plausible?

The part I still don't understand is that all the locks held are for the same
region but are all for different rows. So once the prefetch is completed,
shouldn't the problem clear up quickly? Or does the slow region slow down
anyone trying to do checkAndPut on any row in the same region even after
the prefetch has completed? That is, do the long-held row locks prevent
others from getting a row lock on a different row in the same region?

In any case, we are trying to use
https://issues.apache.org/jira/browse/HBASE-16388 support in HBase 1.4.0 both
to insulate the app a bit from this situation and in the hope that it will
reduce pressure on the region server in question, allowing it to recover
faster. I haven't quite tested that yet; any advice in the meantime would
be appreciated.
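
(A hedged sketch of enabling the HBASE-16388 client-side protection; the
threshold value here is arbitrary, and the key name should be verified
against the 1.4.0 client before relying on it.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PerServerThreshold {
  public static Configuration clientConf() {
    Configuration conf = HBaseConfiguration.create();
    // Fail fast once this many requests are already outstanding to a single
    // region server, instead of letting one slow server block all clients.
    conf.setInt("hbase.client.perserver.requests.threshold", 2048);
    return conf;
  }
}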

Cheers.


Saad



On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Actually it happened again while some minor compactions were running, so I
> don't think it is related to our major compaction tool, which isn't even
> running right now. I will try to capture a debug dump of threads and
> everything while the event is ongoing. It seems to last at least half an
> hour or so and sometimes longer.
>
> 
> Saad
>
>
> On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> Unfortunately I lost the stack trace overnight. But it does seem related
>> to compaction, because now that the compaction tool is done, I don't see
>> the issue anymore. I will run our incremental major compaction tool again
>> and see if I can reproduce the issue.
>>
>> On the plus side the system stayed stable and eventually recovered,
>> although it did suffer all those timeouts.
>>
>> 
>> Saad
>>
>>
>> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>>
>>> I'll paste a thread dump later, writing this from my phone  :-)
>>>
>>> So the same issue has happened at different times for different regions,
>>> but I couldn't see that the region in question was the one being compacted,
>>> either this time or earlier. Although I might have missed an earlier
>>> correlation in the logs where the issue started just after the compaction
>>> completed.
>>>
>> Usually a compaction for this table's regions takes around 5-10 minutes,
>>> much less for its smaller column family which is block cache enabled,
>>> around a minute or less, and 5-10 minutes for the much larger one for which
>>> we have block cache disabled in the schema, because we don't ever read it
>>> in the primary cluster. So the only impact on reads would be from that
>>> smaller column family which takes less than a minute to compact.
>>>
>>> But the issue once started doesn't seem to recover for a long time, long
>>> past when any compaction on the region itself could impact anything. The
>>> compaction tool which is our own code has long since moved to other
>>> regions.
>>>
>>> Cheers.
>>>
>>> 
>>> Saad
>>>
>>>
>>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> bq. timing out trying to obtain write locks on rows in that region.
>>>>
>>>> Can you confirm that the region under contention was the one being major
>>>> compacted ?
>>>>
>>>> Can you pastebin thread dump so that we can have better idea of the
>>>> scenario ?
>>>>
>>>> For the region being compacted, how long would the compaction take (just
>>>> want to see if there was correlation between this duration and timeout)
>>>> ?
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi,
&

Re: How Long Will HBase Hold A Row Write Lock?

2018-03-01 Thread Saad Mufti
Actually it happened again while some minor compactions were running, so I
don't think it is related to our major compaction tool, which isn't even
running right now. I will try to capture a debug dump of threads and
everything while the event is ongoing. It seems to last at least half an
hour or so and sometimes longer.


Saad


On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Unfortunately I lost the stack trace overnight. But it does seem related
> to compaction, because now that the compaction tool is done, I don't see
> the issue anymore. I will run our incremental major compaction tool again
> and see if I can reproduce the issue.
>
> On the plus side the system stayed stable and eventually recovered,
> although it did suffer all those timeouts.
>
> 
> Saad
>
>
> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
>> I'll paste a thread dump later, writing this from my phone  :-)
>>
>> So the same issue has happened at different times for different regions,
>> but I couldn't see that the region in question was the one being compacted,
>> either this time or earlier. Although I might have missed an earlier
>> correlation in the logs where the issue started just after the compaction
>> completed.
>>
>> Usually a compaction for this table's regions takes around 5-10 minutes,
>> much less for its smaller column family which is block cache enabled,
>> around a minute or less, and 5-10 minutes for the much larger one for which
>> we have block cache disabled in the schema, because we don't ever read it
>> in the primary cluster. So the only impact on reads would be from that
>> smaller column family which takes less than a minute to compact.
>>
>> But the issue once started doesn't seem to recover for a long time, long
>> past when any compaction on the region itself could impact anything. The
>> compaction tool which is our own code has long since moved to other
>> regions.
>>
>> Cheers.
>>
>> 
>> Saad
>>
>>
>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> bq. timing out trying to obtain write locks on rows in that region.
>>>
>>> Can you confirm that the region under contention was the one being major
>>> compacted ?
>>>
>>> Can you pastebin thread dump so that we can have better idea of the
>>> scenario ?
>>>
>>> For the region being compacted, how long would the compaction take (just
>>> want to see if there was correlation between this duration and timeout) ?
>>>
>>> Cheers
>>>
>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We are running on Amazon EMR based HBase 1.4.0 . We are currently
>>> seeing a
>>> > situation where sometimes a particular region gets into a situation
>>> where a
>>> > lot of write requests to any row in that region timeout saying they
>>> failed
>>> > to obtain a lock on a row in a region and eventually they experience
>>> an IPC
>>> > timeout. This causes the IPC queue to blow up in size as requests get
>>> > backed up, and that region server experiences a much higher than normal
>>> > timeout rate for all requests, not just those timing out for failing to
>>> > obtain the row lock.
>>> >
>>> > The strange thing is the rows are always different but the region is
>>> always
>>> > the same. So the question is, is there a region component to how long
>>> a row
>>> > write lock would be held? I looked at the debug dump and the RowLocks
>>> > section shows a long list of write row locks held, all of them are
>>> from the
>>> > same region but different rows.
>>> >
>>> > Will trying to obtain a write row lock experience delays if no one else
>>> > holds a lock on the same row but the region itself is experiencing read
>>> > delays? We do have an incremental compaction tool running that major
>>> > compacts one region per region server at a time, so that will drive out
>>> > pages from the bucket cache. But for most regions the impact is
>>> > transitional until the bucket cache gets populated by pages from the
>>> new
>>> > HFile. But for this one region we start timing out trying to obtain
>>> write
>>> > locks on rows in that region.
>>> >
>>> > Any insight anyone can provide would be most welcome.
>>> >
>>> > Cheers.
>>> >
>>> > 
>>> > Saad
>>> >
>>>
>>
>


Re: How Long Will HBase Hold A Row Write Lock?

2018-03-01 Thread Saad Mufti
Unfortunately I lost the stack trace overnight. But it does seem related to
compaction, because now that the compaction tool is done, I don't see the
issue anymore. I will run our incremental major compaction tool again and
see if I can reproduce the issue.

On the plus side the system stayed stable and eventually recovered,
although it did suffer all those timeouts.


Saad


On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> I'll paste a thread dump later, writing this from my phone  :-)
>
> So the same issue has happened at different times for different regions,
> but I couldn't see that the region in question was the one being compacted,
> either this time or earlier. Although I might have missed an earlier
> correlation in the logs where the issue started just after the compaction
> completed.
>
> Usually a compaction for this table's regions takes around 5-10 minutes,
> much less for its smaller column family which is block cache enabled,
> around a minute or less, and 5-10 minutes for the much larger one for which
> we have block cache disabled in the schema, because we don't ever read it
> in the primary cluster. So the only impact on reads would be from that
> smaller column family which takes less than a minute to compact.
>
> But the issue once started doesn't seem to recover for a long time, long
> past when any compaction on the region itself could impact anything. The
> compaction tool which is our own code has long since moved to other
> regions.
>
> Cheers.
>
> 
> Saad
>
>
> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. timing out trying to obtain write locks on rows in that region.
>>
>> Can you confirm that the region under contention was the one being major
>> compacted ?
>>
>> Can you pastebin thread dump so that we can have better idea of the
>> scenario ?
>>
>> For the region being compacted, how long would the compaction take (just
>> want to see if there was correlation between this duration and timeout) ?
>>
>> Cheers
>>
>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > We are running on Amazon EMR based HBase 1.4.0 . We are currently
>> seeing a
>> > situation where sometimes a particular region gets into a situation
>> where a
>> > lot of write requests to any row in that region timeout saying they
>> failed
>> > to obtain a lock on a row in a region and eventually they experience an
>> IPC
>> > timeout. This causes the IPC queue to blow up in size as requests get
>> > backed up, and that region server experiences a much higher than normal
>> > timeout rate for all requests, not just those timing out for failing to
>> > obtain the row lock.
>> >
>> > The strange thing is the rows are always different but the region is
>> always
>> > the same. So the question is, is there a region component to how long a
>> row
>> > write lock would be held? I looked at the debug dump and the RowLocks
>> > section shows a long list of write row locks held, all of them are from
>> the
>> > same region but different rows.
>> >
>> > Will trying to obtain a write row lock experience delays if no one else
>> > holds a lock on the same row but the region itself is experiencing read
>> > delays? We do have an incremental compaction tool running that major
>> > compacts one region per region server at a time, so that will drive out
>> > pages from the bucket cache. But for most regions the impact is
>> > transitional until the bucket cache gets populated by pages from the new
>> > HFile. But for this one region we start timing out trying to obtain
>> write
>> > locks on rows in that region.
>> >
>> > Any insight anyone can provide would be most welcome.
>> >
>> > Cheers.
>> >
>> > 
>> > Saad
>> >
>>
>


Re: How Long Will HBase Hold A Row Write Lock?

2018-02-28 Thread Saad Mufti
I'll paste a thread dump later, writing this from my phone  :-)

So the same issue has happened at different times for different regions,
but I couldn't see that the region in question was the one being compacted,
either this time or earlier. Although I might have missed an earlier
correlation in the logs where the issue started just after the compaction
completed.

Usually a compaction for this table's regions takes around 5-10 minutes,
much less for its smaller column family which is block cache enabled,
around a minute or less, and 5-10 minutes for the much larger one for which
we have block cache disabled in the schema, because we don't ever read it
in the primary cluster. So the only impact on reads would be from that
smaller column family which takes less than a minute to compact.

But the issue, once started, doesn't seem to recover for a long time, long
past the point when any compaction on the region itself could impact
anything. The compaction tool, which is our own code, has long since moved
on to other regions.

Cheers.


Saad


On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:

> bq. timing out trying to obtain write locks on rows in that region.
>
> Can you confirm that the region under contention was the one being major
> compacted ?
>
> Can you pastebin thread dump so that we can have better idea of the
> scenario ?
>
> For the region being compacted, how long would the compaction take (just
> want to see if there was correlation between this duration and timeout) ?
>
> Cheers
>
> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > We are running on Amazon EMR based HBase 1.4.0 . We are currently seeing
> a
> > situation where sometimes a particular region gets into a situation
> where a
> > lot of write requests to any row in that region timeout saying they
> failed
> > to obtain a lock on a row in a region and eventually they experience an
> IPC
> > timeout. This causes the IPC queue to blow up in size as requests get
> > backed up, and that region server experiences a much higher than normal
> > timeout rate for all requests, not just those timing out for failing to
> > obtain the row lock.
> >
> > The strange thing is the rows are always different but the region is
> always
> > the same. So the question is, is there a region component to how long a
> row
> > write lock would be held? I looked at the debug dump and the RowLocks
> > section shows a long list of write row locks held, all of them are from
> the
> > same region but different rows.
> >
> > Will trying to obtain a write row lock experience delays if no one else
> > holds a lock on the same row but the region itself is experiencing read
> > delays? We do have an incremental compaction tool running that major
> > compacts one region per region server at a time, so that will drive out
> > pages from the bucket cache. But for most regions the impact is
> > transitional until the bucket cache gets populated by pages from the new
> > HFile. But for this one region we start timing out trying to obtain write
> > locks on rows in that region.
> >
> > Any insight anyone can provide would be most welcome.
> >
> > Cheers.
> >
> > 
> > Saad
> >
>


Re: Bucket Cache Failure In HBase 1.3.1

2018-02-28 Thread Saad Mufti
I think it is for HBase itself. But I'll have to wait for more details, as
they haven't shared the source code with us. I imagine they want to do a
bunch more testing and other process work first.


Saad

On Wed, Feb 28, 2018 at 9:45 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Did the vendor say whether the patch is for hbase or some other component ?
>
> Thanks
>
> On Wed, Feb 28, 2018 at 6:33 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Thanks for the feedback, so you guys are right the bucket cache is
> getting
> > disabled due to too many I/O errors from the underlying files making up
> the
> > bucket cache. Still do not know the exact underlying cause, but we are
> > working with our vendor to test a patch they provided that seems to have
> > resolved the issue for now. They say if it works out well they will
> > eventually try to promote the patch to the open source versions.
> >
> > Cheers.
> >
> > 
> > Saad
> >
> >
> > On Sun, Feb 25, 2018 at 11:10 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Here is related code for disabling bucket cache:
> > >
> > > if (this.ioErrorStartTime > 0) {
> > >
> > >   if (cacheEnabled && (now - ioErrorStartTime) > this.
> > > ioErrorsTolerationDuration) {
> > >
> > > LOG.error("IO errors duration time has exceeded " +
> > > ioErrorsTolerationDuration +
> > >
> > >   "ms, disabling cache, please check your IOEngine");
> > >
> > > disableCache();
> > >
> > > Can you search in the region server log to see if the above occurred ?
> > >
> > > Was this server the only one with disabled cache ?
> > >
> > > Cheers
> > >
> > > On Sun, Feb 25, 2018 at 6:20 AM, Saad Mufti
> <saad.mu...@oath.com.invalid
> > >
> > > wrote:
> > >
> > > > HI,
> > > >
> > > > I am running an HBase 1.3.1 cluster on AWS EMR. The bucket cache is
> > > > configured to use two attached EBS disks of 50 GB each and I
> > provisioned
> > > > the bucket cache to be a bit less than the total, at a total of 98 GB
> > per
> > > > instance to be on the safe side. My tables have column families set
> to
> > > > prefetch on open.
> > > >
> > > > On some instances during cluster startup, the bucket cache starts
> > > throwing
> > > > errors, and eventually the bucket cache gets completely disabled on
> > this
> > > > instance. The instance still stays up as a valid region server and
> the
> > > only
> > > > clue in the region server UI is that the bucket cache tab reports a
> > count
> > > > of 0, and size of 0 bytes.
> > > >
> > > > I have already opened a ticket with AWS to see if there are problems
> > with
> > > > the EBS volumes, but wanted to tap the open source community's
> > hive-mind
> > > to
> > > > see what kind of problem would cause the bucket cache to get
> disabled.
> > If
> > > > the application depends on the bucket cache for performance, wouldn't
> > it
> > > be
> > > > better to just remove that region server from the pool if its bucket
> > > cache
> > > > cannot be recovered/enabled?
> > > >
> > > > The error look like the following. Would appreciate any insight,
> thank:
> > > >
> > > > 2018-02-25 01:12:47,780 ERROR [hfile-prefetch-1519513834057]
> > > > bucket.BucketCache: Failed reading block
> > > > 332b0634287f4c42851bc1a55ffe4042_1348128 from bucket cache
> > > > java.nio.channels.ClosedByInterruptException
> > > > at
> > > > java.nio.channels.spi.AbstractInterruptibleChannel.end(
> > > > AbstractInterruptibleChannel.java:202)
> > > > at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.
> > > > java:746)
> > > > at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
> > > > at
> > > > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$
> > > > FileReadAccessor.access(FileIOEngine.java:219)
> > > > at
> > > > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.
> > > > accessFile(FileIOEngine.java:170)
> > > > at
> > > > org.apache.hadoop.hbase.io.hfile.

Re: How Long Will HBase Hold A Row Write Lock?

2018-02-28 Thread Saad Mufti
One additional data point: I tried to manually re-assign the region in
question from the shell; for some reason that caused the region server to
restart, and the region did get assigned to another region server. But then
the problem moved to that region server almost immediately.

Does that just mean our write load is disproportionately hitting that one
region? We have a prefix scheme in place for all our keys where we prepend
an MD5-hash-based 4-digit prefix to every key to make sure we get good
randomization, so that would be surprising.
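
(A small illustration of the MD5-based 4-character prefix scheme described
above; the exact production scheme may differ.)

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedKey {
  // Prepend the first 4 hex characters of MD5(key) to spread rows evenly.
  static String salted(String key) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(key.getBytes(StandardCharsets.UTF_8));
    String prefix = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
    return prefix + ":" + key;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(salted("user-12345"));
  }
}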

As usual any feedback would be appreciated.

Cheers.


Saad



On Wed, Feb 28, 2018 at 9:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Hi,
>
> We are running on Amazon EMR based HBase 1.4.0 . We are currently seeing a
> situation where sometimes a particular region gets into a situation where a
> lot of write requests to any row in that region timeout saying they failed
> to obtain a lock on a row in a region and eventually they experience an IPC
> timeout. This causes the IPC queue to blow up in size as requests get
> backed up, and that region server experiences a much higher than normal
> timeout rate for all requests, not just those timing out for failing to
> obtain the row lock.
>
> The strange thing is the rows are always different but the region is
> always the same. So the question is, is there a region component to how
> long a row write lock would be held? I looked at the debug dump and the
> RowLocks section shows a long list of write row locks held, all of them are
> from the same region but different rows.
>
> Will trying to obtain a write row lock experience delays if no one else
> holds a lock on the same row but the region itself is experiencing read
> delays? We do have an incremental compaction tool running that major
> compacts one region per region server at a time, so that will drive out
> pages from the bucket cache. But for most regions the impact is
> transitional until the bucket cache gets populated by pages from the new
> HFile. But for this one region we start timing out trying to obtain write
> locks on rows in that region.
>
> Any insight anyone can provide would be most welcome.
>
> Cheers.
>
> 
> Saad
>
>


Re: Bucket Cache Failure In HBase 1.3.1

2018-02-28 Thread Saad Mufti
Thanks, see my other reply. We have a patch from the vendor, but until it
gets promoted to open source we still won't know the real underlying cause.
You're right, though: the cache got disabled due to too many I/O errors in a
short time span.

Cheers.


Saad


On Mon, Feb 26, 2018 at 12:24 AM, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> From the logs, it seems there were some issues with the file that was used
> by the bucket cache. Probably the volume where the file was mounted had
> some issues.
> If you can confirm that, then this issue should be pretty straightforward.
> If not, let us know; we can help.
>
> Regards
> Ram
>
> On Sun, Feb 25, 2018 at 9:40 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Here is related code for disabling bucket cache:
> >
> > if (this.ioErrorStartTime > 0) {
> >
> >   if (cacheEnabled && (now - ioErrorStartTime) > this.
> > ioErrorsTolerationDuration) {
> >
> > LOG.error("IO errors duration time has exceeded " +
> > ioErrorsTolerationDuration +
> >
> >   "ms, disabling cache, please check your IOEngine");
> >
> > disableCache();
> >
> > Can you search in the region server log to see if the above occurred ?
> >
> > Was this server the only one with disabled cache ?
> >
> > Cheers
> >
> > On Sun, Feb 25, 2018 at 6:20 AM, Saad Mufti <saad.mu...@oath.com.invalid
> >
> > wrote:
> >
> > > HI,
> > >
> > > I am running an HBase 1.3.1 cluster on AWS EMR. The bucket cache is
> > > configured to use two attached EBS disks of 50 GB each and I
> provisioned
> > > the bucket cache to be a bit less than the total, at a total of 98 GB
> per
> > > instance to be on the safe side. My tables have column families set to
> > > prefetch on open.
> > >
> > > On some instances during cluster startup, the bucket cache starts
> > throwing
> > > errors, and eventually the bucket cache gets completely disabled on
> this
> > > instance. The instance still stays up as a valid region server and the
> > only
> > > clue in the region server UI is that the bucket cache tab reports a
> count
> > > of 0, and size of 0 bytes.
> > >
> > > I have already opened a ticket with AWS to see if there are problems
> with
> > > the EBS volumes, but wanted to tap the open source community's
> hive-mind
> > to
> > > see what kind of problem would cause the bucket cache to get disabled.
> If
> > > the application depends on the bucket cache for performance, wouldn't
> it
> > be
> > > better to just remove that region server from the pool if its bucket
> > cache
> > > cannot be recovered/enabled?
> > >
> > > The error look like the following. Would appreciate any insight, thank:
> > >
> > > 2018-02-25 01:12:47,780 ERROR [hfile-prefetch-1519513834057]
> > > bucket.BucketCache: Failed reading block
> > > 332b0634287f4c42851bc1a55ffe4042_1348128 from bucket cache
> > > java.nio.channels.ClosedByInterruptException
> > > at
> > > java.nio.channels.spi.AbstractInterruptibleChannel.end(
> > > AbstractInterruptibleChannel.java:202)
> > > at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.
> > > java:746)
> > > at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$
> > > FileReadAccessor.access(FileIOEngine.java:219)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.
> > > accessFile(FileIOEngine.java:170)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.
> > > read(FileIOEngine.java:105)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.
> > > getBlock(BucketCache.java:492)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.
> > > getBlock(CombinedBlockCache.java:84)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.HFileReaderV2.
> > > getCachedBlock(HFileReaderV2.java:279)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(
> > > HFileReaderV2.java:420)
> > > at
> > > org.apache.hadoop.hbase.io.hfile.HFileReaderV2$1.run(
> > > HFileReaderV2.java:209)
> > > at
> > > java.util.concurrent.Executors$RunnableAda

Re: Bucket Cache Failure In HBase 1.3.1

2018-02-28 Thread Saad Mufti
Thanks for the feedback. You guys are right: the bucket cache is getting
disabled due to too many I/O errors from the underlying files making up the
bucket cache. We still do not know the exact underlying cause, but we are
working with our vendor to test a patch they provided that seems to have
resolved the issue for now. They say that if it works out well they will
eventually try to promote the patch to the open source versions.
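
(A hedged sketch: the toleration window behind the ioErrorsTolerationDuration
field in the code quoted below is configurable; the key name here is an
assumption to verify against your build, and it would normally be set in
hbase-site.xml rather than in code.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheErrorToleration {
  public static Configuration regionServerConf() {
    Configuration conf = HBaseConfiguration.create();
    // Default is 60000 ms; a larger window rides out short EBS hiccups before
    // the bucket cache disables itself.
    conf.setInt("hbase.bucketcache.ioengine.errors.tolerated.duration", 300000);
    return conf;
  }
}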

Cheers.


Saad


On Sun, Feb 25, 2018 at 11:10 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Here is related code for disabling bucket cache:
>
> if (this.ioErrorStartTime > 0) {
>
>   if (cacheEnabled && (now - ioErrorStartTime) > this.
> ioErrorsTolerationDuration) {
>
> LOG.error("IO errors duration time has exceeded " +
> ioErrorsTolerationDuration +
>
>   "ms, disabling cache, please check your IOEngine");
>
> disableCache();
>
> Can you search in the region server log to see if the above occurred ?
>
> Was this server the only one with disabled cache ?
>
> Cheers
>
> On Sun, Feb 25, 2018 at 6:20 AM, Saad Mufti <saad.mu...@oath.com.invalid>
> wrote:
>
> > HI,
> >
> > I am running an HBase 1.3.1 cluster on AWS EMR. The bucket cache is
> > configured to use two attached EBS disks of 50 GB each and I provisioned
> > the bucket cache to be a bit less than the total, at a total of 98 GB per
> > instance to be on the safe side. My tables have column families set to
> > prefetch on open.
> >
> > On some instances during cluster startup, the bucket cache starts
> throwing
> > errors, and eventually the bucket cache gets completely disabled on this
> > instance. The instance still stays up as a valid region server and the
> only
> > clue in the region server UI is that the bucket cache tab reports a count
> > of 0, and size of 0 bytes.
> >
> > I have already opened a ticket with AWS to see if there are problems with
> > the EBS volumes, but wanted to tap the open source community's hive-mind
> to
> > see what kind of problem would cause the bucket cache to get disabled. If
> > the application depends on the bucket cache for performance, wouldn't it
> be
> > better to just remove that region server from the pool if its bucket
> cache
> > cannot be recovered/enabled?
> >
> > The error look like the following. Would appreciate any insight, thank:
> >
> > 2018-02-25 01:12:47,780 ERROR [hfile-prefetch-1519513834057]
> > bucket.BucketCache: Failed reading block
> > 332b0634287f4c42851bc1a55ffe4042_1348128 from bucket cache
> > java.nio.channels.ClosedByInterruptException
> > at
> > java.nio.channels.spi.AbstractInterruptibleChannel.end(
> > AbstractInterruptibleChannel.java:202)
> > at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.
> > java:746)
> > at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
> > at
> > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$
> > FileReadAccessor.access(FileIOEngine.java:219)
> > at
> > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.
> > accessFile(FileIOEngine.java:170)
> > at
> > org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.
> > read(FileIOEngine.java:105)
> > at
> > org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.
> > getBlock(BucketCache.java:492)
> > at
> > org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.
> > getBlock(CombinedBlockCache.java:84)
> > at
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2.
> > getCachedBlock(HFileReaderV2.java:279)
> > at
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(
> > HFileReaderV2.java:420)
> > at
> > org.apache.hadoop.hbase.io.hfile.HFileReaderV2$1.run(
> > HFileReaderV2.java:209)
> > at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > java.util.concurrent.ScheduledThreadPoolExecutor$
> > ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> > at
> > java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.run(
> > ScheduledThreadPoolExecutor.java:293)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:624)
> > at java

How Long Will HBase Hold A Row Write Lock?

2018-02-28 Thread Saad Mufti
Hi,

We are running HBase 1.4.0 on Amazon EMR. We are currently seeing a
situation where sometimes a particular region gets into a state where a
lot of write requests to any row in that region time out saying they failed
to obtain a lock on a row in the region, and eventually they experience an
IPC timeout. This causes the IPC queue to blow up in size as requests get
backed up, and that region server experiences a much higher than normal
timeout rate for all requests, not just those timing out for failing to
obtain the row lock.

The strange thing is that the rows are always different but the region is
always the same. So the question is: is there a region component to how long
a row write lock is held? I looked at the debug dump, and the RowLocks
section shows a long list of write row locks held; all of them are from the
same region but for different rows.

Will an attempt to obtain a write row lock experience delays if no one else
holds a lock on the same row but the region itself is experiencing read
delays? We do have an incremental compaction tool running that major-compacts
one region per region server at a time, so that will drive pages out of the
bucket cache. But for most regions the impact is transitory until the bucket
cache gets repopulated with pages from the new HFile. For this one region,
though, we start timing out trying to obtain write locks on rows in that
region.
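
(For illustration, a minimal sketch of the checkAndPut read-modify-write
under discussion, plus the client-side timeouts that bound how long a caller
waits when row locks are slow to acquire; table, family and values are
placeholders.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 10000);               // cap a single RPC at 10s
    conf.setInt("hbase.client.operation.timeout", 30000);  // cap the whole op, retries included

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      byte[] row = Bytes.toBytes("abcd-some-key");
      Put put = new Put(row).addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
          Bytes.toBytes("v2"));
      // The server takes the row's write lock, reads the current value,
      // compares it, and only then applies the Put.
      boolean applied = table.checkAndPut(row, Bytes.toBytes("cf"),
          Bytes.toBytes("q"), Bytes.toBytes("v1"), put);
      System.out.println("applied = " + applied);
    }
  }
}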

Any insight anyone can provide would be most welcome.

Cheers.


Saad


Bucket Cache Failure In HBase 1.3.1

2018-02-25 Thread Saad Mufti
Hi,

I am running an HBase 1.3.1 cluster on AWS EMR. The bucket cache is
configured to use two attached EBS disks of 50 GB each and I provisioned
the bucket cache to be a bit less than the total, at a total of 98 GB per
instance to be on the safe side. My tables have column families set to
prefetch on open.
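
(For reference, a hedged sketch of the file-backed bucket cache settings
being described; the path comes from the logs below, the size value is an
assumption, and these would normally live in hbase-site.xml rather than be
set in code.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheConf {
  public static Configuration regionServerConf() {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.bucketcache.ioengine", "file:/mnt1/hbase/bucketcache"); // file-backed engine
    conf.set("hbase.bucketcache.size", "100352");  // interpreted as MB here, roughly 98 GB
    return conf;
  }
}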

On some instances during cluster startup, the bucket cache starts throwing
errors, and eventually the bucket cache gets completely disabled on this
instance. The instance still stays up as a valid region server and the only
clue in the region server UI is that the bucket cache tab reports a count
of 0, and size of 0 bytes.

I have already opened a ticket with AWS to see if there are problems with
the EBS volumes, but wanted to tap the open source community's hive-mind to
see what kind of problem would cause the bucket cache to get disabled. If
the application depends on the bucket cache for performance, wouldn't it be
better to just remove that region server from the pool if its bucket cache
cannot be recovered/enabled?

The errors look like the following. I would appreciate any insight, thanks:

2018-02-25 01:12:47,780 ERROR [hfile-prefetch-1519513834057]
bucket.BucketCache: Failed reading block
332b0634287f4c42851bc1a55ffe4042_1348128 from bucket cache
java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:746)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
at
org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$FileReadAccessor.access(FileIOEngine.java:219)
at
org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.accessFile(FileIOEngine.java:170)
at
org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.read(FileIOEngine.java:105)
at
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getBlock(BucketCache.java:492)
at
org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.getBlock(CombinedBlockCache.java:84)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.getCachedBlock(HFileReaderV2.java:279)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:420)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$1.run(HFileReaderV2.java:209)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

and

2018-02-25 01:12:52,432 ERROR [regionserver/
ip-xx-xx-xx-xx.xx-xx-xx.us-east-1.ec2.xx.net/xx.xx.xx.xx:16020-BucketCacheWriter-7]
bucket.BucketCache: Failed writing to bucket cache
java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:758)
at
org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine$FileWriteAccessor.access(FileIOEngine.java:227)
at
org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.accessFile(FileIOEngine.java:170)
at
org.apache.hadoop.hbase.io.hfile.bucket.FileIOEngine.write(FileIOEngine.java:116)
at
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$RAMQueueEntry.writeToCache(BucketCache.java:1357)
at
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.doDrain(BucketCache.java:883)
at
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.run(BucketCache.java:838)
at java.lang.Thread.run(Thread.java:748)

and later
2018-02-25 01:13:47,783 INFO  [regionserver/
ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-4]
bucket.BucketCach
e: regionserver/
ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-4
exiting, cacheEnabled=false
2018-02-25 01:13:47,864 WARN  [regionserver/
ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-6]
bucket.FileIOEngi
ne: Failed syncing data to /mnt1/hbase/bucketcache
2018-02-25 01:13:47,864 ERROR [regionserver/
ip-10-194-246-70.aolp-ds-dev.us-east-1.ec2.aolcloud.net/10.194.246.70:16020-BucketCacheWriter-6]
bucket.BucketCach
e: Failed syncing IO engine
java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:379)
at

Re: Trying To Understand BucketCache Evictions In HBase 1.3.1

2018-02-19 Thread Saad Mufti
Thanks, it all makes sense now.

Cheers.


Saad

On Mon, Feb 19, 2018 at 5:40 AM Anoop John <anoop.hb...@gmail.com> wrote:

> Hi
>   Seems you have write ops happening as you mentioned abt
> minor compactions.  When the compaction happens, the compacted file's
> blocks will get evicted.  Whatever be the value of
> 'hbase.rs.evictblocksonclose'.  This config comes to play when the
> Store is closed. Means the region movement is happening or split and
> so a close on stores. Also the table might get disabled or deleted.
> All such store close cases this config comes to picture.  But minor
> compactions means there will be evictions.  These are not via the
> eviction threads which monitor the less spaces and select LRU blocks
> for eviction.  This is done by the compaction threads. That is why you
> can see the evict ops (done by Eviction thread) is zero but the
> #evicted blocks are there.  Those might be the blocks of the compacted
> away files.  Hope this helps you to understand what is going on.
>
> -Anoop-
>
>
> On Mon, Feb 19, 2018 at 5:25 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> > Sorry I meant BLOCKCACHE => 'false' on the one column family we don't
> want
> > getting cached.
> >
> > Cheers.
> >
> > 
> > Saad
> >
> >
> > On Sun, Feb 18, 2018 at 6:51 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> We have an HBase system running HBase 1.3.1 on an AWS EMR service. Our
> >> BucketCache is configured for 400 GB on a set of attached EBS disk
> volumes,
> >> with all column families marked for in-memory in their column family
> >> schemas using INMEMORY => 'true' (except for one column family we only
> ever
> >> write to, so we set BUCKETCACHE => 'false' on that one).
> >>
> >> Even though all column families are marked INMEMORY, we have the
> following
> >> ratios set:
> >>
> >> "hbase.bucketcache.memory.factor":"0.8",
> >>
> >> "hbase.bucketcache.single.factor":"0.1",
> >>
> >>
> >> "hbase.bucketcache.multi.factor":"0.1",
> >>
> >> Currently the bucket cache shows evictions even though it has tons of
> free
> >> space. I am trying to understand why we get any evictions at all? We do
> >> have minor compactions going on, but we have not set
> hbase.rs.evictblocksonclose
> >> to any value and from looking at the code, it defaults to false. The
> total
> >> bucket cache size is nowhere near any of the above limits, in fact on
> some
> >> long running servers where we stopped traffic, the cache size went down
> to
> >> 0. Which makes me think something is evicting blocks from the bucket
> cache
> >> in the background.
> >>
> >> You can see a screenshot from one of the regionserver L2 stats UI pages
> at
> >> https://imgur.com/a/2ZUSv . Another interesting thing to me on this
> page
> >> is that it has non-zero evicted blocks but says Evictions: 0
> >>
> >> Any help understanding this would be appreciated.
> >>
> >> 
> >> Saad
> >>
> >>
>


Trying To Understand BucketCache Evictions In HBase 1.3.1

2018-02-18 Thread Saad Mufti
Hi,

We have an HBase system running HBase 1.3.1 on an AWS EMR service. Our
BucketCache is configured for 400 GB on a set of attached EBS disk volumes,
with all column families marked for in-memory in their column family
schemas using INMEMORY => 'true' (except for one column family we only ever
write to, so we set BUCKETCACHE => 'false' on that one).
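
For reference, those column family flags get set from the HBase shell roughly like this (table and family names are made up; note the shell spells the attributes IN_MEMORY and BLOCKCACHE):

alter 'my_table', {NAME => 'd', IN_MEMORY => 'true'}
alter 'my_table', {NAME => 'write_only_cf', BLOCKCACHE => 'false'}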

Even though all column families are marked INMEMORY, we have the following
ratios set:

"hbase.bucketcache.memory.factor":"0.8",

"hbase.bucketcache.single.factor":"0.1",

"hbase.bucketcache.multi.factor":"0.1",

Currently the bucket cache shows evictions even though it has tons of free
space, and I am trying to understand why we get any evictions at all. We do
have minor compactions going on, but we have not
set hbase.rs.evictblocksonclose to any value and from looking at the code,
it defaults to false. The total bucket cache size is nowhere near any of
the above limits, in fact on some long running servers where we stopped
traffic, the cache size went down to 0. Which makes me think something is
evicting blocks from the bucket cache in the background.

You can see a screenshot from one of the regionserver L2 stats UI pages at
https://imgur.com/a/2ZUSv . Another interesting thing to me on this page is
that it shows non-zero evicted blocks but says Evictions: 0.

Any help understanding this would be appreciated.


Saad


Re: HBase Encryption - HDFS Vs HBase Level

2017-08-18 Thread Saad Mufti
Thank you everyone for the feedback. It was very helpful.

Cheers.

---
Saad Mufti


On Fri, Aug 18, 2017 at 3:20 PM, Andrew Purtell <apurt...@apache.org> wrote:

> The Hadoop KMS in 2.6 or 2.7 can be suitable for demos or prototypes but I
> would advise against using it for more than that. Recently the KMS has seen
> a number of security improvements. Because it is fairly self contained, you
> can check out branch-2.8 or branch-2, build everything, extract the KMS,
> and use that.
>
> For what it is worth at my employer we are considering HDFS at rest
> encryption. We are building our own key management infrastructure,
> incorporating various security and business requirements, and will
> implement to the KMS on-wire API for providing key management services to
> HDFS.
>
>
>
>
> On Fri, Aug 18, 2017 at 10:25 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm looking for some guidance as our security team is requiring us to
> > implement encryption of our HBase data at rest and in motion. I'm reading
> > the docs and doing research and the choice seems to be between doing it
> at
> > the HBase level or the more general HDFS level.
> >
> > I am leaning towards HDFS level as there is some other data that is
> derived
> > from HBase in HDFS and it would be nice to have that encrypted as well.
> > Once set up the encryption is supposed to transparent to clients. We're
> > still at HBase 1.0 level, we're using a Cloudera 5.5 based distribution
> but
> > no commercial license. For reasons I won't go into upgrading is not an
> > option in the short term and we need to implement encryption before that
> >
> > But I have a warning in a google groups somewhere (can't find it anymore)
> > that warns that HDFS level encryption doesn't play well with HBase if on
> > Hadoop 2.6.x, which we're at. Does anyone know the specific issue, or if
> > there is a specific ticket I can look at to see if our Hadoop distro
> > includes that fix?
> >
> > Also, out of the box the Key Management Server included in Hadoop is
> based
> > on a simple file based Java Keystore and there are warnings that it is
> not
> > suitable for production environments. Cloudera has their own proprietary
> > KMS but we don't have a license to it. Can anyone share what groups that
> > use pure open source distros are using as their KMS when implementing
> > encryption in production environments?
> >
> > Thanks in advance for any guidance you can provide.
> >
> > 
> > Saad
> >
>
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>- A23, Crosstalk
>


HBase Encryption - HDFS Vs HBase Level

2017-08-18 Thread Saad Mufti
Hi,

I'm looking for some guidance as our security team is requiring us to
implement encryption of our HBase data at rest and in motion. I'm reading
the docs and doing research and the choice seems to be between doing it at
the HBase level or the more general HDFS level.
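
For the HBase-level option, my reading of the docs is that the server side needs roughly the following (a sketch with a made-up keystore path, not something we have tested):

"hbase.crypto.keyprovider":"org.apache.hadoop.hbase.io.crypto.KeyStoreKeyProvider",
"hbase.crypto.keyprovider.parameters":"jceks:///etc/hbase/conf/hbase.jks?password=****",
"hbase.crypto.master.key.name":"hbase",
"hfile.format.version":"3",
"hbase.regionserver.hlog.reader.impl":"org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogReader",
"hbase.regionserver.hlog.writer.impl":"org.apache.hadoop.hbase.regionserver.wal.SecureProtobufLogWriter",
"hbase.regionserver.wal.encryption":"true",

plus ENCRYPTION => 'AES' on each column family that should be encrypted.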

I am leaning towards HDFS level as there is some other data that is derived
from HBase in HDFS and it would be nice to have that encrypted as well.
Once set up, the encryption is supposed to be transparent to clients. We're
still at HBase 1.0 level, on a Cloudera 5.5 based distribution but with no
commercial license. For reasons I won't go into, upgrading is not an option
in the short term, and we need to implement encryption before then.
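
For the HDFS-level route, my understanding is that it boils down to a key in the KMS plus an encryption zone over the HBase root directory, created roughly like this (a sketch; the key name is made up, and the zone has to be created on an empty directory before the HBase data is moved in):

hadoop key create hbase-key -size 128
hdfs crypto -createZone -keyName hbase-key -path /hbase
hdfs crypto -listZones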

But I came across a warning in a Google Groups thread somewhere (can't find
it anymore) that HDFS level encryption doesn't play well with HBase on
Hadoop 2.6.x, which is what we're on. Does anyone know the specific issue, or if
there is a specific ticket I can look at to see if our Hadoop distro
includes that fix?

Also, out of the box the Key Management Server included in Hadoop is based
on a simple file based Java Keystore and there are warnings that it is not
suitable for production environments. Cloudera has their own proprietary
KMS but we don't have a license to it. Can anyone share what groups that
use pure open source distros are using as their KMS when implementing
encryption in production environments?

Thanks in advance for any guidance you can provide.


Saad


Re: HBase 1.0 Per Put TTL Not Being Obeyed On Replication

2017-05-01 Thread Saad Mufti
Thx. Will try and see what I can find.
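
Presumably something along these lines (a rough sketch; per Anoop's note below the hbase.client.rpc.codec setting is needed on both the region servers and the client, the Tag API here is the 1.x flavor, and the table/row/family names are placeholders):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.Tag;
import org.apache.hadoop.hbase.TagType;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckTtlTags {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Without this codec (on both client and RS) tags get stripped before
    // they reach client code.
    conf.set("hbase.client.rpc.codec",
        "org.apache.hadoop.hbase.codec.KeyValueCodecWithTags");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      Result r = table.get(new Get(Bytes.toBytes("some-row-key")));
      for (Cell cell : r.rawCells()) {
        List<Tag> tags = Tag.asList(
            cell.getTagsArray(), cell.getTagsOffset(), cell.getTagsLength());
        for (Tag tag : tags) {
          if (tag.getType() == TagType.TTL_TAG_TYPE) {
            System.out.println("cell " + cell + " carries a TTL tag");
          }
        }
      }
    }
  }
}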


Saad

On Mon, May 1, 2017 at 5:41 AM Anoop John <anoop.hb...@gmail.com> wrote:

> At server side (RS) as well as at client side, put the config
> "hbase.client.rpc.codec" with a value
> org.apache.hadoop.hbase.codec.KeyValueCodecWithTags. Then u will
> be able to retrieve the tags back to client side and check
>
> -Anoop-
>
> On Mon, May 1, 2017 at 2:59 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> > Is there any facility to check what tags are on a Cell from a client side
> > program? I started writing some Java code to look at the tags on a Cell
> > retrieved via a simple Get, but then started reading around and it seems
> > tags are not even returned (not returned at all or only in certain cases,
> > I'm not clear) to client side code. So how can I verify that a Cell in
> one
> > cluster has the TTL tag whereas the same replicated C3ell in the next
> > cluster does or doesn't?
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> > On Fri, Apr 28, 2017 at 1:06 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >
> >> Thanks for the feedback, I have confirmed that in both the main and
> >> replica cluster, hbase.replication.rpc.codec is set to:
> >>
> >> org.apache.hadoop.hbase.codec.KeyValueCodecWithTags
> >>
> >> I have also run a couple of tests and it looks like the TTL is not being
> >> obeyed on the replica for any entry. Almost as if the TTL cell tags are
> not
> >> being replicated. I couldn't find any significant clock skew. If it
> >> matters, the HBase version on both sides is 1.0.0-cdh5.5.2
> >>
> >> Any ideas?
> >>
> >> Thanks.
> >>
> >> 
> >> Saad
> >>
> >>
> >> On Thu, Apr 27, 2017 at 3:24 AM, Anoop John <anoop.hb...@gmail.com>
> wrote:
> >>
> >>> Ya can u check whether the replica cluster is NOT removing ANY of the
> >>> TTL expired cells (as per ur expectation from master cluster) or some.
> >>> Is there too much clock time skew for the source RS and peer cluster
> >>> RS? Just check.
> >>>
> >>> BTW can u see what is the hbase.replication.rpc.codec configuration
> >>> value in both clusters?
> >>>
> >>> -Anoop-
> >>>
> >>> On Thu, Apr 27, 2017 at 2:08 AM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >>> > Hi,
> >>> >
> >>> > I have a main HBase 1.x cluster and some of the tables are being
> >>> replicated
> >>> > to a separate HBase cluster of the same version, and the table
> schemas
> >>> are
> >>> > identical. The column family being used has TTL set to "FOREVER", but
> >>> we do
> >>> > a per put TTL in every Put we issue on the main cluster.
> >>> >
> >>> > Data is being replicated but we recently caught a number of data
> items
> >>> that
> >>> > have disappeared in the main cluster as expected based on their TTL
> but
> >>> not
> >>> > in the replica. Both HBase clusters have hfile.format.version set to
> 3
> >>> so
> >>> > TTL tags should be obeyed. I haven't checked yet whether it is a
> case of
> >>> > the replica not obeying ANY TTL's or just some.
> >>> >
> >>> > Before we dig deeper, I was hoping someone in the community would
> point
> >>> it
> >>> > out if we have missed any obvious gotchas.
> >>> >
> >>> > Thanks.
> >>> >
> >>> > ---
> >>> > Saad
> >>>
> >>
> >>
>


Re: HBase 1.0 Per Put TTL Not Being Obeyed On Replication

2017-04-30 Thread Saad Mufti
Is there any facility to check what tags are on a Cell from a client side
program? I started writing some Java code to look at the tags on a Cell
retrieved via a simple Get, but then started reading around and it seems
tags are not even returned (not returned at all or only in certain cases,
I'm not clear) to client side code. So how can I verify that a Cell in one
cluster has the TTL tag whereas the same replicated Cell in the next
cluster does or doesn't?

Thanks.


Saad


On Fri, Apr 28, 2017 at 1:06 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Thanks for the feedback, I have confirmed that in both the main and
> replica cluster, hbase.replication.rpc.codec is set to:
>
> org.apache.hadoop.hbase.codec.KeyValueCodecWithTags
>
> I have also run a couple of tests and it looks like the TTL is not being
> obeyed on the replica for any entry. Almost as if the TTL cell tags are not
> being replicated. I couldn't find any significant clock skew. If it
> matters, the HBase version on both sides is 1.0.0-cdh5.5.2
>
> Any ideas?
>
> Thanks.
>
> 
> Saad
>
>
> On Thu, Apr 27, 2017 at 3:24 AM, Anoop John <anoop.hb...@gmail.com> wrote:
>
>> Ya can u check whether the replica cluster is NOT removing ANY of the
>> TTL expired cells (as per ur expectation from master cluster) or some.
>> Is there too much clock time skew for the source RS and peer cluster
>> RS? Just check.
>>
>> BTW can u see what is the hbase.replication.rpc.codec configuration
>> value in both clusters?
>>
>> -Anoop-
>>
>> On Thu, Apr 27, 2017 at 2:08 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> > Hi,
>> >
>> > I have a main HBase 1.x cluster and some of the tables are being
>> replicated
>> > to a separate HBase cluster of the same version, and the table schemas
>> are
>> > identical. The column family being used has TTL set to "FOREVER", but
>> we do
>> > a per put TTL in every Put we issue on the main cluster.
>> >
>> > Data is being replicated but we recently caught a number of data items
>> that
>> > have disappeared in the main cluster as expected based on their TTL but
>> not
>> > in the replica. Both HBase clusters have hfile.format.version set to 3
>> so
>> > TTL tags should be obeyed. I haven't checked yet whether it is a case of
>> > the replica not obeying ANY TTL's or just some.
>> >
>> > Before we dig deeper, I was hoping someone in the community would point
>> it
>> > out if we have missed any obvious gotchas.
>> >
>> > Thanks.
>> >
>> > ---
>> > Saad
>>
>
>


Re: HBase 1.0 Per Put TTL Not Being Obeyed On Replication

2017-04-28 Thread Saad Mufti
Thanks for the feedback. I have confirmed that in both the main and replica
cluster, hbase.replication.rpc.codec is set to:

org.apache.hadoop.hbase.codec.KeyValueCodecWithTags

I have also run a couple of tests and it looks like the TTL is not being
obeyed on the replica for any entry. Almost as if the TTL cell tags are not
being replicated. I couldn't find any significant clock skew. If it
matters, the HBase version on both sides is 1.0.0-cdh5.5.2.

Any ideas?

Thanks.


Saad


On Thu, Apr 27, 2017 at 3:24 AM, Anoop John <anoop.hb...@gmail.com> wrote:

> Ya can u check whether the replica cluster is NOT removing ANY of the
> TTL expired cells (as per ur expectation from master cluster) or some.
> Is there too much clock time skew for the source RS and peer cluster
> RS? Just check.
>
> BTW can u see what is the hbase.replication.rpc.codec configuration
> value in both clusters?
>
> -Anoop-
>
> On Thu, Apr 27, 2017 at 2:08 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> > Hi,
> >
> > I have a main HBase 1.x cluster and some of the tables are being
> replicated
> > to a separate HBase cluster of the same version, and the table schemas
> are
> > identical. The column family being used has TTL set to "FOREVER", but we
> do
> > a per put TTL in every Put we issue on the main cluster.
> >
> > Data is being replicated but we recently caught a number of data items
> that
> > have disappeared in the main cluster as expected based on their TTL but
> not
> > in the replica. Both HBase clusters have hfile.format.version set to 3 so
> > TTL tags should be obeyed. I haven't checked yet whether it is a case of
> > the replica not obeying ANY TTL's or just some.
> >
> > Before we dig deeper, I was hoping someone in the community would point
> it
> > out if we have missed any obvious gotchas.
> >
> > Thanks.
> >
> > ---
> > Saad
>


HBase 1.0 Per Put TTL Not Being Obeyed On Replication

2017-04-26 Thread Saad Mufti
Hi,

I have a main HBase 1.x cluster and some of the tables are being replicated
to a separate HBase cluster of the same version, and the table schemas are
identical. The column family being used has TTL set to "FOREVER", but we set
a per-Put TTL on every Put we issue on the main cluster.
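
The writes look roughly like the following (a simplified sketch; names are placeholders, and Mutation#setTTL takes milliseconds):

// Simplified sketch; assumes an open Table "table" on the main cluster.
Put put = new Put(Bytes.toBytes("some-row-key"));
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("some-value"));
// Per-cell TTL in milliseconds (roughly 30 days here); with
// hfile.format.version 3 this is carried as a TTL tag on the cell.
put.setTTL(30L * 24 * 60 * 60 * 1000);
table.put(put);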

Data is being replicated but we recently caught a number of data items that
have disappeared in the main cluster as expected based on their TTL but not
in the replica. Both HBase clusters have hfile.format.version set to 3 so
TTL tags should be obeyed. I haven't checked yet whether it is a case of
the replica not obeying ANY TTL's or just some.

Before we dig deeper, I was hoping someone in the community would point it
out if we have missed any obvious gotchas.

Thanks.

---
Saad


Re: Region Server Hotspot/CPU Problem

2017-03-01 Thread Saad Mufti
Someone in our team found this:

http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101

Looks like we're bitten by this bug. Unfortunately this is only fixed in
HBase 1.4.0, so we'll have to undertake a version upgrade, which is not
trivial.

-
Saad


On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni <sbpothin...@gmail.com
> wrote:

> First obvious thing to check is "major compaction" happening at the same
> time when it goes to 100% CPU?
> See this helps:
> https://community.hortonworks.com/articles/52616/hbase-
> compaction-tuning-tips.html
>
>
>
> Sent from my iPhone
>
> > On Mar 1, 2017, at 6:06 AM, Saad Mufti <saad.mu...@teamaol.com> wrote:
> >
> > Hi,
> >
> > We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase
> > is heavy and a mix of reads and writes. For a few months we have had a
> > problem where occasionally (once a day or more) one of the region servers
> > starts consuming close to 100% CPU. This causes all the client thread
> pool
> > to get filled up serving the slow region server, causing overall response
> > times to slow to a crawl and many calls either start timing out right in
> > the client, or at a higher level.
> >
> > We have done lots of analysis and looked at various metrics but could
> never
> > pin it down to any particular kind of traffic or specific "hot keys".
> > Looking at region server logs has not resulted in any findings. The only
> > sort of vague evidence we have is that from the reported metrics, reads
> per
> > second on the hot server looks more than the other but not in a steady
> > state but in a spiky but steady fashion, but gets per second looks no
> > different than any other server.
> >
> > Until now our hacky way that we discovered to get around this was to just
> > restart the region server. This works because while some calls error out
> > while the regions are in transition, this is a batch oriented system
> with a
> > retry strategy built in.
> >
> > But just yesterday we discovered something interesting, if we connect to
> > the region server in VisualVM and press the "Perform GC" button, there
> > seems to be a brief pause and then CPU settles down back to normal. This
> is
> > despite the fact that memory appears to be under no pressure and before
> we
> > do this, VisualVM indicates very low percentage of CPU time spent in GC,
> so
> > we're baffled, and hoping someone with deeper insight into the HBase code
> > could explain this behavior.
> >
> > Our region server processes are configured with 32GB of RAM and the
> > following GC related JVM settings :
> >
> > HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
> > -XX:MaxGCPauseMillis=100
> > -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14
> > -XX:InitiatingHeapOccupancyPercent=70
> >
> > Any insight anyone can provide would be most appreciated.
> >
> > 
> > Saad
>


Re: Hot Region Server With No Hot Region

2016-12-13 Thread Saad Mufti
Thanks everyone for the feedback. We tracked this down to a bad design
using dynamic columns: there were a few (very few) rows that had accumulated
up to 200,000 dynamic columns. Any activity that caused us to read one of
these rows resulted in a hot region server.

Follow-up question: we are now in the process of cleaning up those rows as
they are identified, but some are so big that trying to read them in the
cleanup process kills it with out of memory exceptions. Is there any way to
identify rows with too many columns without actually reading them all?
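
The best idea we have come up with so far is a key-only scan in chunks, roughly like the sketch below: KeyOnlyFilter strips the values server side, and setBatch keeps any single wide row from coming back as one huge Result (table/family names and the reporting threshold are made up):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FindWideRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      Scan scan = new Scan();
      scan.setFilter(new KeyOnlyFilter()); // ship keys only, no cell values
      scan.setBatch(10000);                // wide rows come back in 10k-cell chunks
      scan.setCaching(100);
      byte[] currentRow = null;
      long cells = 0;
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          if (currentRow == null || !Arrays.equals(currentRow, r.getRow())) {
            report(currentRow, cells);
            currentRow = r.getRow();
            cells = 0;
          }
          cells += r.rawCells().length;
        }
        report(currentRow, cells);
      }
    }
  }

  private static void report(byte[] row, long cells) {
    // Arbitrary threshold for "too many columns".
    if (row != null && cells > 100000) {
      System.out.println(Bytes.toStringBinary(row) + " has " + cells + " cells");
    }
  }
}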

Thanks.


Saad


On Sat, Dec 3, 2016 at 3:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> I took a look at the stack trace.
>
> Region server log would give us more detail on the frequency and duration
> of compactions.
>
> Cheers
>
> On Sat, Dec 3, 2016 at 7:39 AM, Jeremy Carroll <phobos...@gmail.com>
> wrote:
>
> > I would check compaction, investigate throttling if it's causing high
> CPU.
> >
> > On Sat, Dec 3, 2016 at 6:20 AM Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > > No.
> > >
> > > 
> > > Saad
> > >
> > >
> > > On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu <ted...@yahoo.com.invalid>
> wrote:
> > >
> > > > Some how I couldn't access the pastebin (I am in China now).
> > > > Did the region server showing hotspot host meta ?
> > > > Thanks
> > > >
> > > > On Friday, December 2, 2016 11:53 AM, Saad Mufti <
> > > saad.mu...@gmail.com>
> > > > wrote:
> > > >
> > > >
> > > >  We're in AWS with D2.4xLarge instances. Each instance has 12
> > independent
> > > > spindles/disks from what I can tell.
> > > >
> > > > We have charted get_rate and mutate_rate by host and
> > > >
> > > > a) mutate_rate shows no real outliers
> > > > b) read_rate shows the overall rate on the "hotspot" region server
> is a
> > > bit
> > > > higher than every other server, not severely but enough that it is a
> > bit
> > > > noticeable. But when we chart get_rate on that server by region, no
> one
> > > > region stands out.
> > > >
> > > > get_rate chart by host:
> > > >
> > > > https://snag.gy/hmoiDw.jpg
> > > >
> > > > mutate_rate chart by host:
> > > >
> > > > https://snag.gy/jitdMN.jpg
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > On Fri, Dec 2, 2016 at 2:34 PM, John Leach <jle...@splicemachine.com
> >
> > > > wrote:
> > > >
> > > > > Here is what I see...
> > > > >
> > > > >
> > > > > * Short Compaction Running on Heap
> > > > > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > > > > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" -
> > > > Thread
> > > > > t@242
> > > > >java.lang.Thread.State: RUNNABLE
> > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > internalEncode(FastDiffDeltaEncoder.java:245)
> > > > >at org.apache.hadoop.hbase.io.encoding.
> BufferedDataBlockEncoder.
> > > > > encode(BufferedDataBlockEncoder.java:987)
> > > > >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > > > > encode(FastDiffDeltaEncoder.java:58)
> > > > >at org.apache.hadoop.hbase.io
> > > .hfile.HFileDataBlockEncoderImpl.encode(
> > > > > HFileDataBlockEncoderImpl.java:97)
> > > > >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > > > > HFileBlock.java:866)
> > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > > > > HFileWriterV2.java:270)
> > > > >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > > > > HFileWriterV3.java:87)
> > > > >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > > > > append(StoreFile.java:949)
> > > > >at org.apache.hadoop.hbase.regionserver.compactions.
> > > > > Compactor

Re: Hot Region Server With No Hot Region

2016-12-03 Thread Saad Mufti
No.


Saad


On Fri, Dec 2, 2016 at 3:27 PM, Ted Yu <ted...@yahoo.com.invalid> wrote:

> Some how I couldn't access the pastebin (I am in China now).
> Did the region server showing hotspot host meta ?
> Thanks
>
> On Friday, December 2, 2016 11:53 AM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
>
>
>  We're in AWS with D2.4xLarge instances. Each instance has 12 independent
> spindles/disks from what I can tell.
>
> We have charted get_rate and mutate_rate by host and
>
> a) mutate_rate shows no real outliers
> b) read_rate shows the overall rate on the "hotspot" region server is a bit
> higher than every other server, not severely but enough that it is a bit
> noticeable. But when we chart get_rate on that server by region, no one
> region stands out.
>
> get_rate chart by host:
>
> https://snag.gy/hmoiDw.jpg
>
> mutate_rate chart by host:
>
> https://snag.gy/jitdMN.jpg
>
> 
> Saad
>
>
> 
> Saad
>
>
> On Fri, Dec 2, 2016 at 2:34 PM, John Leach <jle...@splicemachine.com>
> wrote:
>
> > Here is what I see...
> >
> >
> > * Short Compaction Running on Heap
> > "regionserver/ip-10-99-181-146.aolp-prd.us-east-1.ec2.
> > aolcloud.net/10.99.181.146:60020-shortCompactions-1480229281547" -
> Thread
> > t@242
> >java.lang.Thread.State: RUNNABLE
> >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > compressSingleKeyValue(FastDiffDeltaEncoder.java:270)
> >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > internalEncode(FastDiffDeltaEncoder.java:245)
> >at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder.
> > encode(BufferedDataBlockEncoder.java:987)
> >at org.apache.hadoop.hbase.io.encoding.FastDiffDeltaEncoder.
> > encode(FastDiffDeltaEncoder.java:58)
> >at org.apache.hadoop.hbase.io.hfile.HFileDataBlockEncoderImpl.encode(
> > HFileDataBlockEncoderImpl.java:97)
> >at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(
> > HFileBlock.java:866)
> >at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(
> > HFileWriterV2.java:270)
> >at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(
> > HFileWriterV3.java:87)
> >at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.
> > append(StoreFile.java:949)
> >at org.apache.hadoop.hbase.regionserver.compactions.
> > Compactor.performCompaction(Compactor.java:282)
> >at org.apache.hadoop.hbase.regionserver.compactions.
> > DefaultCompactor.compact(DefaultCompactor.java:105)
> >at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$
> > DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
> >at org.apache.hadoop.hbase.regionserver.HStore.compact(
> > HStore.java:1233)
> >at org.apache.hadoop.hbase.regionserver.HRegion.compact(
> > HRegion.java:1770)
> >at org.apache.hadoop.hbase.regionserver.CompactSplitThread$
> > CompactionRunner.run(CompactSplitThread.java:520)
> >at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1142)
> >at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:617)
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > * WAL Syncs waiting…  ALL 5
> > "sync.0" - Thread t@202
> >java.lang.Thread.State: TIMED_WAITING
> >at java.lang.Object.wait(Native Method)
> >- waiting on <67ba892d> (a java.util.LinkedList)
> >at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(
> > DFSOutputStream.java:2337)
> >at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(
> > DFSOutputStream.java:2224)
> >at org.apache.hadoop.hdfs.DFSOutputStream.hflush(
> > DFSOutputStream.java:2116)
> >at org.apache.hadoop.fs.FSDataOutputStream.hflush(
> > FSDataOutputStream.java:130)
> >at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(
> > ProtobufLogWriter.java:173)
> >at org.apache.hadoop.hbase.regionserver.wal.FSHLog$
> > SyncRunner.run(FSHLog.java:1379)
> >at java.lang.Thread.run(Thread.java:745)
> >
> > * Mutations backing up very badly...
> >
> > "B.defaultRpcServer.handler=103,queue=7,port=60020" - Thread t@155
> >java.lang.Thread.State: TIMED_WAITING
> >at java.lang.Object.wait(Native Method)
> >- waiting on <6ab54ea3> (a org.apache.hadoop.hbase.
> > regionserver.wal.SyncFuture)
> >at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.
> > get(SyncFuture.java:167)
> >   

Re: Hot Region Server With No Hot Region

2016-12-02 Thread Saad Mufti
:6325)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> mutateRows(RSRpcServices.java:418)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.
> multi(RSRpcServices.java:1916)
> at org.apache.hadoop.hbase.protobuf.generated.
> ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> RpcExecutor.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Too many writers being blocked attempting to write to WAL.
>
> What does your disk infrastructure look like?  Can you get away with
> Multi-wal?  Ugh...
>
> Regards,
> John Leach
>
>
> > On Dec 2, 2016, at 1:20 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > Hi Ted,
> >
> > Finally we have another hotspot going on, same symptoms as before, here
> is
> > the pastebin for the stack trace from the region server that I obtained
> via
> > VisualVM:
> >
> > http://pastebin.com/qbXPPrXk
> >
> > Would really appreciate any insight you or anyone else can provide.
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> > On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> >> Sure will, the next time it happens.
> >>
> >> Thanks!!!
> >>
> >> 
> >> Saad
> >>
> >>
> >> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu <ted...@yahoo.com.invalid>
> wrote:
> >>
> >>> From #2 in the initial email, the hbase:meta might not be the cause for
> >>> the hotspot.
> >>>
> >>> Saad:
> >>> Can you pastebin stack trace of the hot region server when this happens
> >>> again ?
> >>>
> >>> Thanks
> >>>
> >>>> On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>>
> >>>> We used a pre-split into 1024 regions at the start but we
> miscalculated
> >>> our
> >>>> data size, so there were still auto-splits storms at the beginning as
> >>> data
> >>>> size stabilized, it has ended up at around 9500 or so regions, plus a
> >>> few
> >>>> thousand regions for a few other tables (much smaller). But haven't
> had
> >>> any
> >>>> new auto-splits in a couple of months. And the hotspots only started
> >>>> happening recently.
> >>>>
> >>>> Our hashing scheme is very simple, we take the MD5 of the key, then
> >>> form a
> >>>> 4 digit prefix based on the first two bytes of the MD5 normalized to
> be
> >>>> within the range 0-1023 . I am fairly confident about this scheme
> >>>> especially since even during the hotspot we see no evidence so far
> that
> >>> any
> >>>> particular region is taking disproportionate traffic (based on
> Cloudera
> >>>> Manager per region charts on the hotspot server). Does that look like
> a
> >>>> reasonable scheme to randomize which region any give key goes to? And
> >>> the
> >>>> start of the hotspot doesn't seem to correspond to any region
> splitting
> >>> or
> >>>> moving from one server to another activity.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> 
> >>>> Saad
> >>>>
> >>>>
> >>>>> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com
> >
> >>> wrote:
> >>>>>
> >>>>> Saad,
> >>>>>
> >>>>> Region move or split causes client connections to simultaneously
> >>> refresh
> >>>>> their meta.
> >>>>>
> >>>>> Key word is supposed.  We have seen meta hot spotting from time to
> time
> >>>>> and on different versions at Splice Machine.
> >>>>>
> >>>>> How confident are you in your hashing algorithm?
> >>>>>
> >>>>> Regards,
> >>>>> John Leach
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com>
> wrote:
> >>>>>>

Re: Hot Region Server With No Hot Region

2016-12-02 Thread Saad Mufti
Hi Ted,

Finally we have another hotspot going on, same symptoms as before, here is
the pastebin for the stack trace from the region server that I obtained via
VisualVM:

http://pastebin.com/qbXPPrXk

Would really appreciate any insight you or anyone else can provide.

Thanks.


Saad


On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Sure will, the next time it happens.
>
> Thanks!!!
>
> 
> Saad
>
>
> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu <ted...@yahoo.com.invalid> wrote:
>
>> From #2 in the initial email, the hbase:meta might not be the cause for
>> the hotspot.
>>
>> Saad:
>> Can you pastebin stack trace of the hot region server when this happens
>> again ?
>>
>> Thanks
>>
>> > On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >
>> > We used a pre-split into 1024 regions at the start but we miscalculated
>> our
>> > data size, so there were still auto-splits storms at the beginning as
>> data
>> > size stabilized, it has ended up at around 9500 or so regions, plus a
>> few
>> > thousand regions for a few other tables (much smaller). But haven't had
>> any
>> > new auto-splits in a couple of months. And the hotspots only started
>> > happening recently.
>> >
>> > Our hashing scheme is very simple, we take the MD5 of the key, then
>> form a
>> > 4 digit prefix based on the first two bytes of the MD5 normalized to be
>> > within the range 0-1023 . I am fairly confident about this scheme
>> > especially since even during the hotspot we see no evidence so far that
>> any
>> > particular region is taking disproportionate traffic (based on Cloudera
>> > Manager per region charts on the hotspot server). Does that look like a
>> > reasonable scheme to randomize which region any give key goes to? And
>> the
>> > start of the hotspot doesn't seem to correspond to any region splitting
>> or
>> > moving from one server to another activity.
>> >
>> > Thanks.
>> >
>> > 
>> > Saad
>> >
>> >
>> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com>
>> wrote:
>> >>
>> >> Saad,
>> >>
>> >> Region move or split causes client connections to simultaneously
>> refresh
>> >> their meta.
>> >>
>> >> Key word is supposed.  We have seen meta hot spotting from time to time
>> >> and on different versions at Splice Machine.
>> >>
>> >> How confident are you in your hashing algorithm?
>> >>
>> >> Regards,
>> >> John Leach
>> >>
>> >>
>> >>
>> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >>>
>> >>> No never thought about that. I just figured out how to locate the
>> server
>> >>> for that table after you mentioned it. We'll have to keep an eye on it
>> >> next
>> >>> time we have a hotspot to see if it coincides with the hotspot server.
>> >>>
>> >>> What would be the theory for how it could become a hotspot? Isn't the
>> >>> client supposed to cache it and only go back for a refresh if it hits
>> a
>> >>> region that is not in its expected location?
>> >>>
>> >>> 
>> >>> Saad
>> >>>
>> >>>
>> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com>
>> >> wrote:
>> >>>
>> >>>> Saad,
>> >>>>
>> >>>> Did you validate that Meta is not on the “Hot” region server?
>> >>>>
>> >>>> Regards,
>> >>>> John Leach
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to
>> avoid
>> >>>>> hotspotting due to inadvertent data patterns by prepending an MD5
>> >> based 4
>> >>>>> digit hash prefix to all our data keys. This works fine most of the
>> >>>> times,
>> >>>>> but more and more (as much as once or twice a day) recently 

Re: Creating HBase table with presplits

2016-12-02 Thread Saad Mufti
Forgot to mention: in the above example you would presplit into 1024
regions, with start keys from "0000" to "1023".

Cheers.


Saad


On Fri, Dec 2, 2016 at 8:47 AM, Saad Mufti <saad.mu...@gmail.com> wrote:

> One way to do this without knowing your data (still need some idea of size
> of keyspace) is to prepend a fixed numeric prefix from a suitable range
> based on a good hash like MD5. For example, let us say you can predict your
> data will fit in about 1024 regions. You can decide to prepend a prefix
> from  to 1024 to all you keys based on a suitable hash.
>
> The pros:
>
> 1. you get to pre-split without knowing your keyspace
> 2. very hard if not impossible for unknown data providers to send you data
> in some order that generates hotspots (unless of course the same key is
> repeated over and over, still have to watch out for that)
>
> The cons:
>
> 1. lose the ability to do scan in "natural" sorted order of your keyspace
> as that order is not preserved anymore in HBase
> 2. if you miscalculate your keyspace size by a lot, you are stuck with the
> hash function and range you selected even if you later get more regions
> unless you're willing to do complete migration to a new table
>
> Hope above helps.
>
> 
> Saad
>
>
> On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain <sachinjain...@gmail.com>
> wrote:
>
>> Thanks Dave for your suggestions!
>> Will let you know if I find some approach to tackle this situation.
>>
>> Regards
>>
>> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham <lat...@davelink.net> wrote:
>>
>> > If you truly have no way to predict anything about the distribution of
>> your
>> > data across the row key space, then you are correct that there is no
>> way to
>> > presplit your regions in an effective way.  Either you need to make some
>> > starting guess, such as a small number of uniform splits, or wait until
>> you
>> > have some information about what the data will look like.
>> >
>> > Dave
>> >
>> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain <sachinjain...@gmail.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > I was going though pre-splitting a table article [0] and it is
>> mentioned
>> > > that it is generally best practice to presplit your table. But don't
>> we
>> > > need to know the data in advance in order to presplit it.
>> > >
>> > > Question: What should be the best practice when we don't know what
>> data
>> > is
>> > > going to be inserted into HBase. Essentially I don't know the key
>> range
>> > so
>> > > if I specify wrong splits, then either first or last split can be a
>> hot
>> > > region in my system.
>> > >
>> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
>> > >
>> > > Thanks
>> > > -Sachin
>> > >
>> >
>>
>
>


Re: Creating HBase table with presplits

2016-12-02 Thread Saad Mufti
One way to do this without knowing your data (still need some idea of size
of keyspace) is to prepend a fixed numeric prefix from a suitable range
based on a good hash like MD5. For example, let us say you can predict your
data will fit in about 1024 regions. You can decide to prepend a prefix
from "0000" to "1023" to all your keys based on a suitable hash.

The pros:

1. you get to pre-split without knowing your keyspace
2. very hard if not impossible for unknown data providers to send you data
in some order that generates hotspots (unless of course the same key is
repeated over and over, still have to watch out for that)

The cons:

1. you lose the ability to scan in the "natural" sorted order of your
keyspace, as that order is no longer preserved in HBase
2. if you miscalculate your keyspace size by a lot, you are stuck with the
hash function and range you selected even if you later get more regions,
unless you're willing to do a complete migration to a new table

Hope above helps.
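
If it helps, creating a table with those 1024 splits looks roughly like this with the Java Admin API (a sketch; table and family names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // 1023 split points give 1024 regions:
      // (start, "0001"), ["0001", "0002"), ..., ["1023", end)
      byte[][] splits = new byte[1023][];
      for (int i = 1; i <= 1023; i++) {
        splits[i - 1] = Bytes.toBytes(String.format("%04d", i));
      }
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table"));
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc, splits);
    }
  }
}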


Saad


On Tue, Nov 29, 2016 at 4:28 AM, Sachin Jain 
wrote:

> Thanks Dave for your suggestions!
> Will let you know if I find some approach to tackle this situation.
>
> Regards
>
> On Mon, Nov 28, 2016 at 9:05 PM, Dave Latham  wrote:
>
> > If you truly have no way to predict anything about the distribution of
> your
> > data across the row key space, then you are correct that there is no way
> to
> > presplit your regions in an effective way.  Either you need to make some
> > starting guess, such as a small number of uniform splits, or wait until
> you
> > have some information about what the data will look like.
> >
> > Dave
> >
> > On Mon, Nov 28, 2016 at 12:42 AM, Sachin Jain 
> > wrote:
> >
> > > Hi,
> > >
> > > I was going though pre-splitting a table article [0] and it is
> mentioned
> > > that it is generally best practice to presplit your table. But don't we
> > > need to know the data in advance in order to presplit it.
> > >
> > > Question: What should be the best practice when we don't know what data
> > is
> > > going to be inserted into HBase. Essentially I don't know the key range
> > so
> > > if I specify wrong splits, then either first or last split can be a hot
> > > region in my system.
> > >
> > > [0]: https://hbase.apache.org/book.html#rowkey.regionsplits
> > >
> > > Thanks
> > > -Sachin
> > >
> >
>


Re: Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
Sure will, the next time it happens.

Thanks!!!


Saad


On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu <ted...@yahoo.com.invalid> wrote:

> From #2 in the initial email, the hbase:meta might not be the cause for
> the hotspot.
>
> Saad:
> Can you pastebin stack trace of the hot region server when this happens
> again ?
>
> Thanks
>
> > On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > We used a pre-split into 1024 regions at the start but we miscalculated
> our
> > data size, so there were still auto-splits storms at the beginning as
> data
> > size stabilized, it has ended up at around 9500 or so regions, plus a few
> > thousand regions for a few other tables (much smaller). But haven't had
> any
> > new auto-splits in a couple of months. And the hotspots only started
> > happening recently.
> >
> > Our hashing scheme is very simple, we take the MD5 of the key, then form
> a
> > 4 digit prefix based on the first two bytes of the MD5 normalized to be
> > within the range 0-1023 . I am fairly confident about this scheme
> > especially since even during the hotspot we see no evidence so far that
> any
> > particular region is taking disproportionate traffic (based on Cloudera
> > Manager per region charts on the hotspot server). Does that look like a
> > reasonable scheme to randomize which region any give key goes to? And the
> > start of the hotspot doesn't seem to correspond to any region splitting
> or
> > moving from one server to another activity.
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com>
> wrote:
> >>
> >> Saad,
> >>
> >> Region move or split causes client connections to simultaneously refresh
> >> their meta.
> >>
> >> Key word is supposed.  We have seen meta hot spotting from time to time
> >> and on different versions at Splice Machine.
> >>
> >> How confident are you in your hashing algorithm?
> >>
> >> Regards,
> >> John Leach
> >>
> >>
> >>
> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>
> >>> No never thought about that. I just figured out how to locate the
> server
> >>> for that table after you mentioned it. We'll have to keep an eye on it
> >> next
> >>> time we have a hotspot to see if it coincides with the hotspot server.
> >>>
> >>> What would be the theory for how it could become a hotspot? Isn't the
> >>> client supposed to cache it and only go back for a refresh if it hits a
> >>> region that is not in its expected location?
> >>>
> >>> 
> >>> Saad
> >>>
> >>>
> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com>
> >> wrote:
> >>>
> >>>> Saad,
> >>>>
> >>>> Did you validate that Meta is not on the “Hot” region server?
> >>>>
> >>>> Regards,
> >>>> John Leach
> >>>>
> >>>>
> >>>>
> >>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to
> avoid
> >>>>> hotspotting due to inadvertent data patterns by prepending an MD5
> >> based 4
> >>>>> digit hash prefix to all our data keys. This works fine most of the
> >>>> times,
> >>>>> but more and more (as much as once or twice a day) recently we have
> >>>>> occasions where one region server suddenly becomes "hot" (CPU above
> or
> >>>>> around 95% in various monitoring tools). When it happens it lasts for
> >>>>> hours, occasionally the hotspot might jump to another region server
> as
> >>>> the
> >>>>> master decide the region is unresponsive and gives its region to
> >> another
> >>>>> server.
> >>>>>
> >>>>> For the longest time, we thought this must be some single rogue key
> in
> >>>> our
> >>>>> input data that is being hammered. All attempts to track this down
> have
> >>>>> failed though, and the following behavior argues against this b

Re: Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
We used a pre-split into 1024 regions at the start, but we miscalculated our
data size, so there were still auto-split storms at the beginning. As data
size stabilized, it ended up at around 9500 or so regions, plus a few
thousand regions for a few other (much smaller) tables. But we haven't had
any new auto-splits in a couple of months, and the hotspots only started
happening recently.

Our hashing scheme is very simple: we take the MD5 of the key, then form a
4-digit prefix from the first two bytes of the MD5, normalized to be within
the range 0-1023. I am fairly confident about this scheme, especially since
even during the hotspot we see no evidence so far that any particular region
is taking disproportionate traffic (based on Cloudera Manager per-region
charts on the hotspot server). Does that look like a reasonable scheme to
randomize which region any given key goes to? Also, the start of the hotspot
doesn't seem to correspond to any region splitting or region movement
activity.
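
Concretely, the prefix computation is roughly the following (a simplified sketch; the exact normalization step in our code may differ slightly, but this is the idea):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltPrefix {
  // First two bytes of the MD5 of the key, reduced to the range 0-1023 and
  // zero-padded to 4 digits, then prepended to the original key.
  static String saltedKey(String key) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
    int bucket = (((digest[0] & 0xff) << 8) | (digest[1] & 0xff)) % 1024;
    return String.format("%04d", bucket) + key;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(saltedKey("example-key"));
  }
}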

Thanks.


Saad


On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com> wrote:

> Saad,
>
> Region move or split causes client connections to simultaneously refresh
> their meta.
>
> Key word is supposed.  We have seen meta hot spotting from time to time
> and on different versions at Splice Machine.
>
> How confident are you in your hashing algorithm?
>
> Regards,
> John Leach
>
>
>
> > On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > No never thought about that. I just figured out how to locate the server
> > for that table after you mentioned it. We'll have to keep an eye on it
> next
> > time we have a hotspot to see if it coincides with the hotspot server.
> >
> > What would be the theory for how it could become a hotspot? Isn't the
> > client supposed to cache it and only go back for a refresh if it hits a
> > region that is not in its expected location?
> >
> > 
> > Saad
> >
> >
> > On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com>
> wrote:
> >
> >> Saad,
> >>
> >> Did you validate that Meta is not on the “Hot” region server?
> >>
> >> Regards,
> >> John Leach
> >>
> >>
> >>
> >>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to avoid
> >>> hotspotting due to inadvertent data patterns by prepending an MD5
> based 4
> >>> digit hash prefix to all our data keys. This works fine most of the
> >> times,
> >>> but more and more (as much as once or twice a day) recently we have
> >>> occasions where one region server suddenly becomes "hot" (CPU above or
> >>> around 95% in various monitoring tools). When it happens it lasts for
> >>> hours, occasionally the hotspot might jump to another region server as
> >> the
> >>> master decide the region is unresponsive and gives its region to
> another
> >>> server.
> >>>
> >>> For the longest time, we thought this must be some single rogue key in
> >> our
> >>> input data that is being hammered. All attempts to track this down have
> >>> failed though, and the following behavior argues against this being
> >>> application based:
> >>>
> >>> 1. plotted Get and Put rate by region on the "hot" region server in
> >>> Cloudera Manager Charts, shows no single region is an outlier.
> >>>
> >>> 2. cleanly restarting just the region server process causes its regions
> >> to
> >>> randomly migrate to other region servers, then it gets new ones from
> the
> >>> HBase master, basically a sort of shuffling, then the hotspot goes
> away.
> >> If
> >>> it were application based, you'd expect the hotspot to just jump to
> >> another
> >>> region server.
> >>>
> >>> 3. have pored through region server logs and can't see anything out of
> >> the
> >>> ordinary happening
> >>>
> >>> The only other pertinent thing to mention might be that we have a
> special
> >>> process of our own running outside the cluster that does cluster wide
> >> major
> >>> compaction in a rolling fashion, where each batch consists of one
> region
> >>> from each region server, and it waits before one batch is completely
> done
> >>> before starting another. We have seen no real impact on the hotspot
> from
> >>> shutting this down and in normal times it doesn't impact our read or
> >> write
> >>> performance much.
> >>>
> >>> We are at our wit's end, anyone have experience with a scenario like
> >> this?
> >>> Any help/guidance would be most appreciated.
> >>>
> >>> -
> >>> Saad
> >>
> >>
>
>


Re: Using Hbase as a transactional table

2016-12-01 Thread Saad Mufti
FWIW, in my company (AOL) we discovered a small, elegant, all client side
transaction library on top of HBase called Haeinsa, originally written by a
Korea-based team. It doesn't look active anymore, so we had to fork it and
have made a couple of minor enhancements and one bugfix, but it has been
working perfectly for us. See the original repo at:

https://github.com/VCNC/haeinsa

Since we never got any responses to the issues we reported, we had to
maintain our own fork; see AOL's fork at:

https://github.com/aol/haeinsa

The front GitHub page links to a great SlideShare presentation that explains
nicely how the whole thing works, by just adding an extra column family per
table for the lock column.

All that said, keep in mind that any such library of course comes with a
performance hit. We like this library because:

1) we can use it with no HBase server side changes (other than the extra
column family) and no additional server to install
2) we can be selective about which columns or column families to use
transactionally and which to use "raw" for performance reasons

Would be great if we could submit our contributions back to the original
project but like I said we never heard back from them. Feel free to write
back to let us know what you think.


Saad


On Mon, Nov 28, 2016 at 6:01 PM, John Leach 
wrote:

> Mich,
>
> Splice Machine (Open Source) can do this on top of Hbase and we have an
> example running a TPC-C benchmark.  Might be worth a look.
>
> Regards,
> John
>
> > On Nov 28, 2016, at 4:36 PM, Ted Yu  wrote:
> >
> > Not sure if Transactions (beta) | Apache Phoenix is up to date.
> > Why not ask on Phoenix mailing list where you would get better answer(s)
> ?
> > Cheers
> >
> > |
> > |   |
> > Transactions (beta) | Apache Phoenix
> >   |  |
> >
> >  |
> >
> >
> >
> >
> >On Monday, November 28, 2016 2:02 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
> >
> >
> > Thanks Ted.
> >
> > How does Phoenix provide transaction support?
> >
> > I have read some docs but sounds like problematic. I need to be sure
> there
> > is full commit and rollback if things go wrong!
> >
> > Also it appears that Phoenix transactional support is in beta phase.
> >
> > Cheers
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn * https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  OABUrV8Pw>*
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising
> from
> > such loss, damage or destruction.
> >
> >
> >
> > On 23 November 2016 at 18:15, Ted Yu  wrote:
> >
> >> Mich:
> >> Even though related rows are on the same region server, there is no
> >> intrinsic transaction support.
> >>
> >> For #1 under design considerations, multi column family is one
> >> possibility. You should consider how the queries from RDBMS access the
> >> related data.
> >>
> >> You can also evaluate Phoenix / Trafodion which provides transaction
> >> support.
> >>
> >> Cheers
> >>
> >>> On Nov 23, 2016, at 9:19 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> >> wrote:
> >>>
> >>> Thanks all.
> >>>
> >>> As I understand Hbase does not support ACIC compliant transactions over
> >>> multiple rows or across tables?
> >>>
> >>> So this is not supported
> >>>
> >>>
> >>>   1. Hbase can support multi-rows transactions if the rows are on the
> >> same
> >>>   table and in the same RegionServer?
> >>>   2. Hbase does not support multi-rows transactions if the rows are in
> >>>   different tables but happen to be in the same RegionServer?
> >>>   3. If I migrated RDBMS transactional tables to the same Hbase table
> >> (big
> >>>   if) with different column familities will that work?
> >>>
> >>>
> >>> Design considerations
> >>>
> >>>
> >>>   1. If I have 4 big tables in RDBMS, some having in excess of 200
> >> columns
> >>>   (I know this is a joke), can they all go one-to-one to Hbase tables.
> >> Can
> >>>   some of these RDBMS tables put into one Hbase schema  with different
> >> column
> >>>   families.
> >>>   2. then another question. If I use hive tables on these hbase tables
> >>>   with large number of family columns, will it work ok?
> >>>
> >>> thanks
> >>>
> >>>   1.
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn * https://www.linkedin.com/profile/view?id=
> >> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>  AAEWh2gBxianrbJd6zP6AcPCCd
> >> OABUrV8Pw>*
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> *Disclaimer:* Use it at your own risk. 

Re: Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
No, I never thought about that. After you mentioned it, I figured out how to
locate the server hosting that table. We'll have to keep an eye on it next
time we have a hotspot to see if it coincides with the hotspot server.

What would be the theory for how it could become a hotspot? Isn't the
client supposed to cache it and only go back for a refresh if it hits a
region that is not in its expected location?
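
For anyone else who needs to check this, here is a minimal sketch of how I
looked up the server hosting hbase:meta with the 1.x client API (assuming a
normally configured connection; this is just a sketch, not our monitoring
code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class MetaLocation {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator = conn.getRegionLocator(TableName.META_TABLE_NAME)) {
      // hbase:meta is a single region, so the location of the empty start row
      // is the region server hosting it; 'true' forces a fresh lookup instead
      // of using the client's cache.
      HRegionLocation loc = locator.getRegionLocation(HConstants.EMPTY_START_ROW, true);
      System.out.println("hbase:meta is on " + loc.getHostnamePort());
    }
  }
}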


Saad


On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com> wrote:

> Saad,
>
> Did you validate that Meta is not on the “Hot” region server?
>
> Regards,
> John Leach
>
>
>
> > On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> > Hi,
> >
> > We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to avoid
> > hotspotting due to inadvertent data patterns by prepending an MD5 based 4
> > digit hash prefix to all our data keys. This works fine most of the
> times,
> > but more and more (as much as once or twice a day) recently we have
> > occasions where one region server suddenly becomes "hot" (CPU above or
> > around 95% in various monitoring tools). When it happens it lasts for
> > hours, occasionally the hotspot might jump to another region server as
> the
> > master decide the region is unresponsive and gives its region to another
> > server.
> >
> > For the longest time, we thought this must be some single rogue key in
> our
> > input data that is being hammered. All attempts to track this down have
> > failed though, and the following behavior argues against this being
> > application based:
> >
> > 1. plotted Get and Put rate by region on the "hot" region server in
> > Cloudera Manager Charts, shows no single region is an outlier.
> >
> > 2. cleanly restarting just the region server process causes its regions
> to
> > randomly migrate to other region servers, then it gets new ones from the
> > HBase master, basically a sort of shuffling, then the hotspot goes away.
> If
> > it were application based, you'd expect the hotspot to just jump to
> another
> > region server.
> >
> > 3. have pored through region server logs and can't see anything out of
> the
> > ordinary happening
> >
> > The only other pertinent thing to mention might be that we have a special
> > process of our own running outside the cluster that does cluster wide
> major
> > compaction in a rolling fashion, where each batch consists of one region
> > from each region server, and it waits before one batch is completely done
> > before starting another. We have seen no real impact on the hotspot from
> > shutting this down and in normal times it doesn't impact our read or
> write
> > performance much.
> >
> > We are at our wit's end, anyone have experience with a scenario like
> this?
> > Any help/guidance would be most appreciated.
> >
> > -
> > Saad
>
>


Hot Region Server With No Hot Region

2016-12-01 Thread Saad Mufti
Hi,

We are using HBase 1.0 on CDH 5.5.2. We have taken great care to avoid
hotspotting due to inadvertent data patterns by prepending an MD5-based
4-digit hash prefix to all our data keys. This works fine most of the time,
but recently, as often as once or twice a day, one region server suddenly
becomes "hot" (CPU above or around 95% in various monitoring tools). When
it happens it lasts for hours, and occasionally the hotspot jumps to
another region server when the master decides the original server is
unresponsive and reassigns its regions.

For the longest time, we thought this must be some single rogue key in our
input data that is being hammered. All attempts to track this down have
failed though, and the following behavior argues against this being
application based:

1. plotted Get and Put rate by region on the "hot" region server in
Cloudera Manager Charts, shows no single region is an outlier.

2. cleanly restarting just the region server process causes its regions to
randomly migrate to other region servers, then it gets new ones from the
HBase master, basically a sort of shuffling, then the hotspot goes away. If
it were application based, you'd expect the hotspot to just jump to another
region server.

3. have pored through region server logs and can't see anything out of the
ordinary happening

The only other pertinent thing to mention might be that we have a special
process of our own running outside the cluster that does cluster wide major
compaction in a rolling fashion, where each batch consists of one region
from each region server, and it waits until one batch is completely done
before starting another. We have seen no real impact on the hotspot from
shutting this down and in normal times it doesn't impact our read or write
performance much.

We are at our wit's end, anyone have experience with a scenario like this?
Any help/guidance would be most appreciated.

-
Saad


Does Replication Affect Write Performance On The Main Cluster

2016-07-15 Thread Saad Mufti
Hi,

Don't have anything conclusive, but I have seen some correlation: in very
high write-rate situations, the write rate can increase when major
compaction or some other high CPU/network activity (for example, we run some
Spark jobs on our replica HBase cluster) stops happening on the replica
cluster. Basically it looks like the rate is lower when the replica cluster
is busy with some other activity.

I wanted to know if this totally anecdotal evidence is something I should
look into or whether it is something well known in the community.

Thanks in advance for any pointer or advice for someone relatively new to
HBase.


Saad


Re: HBase number of columns

2016-06-16 Thread Saad Mufti
There is no real column schema in HBase other than defining the column
family; each write to a column writes a cell with the column name plus the
value, so in theory the number of columns doesn't really matter. What
matters is how much data you read and write.

That said, there are settings in the column family schema, such as
DATA_BLOCK_ENCODING, that affect how much actual space each column/cell
takes. FAST_DIFF is a decent choice to avoid too much redundancy from
writing the same column name over and over again when lots of rows have the
same column names. There are also compression settings, of course.
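
For example, a minimal sketch of creating a table with FAST_DIFF encoding
(and, optionally, block compression) using the 1.x Java API; the table and
family names here are just placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class CreateEncodedTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HColumnDescriptor family = new HColumnDescriptor("d");
      // FAST_DIFF avoids storing near-identical row keys and column names in
      // full for every cell within a block.
      family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
      // Compression is orthogonal to encoding; SNAPPY has to be available on
      // the cluster.
      family.setCompressionType(Compression.Algorithm.SNAPPY);
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
      table.addFamily(family);
      admin.createTable(table);
    }
  }
}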

Hope that helps.


Saad


On Wed, Jun 15, 2016 at 7:11 AM, Siddharth Ubale <
siddharth.ub...@syncoms.com> wrote:

> Hi,
>
> As per the official documentation of HBase it is mentioned that HBase
> typical schema should contain 1 to 3 column families per table (
> https://hbase.apache.org/book.html#table_schema_rules_of_thumb ) .
> However there is no mention of how many column qualifiers should a row
> contain for each column family to see good read & write performance.
> Could anybody let us know their input on how many columns per row is
> desirable in HBase or how many column qualifiers per column family would be
> desirable.
> Thanks,
> Siddharth Ubale,
>
>


Cell Level TTL And hfile.format.version

2016-05-06 Thread Saad Mufti
HI,

We're running a CDH 5.5.2 HBase cluster (HBase version 1.0.0-cdh5.5.2,
revision=Unknown), and we are using the per-cell TTL feature
(Mutation.setTTL).

As I learn more about and read up on HBase, I realized that in our HBase
config hfile.format.version was set to 2 (the default; we haven't touched
this setting yet), and from what I read that version of the HFile format
does NOT support the cell tags that per-cell TTLs rely on.
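
For reference, this is roughly how we set the per-cell TTL (a simplified
sketch with placeholder table/family/qualifier names); the comment shows the
hfile.format.version=3 requirement as I currently understand it, which is
exactly what I still need to verify:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellTtlExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("my_table"))) {
      Put put = new Put(Bytes.toBytes("some-row"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      // TTL is in milliseconds and is carried as a cell tag, so (as I
      // understand it) the region servers need this in hbase-site.xml for the
      // tag to survive flushes and compactions:
      //   <property>
      //     <name>hfile.format.version</name>
      //     <value>3</value>
      //   </property>
      put.setTTL(72L * 60 * 60 * 1000);  // 72 hours
      table.put(put);
    }
  }
}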

Of course I am in the process of writing a test to check whether our
production db is indeed getting filled with cells that should in actuality
be expired given their TTL value.

We haven't seen any errors at runtime. Does this mean our efforts to set a
TTL are being silently ignored? Isn't that bad behavior? Even if
hfile.format.version is set to the wrong version, wouldn't it be better to
throw an error instead of just silently dropping any tags that are set?

Thanks.

-
Saad


Re: Major Compaction Strategy

2016-04-29 Thread Saad Mufti
It is only because we prepend a good-quality hash to our incoming keys to
get a more even distribution and avoid hotspots, plus we handle huge
amounts of traffic.
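
For illustration, the prefixing is conceptually something like this (a
simplified sketch, not our exact production scheme):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class KeySalter {
  /** Prepend 4 hex chars of the key's MD5 hash so writes spread across regions. */
  public static String saltedKey(String originalKey) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(originalKey.getBytes(StandardCharsets.UTF_8));
    // Two digest bytes -> 4 hex characters, i.e. 65536 possible prefixes.
    String prefix = String.format("%02x%02x", digest[0], digest[1]);
    return prefix + "-" + originalKey;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(saltedKey("some-natural-key-123"));
  }
}

Since the prefix is computed deterministically from the key itself, point
reads can recompute it, but it does mean range scans over the natural key
order are no longer possible.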


Saad


On Fri, Apr 29, 2016 at 6:52 PM, Frank Luo <j...@merkleinc.com> wrote:

> I have to say you have an extremely good design to have all your 7000
> regions hot at all time.
>
> In my world, due to nature of data, we always have 10 to 20% of regions
> being idle for a bit of time.
>
> To be precise, the program gets R/W counts on a region, wait for one
> minutes, then gets the counts again. If unchanged, then the region is
> considered idle.
>
> -Original Message-
> From: Saad Mufti [mailto:saad.mu...@gmail.com]
> Sent: Friday, April 29, 2016 5:37 PM
> To: user@hbase.apache.org
> Subject: Re: Major Compaction Strategy
>
> Unfortunately all our tables and regions are active 24/7.  Traffic does
> fall some at night but there is no real downtime.
>
> It is not user facing load though so we could I guess turn off traffic for
> a while as data queues up in Kafka. But not too long as then we're playing
> catch up.
>
> 
> Saad
>
> On Friday, April 29, 2016, Frank Luo <j...@merkleinc.com> wrote:
>
> > Saad,
> >
> > Will all your tables/regions be used 24/7, or at any time, just a part
> > of regions used and others are running ideal?
> >
> > If latter, I developed a tool to launch major-compact in a "smart"
> > way, because I am facing a similar issue.
> > https://github.com/jinyeluo/smarthbasecompactor.
> >
> > It looks at every RegionServer, and find non-hot regions with most
> > store files and starts compacting. It just continue going until time
> > is up. Just to be clear, it doesn't perform MC itself, which is a
> > scary thing to do, but tell region servers to do MC.
> >
> > We have it running in our cluster for about 10 hours a day and it has
> > virtually no impact to applications and the cluster is doing far
> > better than when using default scheduled MC.
> >
> >
> > -Original Message-
> > From: Saad Mufti [mailto:saad.mu...@gmail.com <javascript:;>]
> > Sent: Friday, April 29, 2016 1:51 PM
> > To: user@hbase.apache.org <javascript:;>
> > Subject: Re: Major Compaction Strategy
> >
> > We have more issues now, after testing this in dev, in our production
> > cluster which has tons of data (60 regions servers and around 7000
> > regions), we tried to do rolling compaction and most regions that were
> > around 6-7 GB n size were taking 4-5 minutes to finish. Based on this
> > we estimated it would take something like 20 days for a single run to
> > finish, which doesn't seem reasonable.
> >
> > So is it more reasonable to aim for doing major compaction across all
> > region servers at once but within a RS one region at a time? That
> > would cut it down to around 8 hours which is still very long. Or is it
> > better to compact all regions on one region server, then move to the
> next?
> >
> > The goal of all this is to maintain decent write performance while
> > still doing compaction. We don't have a good very low load period for
> > our cluster so trying to find a way to do this without cluster downtime.
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> > On Wed, Apr 20, 2016 at 1:19 PM, Saad Mufti <saad.mu...@gmail.com
> > <javascript:;>> wrote:
> >
> > > Thanks for the pointer. Working like a charm.
> > >
> > > 
> > > Saad
> > >
> > >
> > > On Tue, Apr 19, 2016 at 4:01 PM, Ted Yu <yuzhih...@gmail.com
> > <javascript:;>> wrote:
> > >
> > >> Please use the following method of HBaseAdmin:
> > >>
> > >>   public CompactionState getCompactionStateForRegion(final byte[]
> > >> regionName)
> > >>
> > >> Cheers
> > >>
> > >> On Tue, Apr 19, 2016 at 12:56 PM, Saad Mufti <saad.mu...@gmail.com
> > <javascript:;>>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > We have a large HBase 1.x cluster in AWS and have disabled
> > >> > automatic
> > >> major
> > >> > compaction as advised. We were running our own code for
> > >> > compaction daily around midnight which calls
> > >> > HBaseAdmin.majorCompactRegion(byte[]
> > >> > regionName) in a rolling fashion across all regions.
> > >> >
> > >> > But we missed the fact tha

Re: Major Compaction Strategy

2016-04-29 Thread Saad Mufti
Unfortunately all our tables and regions are active 24/7.  Traffic does
fall some at night but there is no real downtime.

It is not user-facing load though, so I guess we could turn off traffic for
a while as data queues up in Kafka. But not for too long, as then we're
playing catch-up.


Saad

On Friday, April 29, 2016, Frank Luo <j...@merkleinc.com> wrote:

> Saad,
>
> Will all your tables/regions be used 24/7, or is only a part of the
> regions in use at any time while the others sit idle?
>
> If the latter, I developed a tool to launch major compaction in a "smart"
> way, because I am facing a similar issue:
> https://github.com/jinyeluo/smarthbasecompactor.
>
> It looks at every RegionServer, finds the non-hot regions with the most
> store files, and starts compacting. It just continues until time is up.
> Just to be clear, it doesn't perform the MC itself, which is a scary thing
> to do, but tells the region servers to do the MC.
>
> We have it running in our cluster for about 10 hours a day and it has
> virtually no impact to applications and the cluster is doing far better
> than when using default scheduled MC.
>
>
> -Original Message-
> From: Saad Mufti [mailto:saad.mu...@gmail.com <javascript:;>]
> Sent: Friday, April 29, 2016 1:51 PM
> To: user@hbase.apache.org <javascript:;>
> Subject: Re: Major Compaction Strategy
>
> We have more issues now, after testing this in dev, in our production
> cluster which has tons of data (60 regions servers and around 7000
> regions), we tried to do rolling compaction and most regions that were
> around 6-7 GB n size were taking 4-5 minutes to finish. Based on this we
> estimated it would take something like 20 days for a single run to finish,
> which doesn't seem reasonable.
>
> So is it more reasonable to aim for doing major compaction across all
> region servers at once but within a RS one region at a time? That would cut
> it down to around 8 hours which is still very long. Or is it better to
> compact all regions on one region server, then move to the next?
>
> The goal of all this is to maintain decent write performance while still
> doing compaction. We don't have a good very low load period for our cluster
> so trying to find a way to do this without cluster downtime.
>
> Thanks.
>
> 
> Saad
>
>
> On Wed, Apr 20, 2016 at 1:19 PM, Saad Mufti <saad.mu...@gmail.com
> <javascript:;>> wrote:
>
> > Thanks for the pointer. Working like a charm.
> >
> > 
> > Saad
> >
> >
> > On Tue, Apr 19, 2016 at 4:01 PM, Ted Yu <yuzhih...@gmail.com
> <javascript:;>> wrote:
> >
> >> Please use the following method of HBaseAdmin:
> >>
> >>   public CompactionState getCompactionStateForRegion(final byte[]
> >> regionName)
> >>
> >> Cheers
> >>
> >> On Tue, Apr 19, 2016 at 12:56 PM, Saad Mufti <saad.mu...@gmail.com
> <javascript:;>>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > We have a large HBase 1.x cluster in AWS and have disabled
> >> > automatic
> >> major
> >> > compaction as advised. We were running our own code for compaction
> >> > daily around midnight which calls
> >> > HBaseAdmin.majorCompactRegion(byte[]
> >> > regionName) in a rolling fashion across all regions.
> >> >
> >> > But we missed the fact that this is an asynchronous operation, so
> >> > in practice this causes major compaction to run across all regions,
> >> > at
> >> least
> >> > those not already major compacted (for example because previous
> >> > minor compactions got upgraded to major ones).
> >> >
> >> > We don't really have a suitable low load period, so what is a
> >> > suitable
> >> way
> >> > to make major compaction run in a rolling fashion region by region?
> >> > The
> >> API
> >> > above provides no return value for us to be able to wait for one
> >> compaction
> >> > to finish before moving to the next.
> >> >
> >> > Thanks.
> >> >
> >> > 
> >> > Saad
> >> >
> >>
> >
> >
>


Re: Major Compaction Strategy

2016-04-29 Thread Saad Mufti
We have more issues now. After testing this in dev, we tried rolling
compaction in our production cluster, which has tons of data (60 region
servers and around 7000 regions), and most regions that were around 6-7 GB
in size were taking 4-5 minutes to finish. Based on this we estimated it
would take something like 20 days for a single run to finish, which doesn't
seem reasonable.

So is it more reasonable to aim for doing major compaction across all
region servers at once but within a RS one region at a time? That would cut
it down to around 8 hours which is still very long. Or is it better to
compact all regions on one region server, then move to the next?

The goal of all this is to maintain decent write performance while still
doing compaction. We don't have a good very low load period for our cluster
so trying to find a way to do this without cluster downtime.

Thanks.


Saad


On Wed, Apr 20, 2016 at 1:19 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Thanks for the pointer. Working like a charm.
>
> 
> Saad
>
>
> On Tue, Apr 19, 2016 at 4:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Please use the following method of HBaseAdmin:
>>
>>   public CompactionState getCompactionStateForRegion(final byte[]
>> regionName)
>>
>> Cheers
>>
>> On Tue, Apr 19, 2016 at 12:56 PM, Saad Mufti <saad.mu...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > We have a large HBase 1.x cluster in AWS and have disabled automatic
>> major
>> > compaction as advised. We were running our own code for compaction daily
>> > around midnight which calls HBaseAdmin.majorCompactRegion(byte[]
>> > regionName) in a rolling fashion across all regions.
>> >
>> > But we missed the fact that this is an asynchronous operation, so in
>> > practice this causes major compaction to run across all regions, at
>> least
>> > those not already major compacted (for example because previous minor
>> > compactions got upgraded to major ones).
>> >
>> > We don't really have a suitable low load period, so what is a suitable
>> way
>> > to make major compaction run in a rolling fashion region by region? The
>> API
>> > above provides no return value for us to be able to wait for one
>> compaction
>> > to finish before moving to the next.
>> >
>> > Thanks.
>> >
>> > 
>> > Saad
>> >
>>
>
>


Re: Slow sync cost

2016-04-27 Thread Saad Mufti
Thanks, that is a lot of useful information. I have a lot of things to look
at now in my cluster and API clients.

Cheers.


Saad


On Wed, Apr 27, 2016 at 3:28 PM, Bryan Beaudreault <bbeaudrea...@hubspot.com
> wrote:

> We turned off auto-splitting by setting our region sizes to very large
> (100gb). We split them manually when they become too unwieldy from a
> compaction POV.
>
> We do use BufferedMutators in a number of places. They are pretty
> straightforward, and definitely improve performance. The only lessons
> learned there would be to use low buffer sizes. You'll get a lot of
> benefits from just 1MB size, but if you want to go higher than that, you
> should aim for less than half of your G1GC region size. Anything
> larger than that is considered a humongous object, and has implications for
> garbage collection. The blog post I linked earlier goes into humongous
> objects:
>
> http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection#HumongousObjects
> .
> We've seen them to be very bad for GC performance when many of them come in
> at once.
>
> So for us, most of our regionservers have 40gb+ heaps, for which we use
> 32mb G1GC regions. With 32mb G1GC regions, we aim for all buffered mutators
> to use less than 16mb buffer sizes -- we even go further to limit it to
> around 10mb just to be safe. We also do the same for reads -- we try to
> limit all scanner and multiget responses to less than 10mb.
>
> We've created a dashboard with our internal monitoring system which shows
> the count of requests that we consider too large, for all applications (we
> have many 100s of deployed applications hitting these clusters). It's on
> the individual teams that own the applications to try to drive that count
> down to 0. We've built into HBase a detention queue (similar to quotas),
> where we can put any of these applications based on their username if they
> are doing something that is adversely affecting the rest of the system. For
> instance if they started spamming a lot of too large requests, or badly
> filtered scans, etc. In the detention queue, they use their own RPC
> handlers which we can aggressively limit or reject if need be to preserve
> the cluster.
>
> Hope this helps
>
> On Wed, Apr 27, 2016 at 2:54 PM Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi Bryan,
> >
> > In Hubspot do you use a single shared (per-JVM) BufferedMutator anywhere
> in
> > an attempt to get better performance? Any lessons learned from any
> > attempts? Has it hurt or helped?
> >
> > Also do you have any experience with write performance in conjunction
> with
> > auto-splitting activity kicking in, either with BufferedMutator or
> > separately with just direct Put's?
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> >
> >
> > On Wed, Apr 27, 2016 at 2:22 PM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com
> > > wrote:
> >
> > > Hey Ted,
> > >
> > > Actually, gc_log_visualizer is open-sourced, I will ask the author to
> > > update the post with links:
> https://github.com/HubSpot/gc_log_visualizer
> > >
> > > The author was taking a foundational approach with this blog post. We
> do
> > > use ParallelGC for backend non-API deployables, such as kafka consumers
> > and
> > > long running daemons, etc. However, we treat HBase like our API's, in
> > that
> > > it must have low latency requests. So we use G1GC for HBase.
> > >
> > > Expect another blog post from another HubSpot engineer soon, with all
> the
> > > details on how we approached G1GC tuning for HBase. I will update this
> > list
> > > when it's published, and will put some pressure on that author to get
> it
> > > out there :)
> > >
> > > On Wed, Apr 27, 2016 at 2:01 PM Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > Bryan:
> > > > w.r.t. gc_log_visualizer, is there plan to open source it ?
> > > >
> > > > bq. while backend throughput will be better/cheaper with ParallelGC.
> > > >
> > > > Does the above mean that hbase servers are still using ParallelGC ?
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Apr 27, 2016 at 7:39 AM, Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com
> > > > > wrote:
> > > >
> > > > > We have 6 production clusters and all of them are tuned
> differently,
> > so
> > > > I'm
> > > > &

Re: HBase Write Performance Under Auto-Split

2016-04-27 Thread Saad Mufti
Thanks for the feedback. We already disabled automatic major compaction,
looks like we have to do the same for auto-splitting.
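
For anyone else following along, a minimal sketch of what that looks like
with the 1.x client API (table name and split point are hypothetical; the
policy is set by class name on the table descriptor):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class DisableAutoSplit {
  public static void main(String[] args) throws Exception {
    TableName name = TableName.valueOf("my_table");
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Turn off automatic splitting for this table only.
      HTableDescriptor desc = admin.getTableDescriptor(name);
      desc.setRegionSplitPolicyClassName(
          "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy");
      admin.modifyTable(name, desc);

      // Later, split a chosen region explicitly at a point we control.
      admin.split(name, Bytes.toBytes("0512"));
    }
  }
}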


Saad


On Wed, Apr 27, 2016 at 3:26 PM, Vladimir Rodionov <vladrodio...@gmail.com>
wrote:

> Every split results in major compactions for both daughter regions.
> Concurrent major compactions across a cluster are bad.
> I recommend setting DisabledRegionSplitPolicy on your table(s) and running
> splits manually - you will have control over what gets split and when.
> The same is true for major compactions: disable periodic major compactions
> and run them manually.
>
> -Vlad
>
> On Wed, Apr 27, 2016 at 8:27 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > Does anyone have experience with HBase write performance under auto-split
> > conditions? Out keyspace is randomized so all regions roughly start
> > auto-splitting around the same time, although early on when we had the
> 1024
> > regions we started with, they all decided to do so within an hour or so
> and
> > now that we're up to 6000 regions the process seems to be spread over 12
> > hours or more as they slowly reach their size thresholds.
> >
> > During this time, our writes, for which we use a shared BufferedMutator
> > suffer as writes time out and the underlying AsyncProcess thread pool
> seems
> > to fill up. Which means callers to our service see their response times
> > shoot up as they spend time trying to drain the buffer and submit
> mutations
> > to the thread pool. So overall system time suffers and we can't keep up
> > with our input load.
> >
> > Are there any guidelines on the size of the BufferedMutator to use? We
> are
> > even considering running performance tests without the BufferedMutator to
> > see if it is buying us anything. Currently we have it sized pretty large
> at
> > around 50 MB but maybe having it too big is not a good idea.
> >
> > Any help/advice would be most appreciated.
> >
> > Thanks.
> >
> > 
> > Saad
> >
>


Re: Slow sync cost

2016-04-27 Thread Saad Mufti
Hi Bryan,

In Hubspot do you use a single shared (per-JVM) BufferedMutator anywhere in
an attempt to get better performance? Any lessons learned from any
attempts? Has it hurt or helped?

Also do you have any experience with write performance in conjunction with
auto-splitting activity kicking in, either with BufferedMutator or
separately with just direct Put's?

Thanks.


Saad




On Wed, Apr 27, 2016 at 2:22 PM, Bryan Beaudreault <bbeaudrea...@hubspot.com
> wrote:

> Hey Ted,
>
> Actually, gc_log_visualizer is open-sourced, I will ask the author to
> update the post with links: https://github.com/HubSpot/gc_log_visualizer
>
> The author was taking a foundational approach with this blog post. We do
> use ParallelGC for backend non-API deployables, such as kafka consumers and
> long running daemons, etc. However, we treat HBase like our API's, in that
> it must have low latency requests. So we use G1GC for HBase.
>
> Expect another blog post from another HubSpot engineer soon, with all the
> details on how we approached G1GC tuning for HBase. I will update this list
> when it's published, and will put some pressure on that author to get it
> out there :)
>
> On Wed, Apr 27, 2016 at 2:01 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Bryan:
> > w.r.t. gc_log_visualizer, is there plan to open source it ?
> >
> > bq. while backend throughput will be better/cheaper with ParallelGC.
> >
> > Does the above mean that hbase servers are still using ParallelGC ?
> >
> > Thanks
> >
> > On Wed, Apr 27, 2016 at 7:39 AM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com
> > > wrote:
> >
> > > We have 6 production clusters and all of them are tuned differently, so
> > I'm
> > > not sure there is a setting I could easily give you. It really depends
> on
> > > the usage.  One of our devs wrote a blog post on G1GC fundamentals
> > > recently. It's rather long, but could be worth a read:
> > >
> > >
> >
> http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection
> > >
> > > We will also have a blog post coming out in the next week or so that
> > talks
> > > specifically to tuning G1GC for HBase. I can update this thread when
> > that's
> > > available.
> > >
> > > On Tue, Apr 26, 2016 at 8:08 PM Saad Mufti <saad.mu...@gmail.com>
> wrote:
> > >
> > > > That is interesting. Would it be possible for you to share what GC
> > > settings
> > > > you ended up on that gave you the most predictable performance?
> > > >
> > > > Thanks.
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > On Tue, Apr 26, 2016 at 11:56 AM, Bryan Beaudreault <
> > > > bbeaudrea...@hubspot.com> wrote:
> > > >
> > > > > We were seeing this for a while with our CDH5 HBase clusters too.
> We
> > > > > eventually correlated it very closely to GC pauses. Through heavily
> > > > tuning
> > > > > our GC we were able to drastically reduce the logs, by keeping most
> > > GC's
> > > > > under 100ms.
> > > > >
> > > > > On Tue, Apr 26, 2016 at 6:25 AM Saad Mufti <saad.mu...@gmail.com>
> > > wrote:
> > > > >
> > > > > > From what I can see in the source code, the default is actually
> > even
> > > > > lower
> > > > > > at 100 ms (can be overridden with
> > > hbase.regionserver.hlog.slowsync.ms
> > > > ).
> > > > > >
> > > > > > 
> > > > > > Saad
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling <
> > > > kevin.bowl...@kev009.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I see similar log spam while system has reasonable performance.
> > > Was
> > > > > the
> > > > > > > 250ms default chosen with SSDs and 10ge in mind or something?
> I
> > > > guess
> > > > > > I'm
> > > > > > > surprised a sync write several times through JVMs to 2 remote
> > > > datanodes
> > > > > > > would be expected to consistently happen that fast.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > On Mon, A

Re: Slow sync cost

2016-04-27 Thread Saad Mufti
Thanks, looks like an interesting read, will go try to absorb all the
information.


Saad


On Wed, Apr 27, 2016 at 10:39 AM, Bryan Beaudreault <
bbeaudrea...@hubspot.com> wrote:

> We have 6 production clusters and all of them are tuned differently, so I'm
> not sure there is a setting I could easily give you. It really depends on
> the usage.  One of our devs wrote a blog post on G1GC fundamentals
> recently. It's rather long, but could be worth a read:
>
> http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection
>
> We will also have a blog post coming out in the next week or so that talks
> specifically to tuning G1GC for HBase. I can update this thread when that's
> available.
>
> On Tue, Apr 26, 2016 at 8:08 PM Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > That is interesting. Would it be possible for you to share what GC
> settings
> > you ended up on that gave you the most predictable performance?
> >
> > Thanks.
> >
> > 
> > Saad
> >
> >
> > On Tue, Apr 26, 2016 at 11:56 AM, Bryan Beaudreault <
> > bbeaudrea...@hubspot.com> wrote:
> >
> > > We were seeing this for a while with our CDH5 HBase clusters too. We
> > > eventually correlated it very closely to GC pauses. Through heavily
> > tuning
> > > our GC we were able to drastically reduce the logs, by keeping most
> GC's
> > > under 100ms.
> > >
> > > On Tue, Apr 26, 2016 at 6:25 AM Saad Mufti <saad.mu...@gmail.com>
> wrote:
> > >
> > > > From what I can see in the source code, the default is actually even
> > > lower
> > > > at 100 ms (can be overridden with
> hbase.regionserver.hlog.slowsync.ms
> > ).
> > > >
> > > > 
> > > > Saad
> > > >
> > > >
> > > > On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling <
> > kevin.bowl...@kev009.com
> > > >
> > > > wrote:
> > > >
> > > > > I see similar log spam while system has reasonable performance.
> Was
> > > the
> > > > > 250ms default chosen with SSDs and 10ge in mind or something?  I
> > guess
> > > > I'm
> > > > > surprised a sync write several times through JVMs to 2 remote
> > datanodes
> > > > > would be expected to consistently happen that fast.
> > > > >
> > > > > Regards,
> > > > >
> > > > > On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti <saad.mu...@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > In our large HBase cluster based on CDH 5.5 in AWS, we're
> > constantly
> > > > > seeing
> > > > > > the following messages in the region server logs:
> > > > > >
> > > > > > 2016-04-25 14:02:55,178 INFO
> > > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > 258
> > > > ms,
> > > > > > current pipeline:
> > > > > > [DatanodeInfoWithStorage[10.99.182.165:50010
> > > > > > ,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > > > > > DatanodeInfoWithStorage[10.99.182.236:50010
> > > > > > ,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > > > > > DatanodeInfoWithStorage[10.99.182.195:50010
> > > > > > ,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> > > > > >
> > > > > >
> > > > > > These happen regularly while HBase appear to be operating
> normally
> > > with
> > > > > > decent read and write performance. We do have occasional
> > performance
> > > > > > problems when regions are auto-splitting, and at first I thought
> > this
> > > > was
> > > > > > related but now I se it happens all the time.
> > > > > >
> > > > > >
> > > > > > Can someone explain what this means really and should we be
> > > concerned?
> > > > I
> > > > > > tracked down the source code that outputs it in
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> > > > > >
> > > > > > but after going through the code I think I'd need to know much
> more
> > > > about
> > > > > > the code to glean anything from it or the associated JIRA ticket
> > > > > > https://issues.apache.org/jira/browse/HBASE-11240.
> > > > > >
> > > > > > Also, what is this "pipeline" the ticket and code talks about?
> > > > > >
> > > > > > Thanks in advance for any information and/or clarification anyone
> > can
> > > > > > provide.
> > > > > >
> > > > > > 
> > > > > >
> > > > > > Saad
> > > > > >
> > > > >
> > > >
> > >
> >
>


HBase Write Performance Under Auto-Split

2016-04-27 Thread Saad Mufti
Hi,

Does anyone have experience with HBase write performance under auto-split
conditions? Our keyspace is randomized, so all regions roughly start
auto-splitting around the same time. Early on, when we had the 1024 regions
we started with, they all decided to split within an hour or so; now that
we're up to 6000 regions the process seems to be spread over 12 hours or
more as they slowly reach their size thresholds.

During this time our writes, for which we use a shared BufferedMutator,
suffer as writes time out and the underlying AsyncProcess thread pool seems
to fill up. This means callers to our service see their response times
shoot up as they spend time trying to drain the buffer and submit mutations
to the thread pool, so overall response time suffers and we can't keep up
with our input load.

Are there any guidelines on the size of the BufferedMutator to use? We are
even considering running performance tests without the BufferedMutator to
see if it is buying us anything. Currently we have it sized pretty large at
around 50 MB but maybe having it too big is not a good idea.
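
For reference, we create the shared mutator roughly like this (a sketch with
a hypothetical table name; the writeBufferSize value is the knob in
question):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SharedMutatorExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
      BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("my_table"))
          // Buffer size in bytes; we currently use ~50 MB, but a smaller value
          // flushes more often and keeps less work queued up in AsyncProcess.
          .writeBufferSize(8L * 1024 * 1024)
          .listener((e, m) -> {
            // Mutations that failed after retries are exhausted end up here.
            System.err.println("Failed to send " + e.getNumExceptions() + " mutations");
          });
      try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
        Put put = new Put(Bytes.toBytes("some-row"));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        mutator.mutate(put);  // buffered; flushed when the buffer fills or on close()
      }
    }
  }
}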

Any help/advice would be most appreciated.

Thanks.


Saad


Re: Slow sync cost

2016-04-26 Thread Saad Mufti
That is interesting. Would it be possible for you to share what GC settings
you ended up on that gave you the most predictable performance?

Thanks.


Saad


On Tue, Apr 26, 2016 at 11:56 AM, Bryan Beaudreault <
bbeaudrea...@hubspot.com> wrote:

> We were seeing this for a while with our CDH5 HBase clusters too. We
> eventually correlated it very closely to GC pauses. Through heavily tuning
> our GC we were able to drastically reduce the logs, by keeping most GC's
> under 100ms.
>
> On Tue, Apr 26, 2016 at 6:25 AM Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > From what I can see in the source code, the default is actually even
> lower
> > at 100 ms (can be overridden with hbase.regionserver.hlog.slowsync.ms).
> >
> > 
> > Saad
> >
> >
> > On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling <kevin.bowl...@kev009.com
> >
> > wrote:
> >
> > > I see similar log spam while system has reasonable performance.  Was
> the
> > > 250ms default chosen with SSDs and 10ge in mind or something?  I guess
> > I'm
> > > surprised a sync write several times through JVMs to 2 remote datanodes
> > > would be expected to consistently happen that fast.
> > >
> > > Regards,
> > >
> > > On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti <saad.mu...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > In our large HBase cluster based on CDH 5.5 in AWS, we're constantly
> > > seeing
> > > > the following messages in the region server logs:
> > > >
> > > > 2016-04-25 14:02:55,178 INFO
> > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 258
> > ms,
> > > > current pipeline:
> > > > [DatanodeInfoWithStorage[10.99.182.165:50010
> > > > ,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > > > DatanodeInfoWithStorage[10.99.182.236:50010
> > > > ,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > > > DatanodeInfoWithStorage[10.99.182.195:50010
> > > > ,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> > > >
> > > >
> > > > These happen regularly while HBase appear to be operating normally
> with
> > > > decent read and write performance. We do have occasional performance
> > > > problems when regions are auto-splitting, and at first I thought this
> > was
> > > > related but now I se it happens all the time.
> > > >
> > > >
> > > > Can someone explain what this means really and should we be
> concerned?
> > I
> > > > tracked down the source code that outputs it in
> > > >
> > > >
> > > >
> > >
> >
> hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> > > >
> > > > but after going through the code I think I'd need to know much more
> > about
> > > > the code to glean anything from it or the associated JIRA ticket
> > > > https://issues.apache.org/jira/browse/HBASE-11240.
> > > >
> > > > Also, what is this "pipeline" the ticket and code talks about?
> > > >
> > > > Thanks in advance for any information and/or clarification anyone can
> > > > provide.
> > > >
> > > > 
> > > >
> > > > Saad
> > > >
> > >
> >
>


Re: Slow sync cost

2016-04-26 Thread Saad Mufti
From what I can see in the source code, the default is actually even lower
at 100 ms (can be overridden with hbase.regionserver.hlog.slowsync.ms).


Saad


On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling <kevin.bowl...@kev009.com>
wrote:

> I see similar log spam while system has reasonable performance.  Was the
> 250ms default chosen with SSDs and 10ge in mind or something?  I guess I'm
> surprised a sync write several times through JVMs to 2 remote datanodes
> would be expected to consistently happen that fast.
>
> Regards,
>
> On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > In our large HBase cluster based on CDH 5.5 in AWS, we're constantly
> seeing
> > the following messages in the region server logs:
> >
> > 2016-04-25 14:02:55,178 INFO
> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 258 ms,
> > current pipeline:
> > [DatanodeInfoWithStorage[10.99.182.165:50010
> > ,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > DatanodeInfoWithStorage[10.99.182.236:50010
> > ,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > DatanodeInfoWithStorage[10.99.182.195:50010
> > ,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> >
> >
> > These happen regularly while HBase appear to be operating normally with
> > decent read and write performance. We do have occasional performance
> > problems when regions are auto-splitting, and at first I thought this was
> > related but now I se it happens all the time.
> >
> >
> > Can someone explain what this means really and should we be concerned? I
> > tracked down the source code that outputs it in
> >
> >
> >
> hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> >
> > but after going through the code I think I'd need to know much more about
> > the code to glean anything from it or the associated JIRA ticket
> > https://issues.apache.org/jira/browse/HBASE-11240.
> >
> > Also, what is this "pipeline" the ticket and code talks about?
> >
> > Thanks in advance for any information and/or clarification anyone can
> > provide.
> >
> > 
> >
> > Saad
> >
>


Re: Slow sync cost

2016-04-25 Thread Saad Mufti
Thanks, the meaning makes sense now but I still need to figure out why we
keep seeing this. How would I know if this is just us overloading the
capacity of our system with too much write load vs some configuration
problem?

Thanks.


Saad

On Mon, Apr 25, 2016 at 4:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> w.r.t. the pipeline, please see this description:
>
> http://itm-vm.shidler.hawaii.edu/HDFS/ArchDocUseCases.html
>
> On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > In our large HBase cluster based on CDH 5.5 in AWS, we're constantly
> seeing
> > the following messages in the region server logs:
> >
> > 2016-04-25 14:02:55,178 INFO
> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 258 ms,
> > current pipeline:
> > [DatanodeInfoWithStorage[10.99.182.165:50010
> > ,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > DatanodeInfoWithStorage[10.99.182.236:50010
> > ,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > DatanodeInfoWithStorage[10.99.182.195:50010
> > ,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> >
> >
> > These happen regularly while HBase appear to be operating normally with
> > decent read and write performance. We do have occasional performance
> > problems when regions are auto-splitting, and at first I thought this was
> > related but now I se it happens all the time.
> >
> >
> > Can someone explain what this means really and should we be concerned? I
> > tracked down the source code that outputs it in
> >
> >
> >
> hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> >
> > but after going through the code I think I'd need to know much more about
> > the code to glean anything from it or the associated JIRA ticket
> > https://issues.apache.org/jira/browse/HBASE-11240.
> >
> > Also, what is this "pipeline" the ticket and code talks about?
> >
> > Thanks in advance for any information and/or clarification anyone can
> > provide.
> >
> > 
> >
> > Saad
> >
>


Re: Hbase shell script from java

2016-04-25 Thread Saad Mufti
I don't know why you would want to do this all through the hbase shell if
your main driver is Java, unless you have a lot of existing complicated
scripts you want to leverage. Why not just write Java code against the
standard Java hbase client?
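
To illustrate (a sketch with hypothetical arguments, not a drop-in
replacement for any particular script), the kind of per-deployment table
tweak you describe is only a few lines with the Java Admin API, and the
parameters can come from the command line or your deployment tooling:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TuneTable {
  public static void main(String[] args) throws Exception {
    TableName name = TableName.valueOf(args[0]);      // e.g. "my_table"
    long maxFileSizeBytes = Long.parseLong(args[1]);  // e.g. 10737418240

    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = admin.getTableDescriptor(name);
      desc.setMaxFileSize(maxFileSizeBytes);  // region size threshold before splitting
      admin.modifyTable(name, desc);          // the Java equivalent of 'alter' in the shell
    }
  }
}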

Or if you need different parameters for each time you invoke a script,
parameterize your script and invoke it with different parameters each time.
HBase shell scripts are just Ruby scripts with special built in HBase
commands/functions, so parameterize it in the same way you'd parameterize
any Ruby script.


Saad


On Mon, Apr 25, 2016 at 7:10 PM, Saurabh Malviya (samalviy) <
samal...@cisco.com> wrote:

> Thanks, That will work for me.
>
> I am just curious how people are doing this in industry. Suppose you have
> more than 100 tables and need to modify the table scripts a lot for each
> deployment, for performance or other reasons.
>
> -Saurabh
>
> -----Original Message-
> From: Saad Mufti [mailto:saad.mu...@gmail.com]
> Sent: Sunday, April 24, 2016 2:55 PM
> To: user@hbase.apache.org
> Subject: Re: Hbase shell script from java
>
> Why can't you install hbase on your local machine, with the configuration
> pointing it to your desired cluster, then run the hbase shell and its
> script locally?
>
> I believe the HBase web UI has a convenient link to download client
> configuration.
>
> 
> Saad
>
>
> On Sun, Apr 24, 2016 at 5:22 PM, Saurabh Malviya (samalviy) <
> samal...@cisco.com> wrote:
>
> > I need to execute this command remotely.
> >
> > Right now I am SSH into hbase master and execute the script, which is
> > not the most elegant way to do.
> >
> > Saurabh
> >
> > Please see
> >
> >
> > https://blog.art-of-coding.eu/executing-operating-system-commands-from
> > -java/
> >
> > > On Apr 23, 2016, at 10:18 PM, Saurabh Malviya (samalviy) <
> > samal...@cisco.com> wrote:
> > >
> > > Hi,
> > >
> > > Is there any way to run hbase shell script from Java. Also mentioned
> > this question earlier in below url earlier.
> > >
> > > As we are having bunch of scripts and need to change frequently for
> > performance tuning.
> > >
> > >
> > http://grokbase.com/p/hbase/user/161ezbnk11/run-hbase-shell-script-fro
> > m-java
> > >
> > >
> > > -Saurabh
> >
>


Slow sync cost

2016-04-25 Thread Saad Mufti
Hi,

In our large HBase cluster based on CDH 5.5 in AWS, we're constantly seeing
the following messages in the region server logs:

2016-04-25 14:02:55,178 INFO
org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 258 ms,
current pipeline:
[DatanodeInfoWithStorage[10.99.182.165:50010,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
DatanodeInfoWithStorage[10.99.182.236:50010,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
DatanodeInfoWithStorage[10.99.182.195:50010
,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]


These happen regularly while HBase appears to be operating normally with
decent read and write performance. We do have occasional performance
problems when regions are auto-splitting, and at first I thought this was
related, but now I see it happens all the time.


Can someone explain what this means really and should we be concerned? I
tracked down the source code that outputs it in

hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java

but after going through the code I think I'd need to know much more about
the code to glean anything from it or the associated JIRA ticket
https://issues.apache.org/jira/browse/HBASE-11240.

Also, what is this "pipeline" the ticket and code talks about?

Thanks in advance for any information and/or clarification anyone can
provide.



Saad


Re: Hbase shell script from java

2016-04-24 Thread Saad Mufti
Why can't you install hbase on your local machine, with the configuration
pointing it to your desired cluster, then run the hbase shell and its
script locally?

I believe the HBase web UI has a convenient link to download client
configuration.


Saad


On Sun, Apr 24, 2016 at 5:22 PM, Saurabh Malviya (samalviy) <
samal...@cisco.com> wrote:

> I need to execute this command remotely.
>
> Right now I am SSH into hbase master and execute the script, which is not
> the most elegant way to do.
>
> Saurabh
>
> Please see
>
>
> https://blog.art-of-coding.eu/executing-operating-system-commands-from-java/
>
> > On Apr 23, 2016, at 10:18 PM, Saurabh Malviya (samalviy) <
> samal...@cisco.com> wrote:
> >
> > Hi,
> >
> > Is there any way to run hbase shell script from Java. Also mentioned
> this question earlier in below url earlier.
> >
> > As we are having bunch of scripts and need to change frequently for
> performance tuning.
> >
> >
> http://grokbase.com/p/hbase/user/161ezbnk11/run-hbase-shell-script-from-java
> >
> >
> > -Saurabh
>


Re: zero data locality

2016-04-20 Thread Saad Mufti
This is from just one region server, right? Are you sure it is co-located
with an HDFS data node after your upgrade?

I imagine that is a pretty obvious thing to check, but it's the only thing I
can think of.


Saad


On Wed, Apr 20, 2016 at 10:30 AM, Ted Tuttle  wrote:

> Hello-
>
> We just upgraded to HBase 1.2.  After the upgrade I see Data Locality is
> precisely zero for all regions of all regions servers.
>
> See screenshot here: https://snag.gy/2r3CSl
>
> This seems unlikely as I can see many major compactions of regions have
> occurred since our upgrade.
>
> Any ideas what is wrong with status page?
>
> A related question: post-upgrade we enabled HDFS short-circuit.  Is there
> a log signature I can search for to verify we've configured this correctly?
>
> Thanks,
> Ted
>
>


Re: Major Compaction Strategy

2016-04-20 Thread Saad Mufti
Thanks for the pointer. Working like a charm.
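
In case it's useful to anyone else on the thread, the rolling loop ended up
looking roughly like this (a simplified sketch against the 1.x Admin API,
not our exact production code; the table name is a placeholder, and I
believe the compaction state returned in 1.x is the protobuf-generated
enum):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.protobuf.generated.AdminProtos.GetRegionInfoResponse.CompactionState;

public class RollingMajorCompact {
  public static void main(String[] args) throws Exception {
    TableName name = TableName.valueOf("my_table");
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      for (HRegionInfo region : admin.getTableRegions(name)) {
        byte[] regionName = region.getRegionName();
        admin.majorCompactRegion(regionName);  // asynchronous request
        // Give the request a moment to register, then poll until the region
        // reports it is no longer compacting before moving to the next one.
        do {
          Thread.sleep(10000L);
        } while (admin.getCompactionStateForRegion(regionName) != CompactionState.NONE);
      }
    }
  }
}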


Saad


On Tue, Apr 19, 2016 at 4:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Please use the following method of HBaseAdmin:
>
>   public CompactionState getCompactionStateForRegion(final byte[]
> regionName)
>
> Cheers
>
> On Tue, Apr 19, 2016 at 12:56 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > We have a large HBase 1.x cluster in AWS and have disabled automatic
> major
> > compaction as advised. We were running our own code for compaction daily
> > around midnight which calls HBaseAdmin.majorCompactRegion(byte[]
> > regionName) in a rolling fashion across all regions.
> >
> > But we missed the fact that this is an asynchronous operation, so in
> > practice this causes major compaction to run across all regions, at least
> > those not already major compacted (for example because previous minor
> > compactions got upgraded to major ones).
> >
> > We don't really have a suitable low load period, so what is a suitable
> way
> > to make major compaction run in a rolling fashion region by region? The
> API
> > above provides no return value for us to be able to wait for one
> compaction
> > to finish before moving to the next.
> >
> > Thanks.
> >
> > 
> > Saad
> >
>


Re: Sources Of HBase Client Side Latency

2016-04-20 Thread Saad Mufti
Thanks, good advice. At a cursory glance at region server logs (didn't find
anything interesting in master logs), I already see some GC pauses and
complaints about "Slow sync cost". Will ask here with further details if
need be after further analysis.

Cheers.


Saad


On Tue, Apr 19, 2016 at 6:35 PM, Stack <st...@duboce.net> wrote:

> On Tue, Apr 19, 2016 at 2:07 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> > Hi,
> >
> > I found this blog post from 2014 on sources of HBase client side latency
> > which I found useful:
> >
> >
> >
> https://hadoop-hbase.blogspot.com/2014/08/hbase-client-response-times.html?showComment=1461099797978#c5266762058464276023
> >
> >
> It is written by an authority.
>
>
>
> > Since this is a bit dated, anyone have any other sources of latency to
> add?
> > In our experience with HBase 1.x so far we've definitely seen long
> > latencies when a region server dies, but we've also seen it during
> > auto-splits which this post suggests shouldn't be that long.
> >
> >
> Easy enough to check. Look at the master log and see the steps involved.
>
>
>
> > Then we have other unexplained (so far, at least by us) big response time
> > spikes and client side timeouts that can last a few minutes or in some
> > cases a couple of hours that we'd like to explain. Could these be from
> > either automatic major compaction or minor compactions that got upgraded
> to
> > major?
> >
> >
> These may up latency when running but not for minutes or hours.
>
> Sounds like something else is going on.
>
>
>
> > Any advice on where to start looking to investigate?
> >
> >
> Check master log at the time of slowness and then follow your nose. If you
> need more specifics on how to debug, come back here w/ some more detail
> around a particular event.
>
> Yours,
> St.Ack
>
>
> > Thanks.
> >
> > 
> > Saad
> >
>


Sources Of HBase Client Side Latency

2016-04-19 Thread Saad Mufti
Hi,

I found this blog post from 2014 on sources of HBase client side latency
which I found useful:

https://hadoop-hbase.blogspot.com/2014/08/hbase-client-response-times.html?showComment=1461099797978#c5266762058464276023

Since this is a bit dated, anyone have any other sources of latency to add?
In our experience with HBase 1.x so far we've definitely seen long
latencies when a region server dies, but we've also seen them during
auto-splits, which this post suggests shouldn't take that long.

Then we have other unexplained (so far, at least by us) big response time
spikes and client side timeouts that can last a few minutes or in some
cases a couple of hours that we'd like to explain. Could these be from
either automatic major compaction or minor compactions that got upgraded to
major?

Any advice on where to start looking to investigate?

Thanks.


Saad


Major Compaction Strategy

2016-04-19 Thread Saad Mufti
Hi,

We have a large HBase 1.x cluster in AWS and have disabled automatic major
compaction as advised. We were running our own code for compaction daily
around midnight which calls HBaseAdmin.majorCompactRegion(byte[]
regionName) in a rolling fashion across all regions.

But we missed the fact that this is an asynchronous operation, so in
practice this causes major compaction to run across all regions, at least
those not already major compacted (for example because previous minor
compactions got upgraded to major ones).

We don't really have a suitable low load period, so what is a suitable way
to make major compaction run in a rolling fashion region by region? The API
above provides no return value for us to be able to wait for one compaction
to finish before moving to the next.

Thanks.


Saad


CDH Hbase 1.0.0 Requested row out of range during auto-split

2016-04-02 Thread Saad Mufti
Hi,

We have a large 60-node CDH 5.5.2 HBase 1.0.0 cluster that takes a very
heavy write load. For increased performance, we are using the
BufferedMutator class in hbase-client, although we're using hbase-client
version 1.2.0 because it has a small performance fix to this class.

It seems to be working fine whether the HBase region servers are undergoing
minor compaction or not, but at some point a lot of them decide to start
auto-splitting. We configured the tables up front with an explicit set of
splits for a total of 1024 regions. By now we are up to 4096 or so regions
due to auto-splitting.
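
For context, the pre-splitting looks roughly like this (a simplified sketch,
not our exact code; table and family names are placeholders, and it assumes
the 4-digit prefix ranges over 0000-1023):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    int numRegions = 1024;
    // N regions need N-1 split keys: "0001" .. "1023"; keys prefixed "0000"
    // land in the first region.
    byte[][] splitKeys = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splitKeys[i - 1] = Bytes.toBytes(String.format("%04d", i));
    }
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table"));
    desc.addFamily(new HColumnDescriptor("f"));
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      admin.createTable(desc, splitKeys);
    }
  }
}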

We have observed that as soon as auto-splitting starts, the mutations in
the BufferedMutator start throwing exceptions of the following form (after
retries I presume, we are configured for 5 retries and a 3 second RPC
timeout):

BufferedMutator Failed to send mutation
{"totalColumns":1,"row":"0998-f0b05361-5983-46d0-9443-4ca845987aad","families":{"s1":[{"qualifier":"s-1123-20240198","vlen":28,"tag":[],"timestamp":1459655306487}]},"ttl":72693513}
. org.apache.hadoop.hbase.exceptions.FailedSanityCheckException:
org.apache.hadoop.hbase.exceptions.FailedSanityCheckException: Requested
row out of range for doMiniBatchMutation on HRegion
[region-actual-value-removed],
startKey='0998-UP27f54301-cf7b-11e5-858a-00163ebce99d', getEndKey()='0999',
row='0759-UPbe8df720-c9d5-11e5-947f-00163ee0563a' at
org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:688)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:639)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1931)
at
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)

Does anyone know the cause of this? I am new to HBase; in most other
distributed systems I have worked with, the system handles rerouting, so I
was expecting that HBase would automatically route requests to the correct
region server/region during splits.

Any help that anyone can provide would be much appreciated. I can provide
more config settings if they turn out to be relevant but there are a ton of
them in Hbase of course and I am not sure which ones would be relevant here.

Thanks in advance.



Saad