[jira] [Created] (HBASE-11007) BLOCKCACHE in schema descriptor seems not aptly named

2014-04-16 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-11007:


 Summary: BLOCKCACHE in schema descriptor seems not aptly named
 Key: HBASE-11007
 URL: https://issues.apache.org/jira/browse/HBASE-11007
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.18
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor


Hi,

It seems that setting the BLOCKCACHE key to false disables caching of data 
blocks but continues to cache bloom and index blocks. The same property 
appears to be called cacheDataOnRead inside CacheConfig.

Should this be called CACHE_DATA_ON_READ instead of BLOCKCACHE, similar to the 
other keys CACHE_DATA_ON_WRITE/CACHE_INDEX_ON_WRITE? We got quite confused and 
ended up adding our own CACHE_DATA_ON_READ property - we also added some unit 
tests for it.
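
For concreteness, this is roughly how the flag is set today through the 
0.94-era client API (a sketch only; the table and family names are made up):

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;

    public class BlockCacheNamingExample {
      public static void main(String[] args) {
        HTableDescriptor table = new HTableDescriptor("feeds");
        HColumnDescriptor family = new HColumnDescriptor("home");

        // Despite the name, this maps to cacheDataOnRead in CacheConfig and
        // only stops DATA blocks from being cached on read; index and bloom
        // blocks are still cached.
        family.setBlockCacheEnabled(false);
        table.addFamily(family);

        System.out.println("BLOCKCACHE = " + family.isBlockCacheEnabled());
      }
    }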

What do folks think about this ?

Thanks
Varun



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11007) BLOCKCACHE in schema descriptor seems not aptly named

2014-04-16 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972100#comment-13972100
 ] 

Varun Sharma commented on HBASE-11007:
--

I agree on the backward compatibility issue. How about at least adding a test, 
as part of this issue, that exercises this property's code path and makes sure 
it only affects data blocks? I found several tests for CACHE_DATA_ON_WRITE 
etc. but none for this option - did I miss the unit test by any chance?
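
Something along these lines is what we had in mind (a sketch, assuming the 
0.94-era CacheConfig API; the shouldCacheBlockOnRead/BlockCategory names are 
our assumption and may differ by version):

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.io.hfile.BlockType.BlockCategory;
    import org.apache.hadoop.hbase.io.hfile.CacheConfig;
    import org.junit.Test;

    public class TestBlockCacheDisabledOnlyAffectsDataBlocks {

      @Test
      public void blockCacheFalseStillCachesIndexAndBloom() {
        Configuration conf = HBaseConfiguration.create();
        HColumnDescriptor family = new HColumnDescriptor("cf");
        family.setBlockCacheEnabled(false);   // BLOCKCACHE => 'false' in the shell

        CacheConfig cacheConf = new CacheConfig(conf, family);

        // Data blocks should no longer be cached on read...
        assertFalse(cacheConf.shouldCacheDataOnRead());
        // ...but index and bloom blocks should still be cacheable on read.
        assertTrue(cacheConf.shouldCacheBlockOnRead(BlockCategory.INDEX));
        assertTrue(cacheConf.shouldCacheBlockOnRead(BlockCategory.BLOOM));
      }
    }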

Thanks !
Varun

 BLOCKCACHE in schema descriptor seems not aptly named
 -

 Key: HBASE-11007
 URL: https://issues.apache.org/jira/browse/HBASE-11007
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.18
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Hi,
 It seems that setting BLOCKCACHE key to false will disable the Data blocks 
 from being cached but will continue to cache bloom and index blocks. This 
 same property seems to be called cacheDataOnRead inside CacheConfig.
 Should this be called CACHE_DATA_ON_READ instead of BLOCKCACHE similar to the 
 other CACHE_DATA_ON_WRITE/CACHE_INDEX_ON_WRITE. We got quite confused and 
 ended up adding our own property CACHE_DATA_ON_READ - we also added some unit 
 tests for the same.
 What do folks think about this ?
 Thanks
 Varun



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-8836) Separate reader and writer thread pool in RegionServer, so that write throughput will not be impacted when the read load is very high

2013-10-28 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806615#comment-13806615
 ] 

Varun Sharma commented on HBASE-8836:
-

Hi,

We suffered from the same issue, although in our case there were too many 
writes compared to reads. The writes starved all the handlers; the reads 
themselves executed fast but ended up getting queued. We really cared about 
the reads, so we hacked a change into our HBase binary to build a separate 
pool for Get(s) (we do Gets for our reads). That improved read latencies 
vastly - p99 went down from 500ms to 100ms. We did not raise a JIRA since we 
thought the change would not be accepted.

This patch is very similar to what we are doing (a pool of Get handlers with 
QoS annotations), so it would be great if this could be patched in.
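
Our hack was essentially this idea, shown here as a standalone sketch rather 
than actual HBase RPC code (the pool sizes are arbitrary):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Route read RPCs and write RPCs to separate handler pools so a flood of
    // writes cannot queue up behind, or starve, the latency-sensitive reads.
    public class SplitHandlerPools {
      private final ExecutorService readHandlers = Executors.newFixedThreadPool(30);
      private final ExecutorService writeHandlers = Executors.newFixedThreadPool(30);

      public void dispatch(Runnable rpc, boolean isRead) {
        // With a single shared pool, slow/heavy writes and fast reads share
        // the same queue; splitting the pools isolates their tail latencies.
        (isRead ? readHandlers : writeHandlers).execute(rpc);
      }
    }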

Thanks
Varun

 Separate reader and writer thread pool in RegionServer, so that write 
 throughput will not be impacted when the read load is very high
 -

 Key: HBASE-8836
 URL: https://issues.apache.org/jira/browse/HBASE-8836
 Project: HBase
  Issue Type: New Feature
  Components: Performance, regionserver
Affects Versions: 0.94.8
Reporter: Tianying Chang
Assignee: Tianying Chang
 Fix For: 0.98.0

 Attachments: hbase-8836.patch, Hbase-8836-perfNumber.pdf, 
 HBase-8836-QosAnotation.patch, threadPool-write-NoWAL.png, 
 threadPool-write-WithWAL.png


 We found that when the read load on a specific RS is high, the write 
 throughput also gets impacted dramatically, and this can even cause write 
 data loss sometimes. We want to prioritize the writes by putting them in a 
 separate queue from the read requests, so that slower reads will not make 
 fast writes wait unnecessarily long.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-08-02 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13727959#comment-13727959
 ] 

Varun Sharma commented on HBASE-8434:
-

This can be closed since HBASE-8670 is now in.

 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.12

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is truly recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-18 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13712577#comment-13712577
 ] 

Varun Sharma commented on HBASE-8599:
-

0.94 patch looks good. Thanks [~lhofhansl]

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.95.2, 0.94.10

 Attachments: 8599-0.94.patch, 8599-0.94-v2.txt, 8599-trunk.patch, 
 8599-trunk-v2.patch, 8599-trunk-v3.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.
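
One possible shape of the fix, sketched in the same pseudocode as the 
description above (the committed patch may differ, and the exact arguments to 
logPositionAndCleanOldLogs are illustrative only):

    ReplicationSource::run() {
      if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
        // Record the replication position and clean up older, fully
        // replicated logs before advancing, so a source that always takes
        // this branch still shrinks its queue in ZK.
        logPositionAndCleanOldLogs(currentPath, currentWALisBeingWrittenTo);
        continue;
      }
      // Ship some edits and call logPositionAndCleanOldLogs
    }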

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-10 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: (was: 8599-trunk-v3.patch)

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch, 
 8599-trunk-v3.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-10 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: 8599-trunk-v3.patch

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch, 
 8599-trunk-v3.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-07-10 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704786#comment-13704786
 ] 

Varun Sharma commented on HBASE-8434:
-

So with 0.94 there are two cases:

1) Without mandatory lease recovery
In this case, the above config could add up to 60s + 60s, or 2 minutes, to 
MTTR. This is because the first lease recovery will never truly happen without 
HDFS-4721: the dead datanode will be chosen as the primary datanode, but since 
it is no longer heartbeating, lease recovery never actually commences. That 
adds 60 seconds, then another 60-second wait after the second invocation.

For folks who don't really care about lease recovery (minor data loss), I am 
not sure that changing the config is the right thing to do.

2) With mandatory lease recovery
Again we add 120 seconds to MTTR, but in return we end up enforcing real lease 
recovery even with the default HDFS timeouts.

 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.10

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is truly recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-07-10 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704808#comment-13704808
 ] 

Varun Sharma commented on HBASE-8370:
-

Looking at some metrics on a (read-heavy) cluster:

1) Index blocks - 99.998 % hit rate
2) Bloom blocks - 99.98 %
3) Data blocks - 95.4 %

I think data block cache hit rates might be more actionable than the 
combined/index/bloom hit rates.

 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metric that the region 
 server UI shows can give me a breakdown by data blocks. I always see this 
 number being very high, and that could be exaggerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially inflating the 
 cache hit ratio.
 Assuming the above is correct, do we already have a (perhaps more obscure) 
 cache hit ratio for data blocks alone? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-09 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: 8599-trunk-v3.patch

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch, 
 8599-trunk-v3.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-09 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13704092#comment-13704092
 ] 

Varun Sharma commented on HBASE-8599:
-

Attached a v3 without the logging...

Do we want to do the 0.94 backport in this JIRA ?

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch, 
 8599-trunk-v3.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-08 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702608#comment-13702608
 ] 

Varun Sharma commented on HBASE-8599:
-

I actually found it a little useful for testing, and since we only remove 
logs upon a log roll etc., it should cause minimal logging. Do you think it 
might be too much logging?

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-03 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699493#comment-13699493
 ] 

Varun Sharma commented on HBASE-8599:
-

Friendly ping folks, this passes replication tests. I can create the backport 
for 0.94 if it looks good.

Thanks
Varun

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-07-03 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699734#comment-13699734
 ] 

Varun Sharma commented on HBASE-8599:
-

I tried that approach in v1, but it consistently kept failing tests which I 
could not fix.

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8813) Fix time b/w recoverLease invocations from HBASE 8449

2013-06-27 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694884#comment-13694884
 ] 

Varun Sharma commented on HBASE-8813:
-

Attached patch increasing to 64 seconds.

 Fix time b/w recoverLease invocations from HBASE 8449
 -

 Key: HBASE-8813
 URL: https://issues.apache.org/jira/browse/HBASE-8813
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.0, 0.95.1
Reporter: Varun Sharma
 Attachments: 8813.patch


 The time b/w recover lease attempts is conservative but is still not correct. 
 It does not factor in Datanode heartbeat time intervals.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8813) Fix time b/w recoverLease invocations from HBASE 8449

2013-06-27 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8813:


Attachment: 8813.patch

 Fix time b/w recoverLease invocations from HBASE 8449
 -

 Key: HBASE-8813
 URL: https://issues.apache.org/jira/browse/HBASE-8813
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.0, 0.95.1
Reporter: Varun Sharma
 Attachments: 8813.patch


 The time b/w recover lease attempts is conservative but is still not correct. 
 It does not factor in Datanode heartbeat time intervals.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8813) Fix time b/w recoverLease invocations from HBASE 8449

2013-06-27 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8813:


Assignee: Varun Sharma
  Status: Patch Available  (was: Open)

 Fix time b/w recoverLease invocations from HBASE 8449
 -

 Key: HBASE-8813
 URL: https://issues.apache.org/jira/browse/HBASE-8813
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.1, 0.98.0
Reporter: Varun Sharma
Assignee: Varun Sharma
 Attachments: 8813.patch


 The time b/w recover lease attempts is conservative but is still not correct. 
 It does not factor in Datanode heartbeat time intervals.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8815) A replicated cross cluster client

2013-06-27 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8815:
---

 Summary: A replicated cross cluster client
 Key: HBASE-8815
 URL: https://issues.apache.org/jira/browse/HBASE-8815
 Project: HBase
  Issue Type: New Feature
Reporter: Varun Sharma


I would like to float this idea for brainstorming.

HBase is a strongly consistent system modelled after Bigtable, which means a 
machine going down results in roughly 2 minutes of lost availability as it 
stands today. So there is a trade-off.

However, for high availability and redundancy, it is common practice for 
online/mission-critical applications to run replicated clusters. For example, 
we run replicated clusters at Pinterest in different EC2 AZ(s), and at Google, 
critical data is always replicated across Bigtable cells.

At high volumes, 2 minutes of downtime can also be critical. However, today 
our client does not make use of the fact that there is an available slave 
replica cluster from which slightly inconsistent data can be read; it only 
reads from one cluster. When you have replication, it is very common practice 
to read from the slave if the error rate from the master is high. That is how 
web sites serve data out of MySQL and survive machine failures: by directing 
their reads to slave machines when the master goes down.

I am sure folks love the strong consistency guarantee from HBase, but I think 
this way we can make better use of the replica cluster, much in the same way 
people use MySQL slaves for reads. In the case of regions going offline, it 
would be nice if, for the offline regions only (a small fraction), reads could 
be directed to the slave cluster.

I know one company which follows this model: at Google, a replicated client 
API is used for reads, which can farm reads out to multiple clusters and also 
write to multiple clusters depending on availability in the case of 
multi-master replication.
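
As a rough illustration of the client I have in mind (this is not an existing 
HBase API, just a sketch that wraps two HTableInterface handles and falls back 
to the slave cluster when a read against the master cluster fails):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Result;

    public class ReplicatedReadClient {
      private final HTableInterface master;
      private final HTableInterface slave;

      public ReplicatedReadClient(HTableInterface master, HTableInterface slave) {
        this.master = master;
        this.slave = slave;
      }

      public Result get(Get get) throws IOException {
        try {
          return master.get(get);   // strongly consistent path
        } catch (IOException e) {
          // Possibly stale (replication is asynchronous), but available.
          return slave.get(get);
        }
      }
    }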
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13693798#comment-13693798
 ] 

Varun Sharma commented on HBASE-8370:
-

Here are some stats for this JIRA - I am arguing that the BlockCacheHit ratio 
number reported on a region server does not mean much.

tbl.feeds.cf.home.bt.Index.fsBlockReadCnt : 46864,
tbl.feeds.cf.home.bt.Index.fsBlockReadCacheHitCnt : 46864

Index Block cache hit ratio = 100 %

tbl.feeds.cf.home.bt.Data.fsBlockReadCacheHitCnt : 202
tbl.feeds.cf.home.bt.Data.fsBlockReadCnt : 247

Data Block cache hit ratio = 82 %

Overall cache hit ratio = (46864 + 202) / (46864 + 247) = 99 %

Since indexes are hit often, their cache hit rate is 100 % and the number of 
hits is high. The real number we care about is the 82 % hit rate on data 
blocks. However, we continue to show 99 % on the region server console 
instead. I think we need to fix that number. Please let me know if folks 
object to this.
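
A quick back-of-the-envelope check of the numbers above (plain Java, just to 
make the arithmetic explicit):

    public class CacheHitRatios {
      public static void main(String[] args) {
        // Counters pasted above for tbl.feeds.cf.home.bt.*
        long indexReads = 46864, indexHits = 46864;
        long dataReads  = 247,   dataHits  = 202;

        double indexRatio   = 100.0 * indexHits / indexReads;              // 100 %
        double dataRatio    = 100.0 * dataHits  / dataReads;               // ~82 %
        double overallRatio = 100.0 * (indexHits + dataHits)
                                    / (indexReads + dataReads);            // ~99.9 %

        System.out.printf("index=%.1f%% data=%.1f%% overall=%.1f%%%n",
            indexRatio, dataRatio, overallRatio);
      }
    }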

 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metric that the region 
 server UI shows can give me a breakdown by data blocks. I always see this 
 number being very high, and that could be exaggerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially inflating the 
 cache hit ratio.
 Assuming the above is correct, do we already have a (perhaps more obscure) 
 cache hit ratio for data blocks alone? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694296#comment-13694296
 ] 

Varun Sharma commented on HBASE-8370:
-

So, it seems that we have per-block-type metrics from SchemaMetrics under the 
region server, and they are exposed at /jmx.

The question is which metric we should report on the region server UI. Right 
now all our clusters show a 99 % cache hit ratio, which is misleading, since 
there is a data block miss - and therefore a disk hit - for roughly 20 % of 
requests.

I have been misled by this number in the past, and I think others could be 
similarly misled. So, should we just report another, more representative 
metric on the region server console?

Varun

 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metric that the region 
 server UI shows can give me a breakdown by data blocks. I always see this 
 number being very high, and that could be exaggerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially inflating the 
 cache hit ratio.
 Assuming the above is correct, do we already have a (perhaps more obscure) 
 cache hit ratio for data blocks alone? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694336#comment-13694336
 ] 

Varun Sharma commented on HBASE-8370:
-

RE: We don't have per block type metrics in trunk/95 because the overall cache 
hit percentage is a good proxy for data block cache percent. Yes the overall 
number is higher but it still gives a good actionable number. You can know if 
you're doing better or worse than you were before. Even better is the 
derivative of cache miss count.

I am not sure this is true - this number (blockCacheHitCachingRatio) is always 
99 % for us on all clusters - how can a number which never changes ever be 
actionable? Even with decimal places, it is never going to change because the 
index blocks dominate it.

Also, the difference between an 82 % cache hit ratio and a 99 % cache hit 
ratio is enormous. Controlling your p80 latency is a *lot* easier than your 
p99. A cache hit ratio of 99 % just gives you a false sense of security that 
you have controlled your p99 latency. This is important for online serving, 
maybe not for enterprise use.

I guess we don't need to bring back SchemaMetrics to fix this, but we can have 
block-level metrics. At the very least I want to be sure that index blocks 
have a 100 % cache hit rate, because if that's not happening, then I am in a 
bad situation. It would be better to not have folks using HBase for online 
storage play a guessing game as to the true effectiveness of the cache.


 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metric that the region 
 server UI shows can give me a breakdown by data blocks. I always see this 
 number being very high, and that could be exaggerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially inflating the 
 cache hit ratio.
 Assuming the above is correct, do we already have a (perhaps more obscure) 
 cache hit ratio for data blocks alone? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694361#comment-13694361
 ] 

Varun Sharma commented on HBASE-8370:
-

Having a cache hit ratio of 80 % means that at least 80 % of my requests are 
fast (assuming GC is out of the picture). In the current scenario that may map 
to an overall number like 99.9 %, and tomorrow, if I had 0 % cache hits for 
data blocks, the number would only come down to 99.5 % - I can calculate this 
from the numbers I pasted above. It assumes a certain distribution between the 
number of accesses to index blocks and data blocks. Tomorrow, if that 
distribution changes, it may well be that a 99.5 % overall cache hit ratio 
corresponds to a 90 % hit rate on data blocks. So I don't think the overall 
cache hit ratio is a good proxy for the data block cache hit ratio.
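
The same arithmetic with the counters from the earlier comment, assuming 0 % 
data-block hits, shows how little the aggregate moves:

    public class OverallVsDataBlockRatio {
      public static void main(String[] args) {
        // Earlier counters, but with every data-block read assumed to miss.
        long indexHits = 46864, indexReads = 46864, dataReads = 247;
        double overall = 100.0 * indexHits / (indexReads + dataReads);
        System.out.printf("overall hit ratio with zero data-block hits: %.1f%%%n",
            overall);   // ~99.5 %, even though the data-block hit rate is 0 %
      }
    }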

As far as derivatives go, the miss count derivative can go up with other 
things like read request count - so now we would also need to take a 
derivative on that counter and compare, etc. On 0.94, that number has been 
overflowing for us all the time and goes negative; is that being fixed in 
trunk?

I don't think this is about counters vs gauges. I am fine with exposing 
counters per block type. Right now I just don't have any insight into the 
block cache, which plays an important role in serving reads. When a compaction 
happens and new files are written, I don't know the number of cache misses for 
index blocks vs data blocks vs bloom blocks, nor how many data blocks are 
being accessed versus index blocks.


 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metrics that the region 
 server UI shows, can get me a break down by data blocks. I always see this 
 number to be very high and that could be exagerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially bloating up the 
 cache hit ratio.
 Assuming the above is correct, do we already have a cache hit ratio for data 
 blocks alone which is more obscure ? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694389#comment-13694389
 ] 

Varun Sharma commented on HBASE-8370:
-

We can make the hit percent a double.

But if we never evict index blocks, one option is to only count DataBlocks for 
HitPercent, CacheHitCount, CacheMissCount. I know that is not the case for 
0.94. Is that the case for trunk or can we change these metrics to only 
instrument data blocks then ?

Anyone else have opinions ?

 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metrics that the region 
 server UI shows, can get me a break down by data blocks. I always see this 
 number to be very high and that could be exagerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially bloating up the 
 cache hit ratio.
 Assuming the above is correct, do we already have a cache hit ratio for data 
 blocks alone which is more obscure ? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694433#comment-13694433
 ] 

Varun Sharma commented on HBASE-8370:
-

Also, coming back to the point about the metrics being actionable.

RE: If your bloom block cache hit count goes down you can do... Not much. 
Not worth counting if you can't take action on it.

I disagree that it's not actionable. I would go fix the block cache in this 
case. It means there is something seriously wrong with our implementation of 
the block cache if we are evicting bloom blocks - maybe it's just me, but I 
feel we should not be evicting bloom blocks.

- If the cache hit rate is too low on data blocks, the action item is to 
increase the block cache size.

I would agree that index block metrics are not needed or actionable if it is 
indeed the case that we pin index blocks forever.





 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
  Components: metrics
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metrics that the region 
 server UI shows, can get me a break down by data blocks. I always see this 
 number to be very high and that could be exagerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially bloating up the 
 cache hit ratio.
 Assuming the above is correct, do we already have a cache hit ratio for data 
 blocks alone which is more obscure ? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-06-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13694488#comment-13694488
 ] 

Varun Sharma commented on HBASE-8434:
-

HBASE-8670 is a backport of HBASE-8449, which refactors the lease recovery 
retries in the wake of the findings in HBASE-8389.

In HBASE-8389, we basically decided to give up on lease recovery for 0.94 and 
accept data loss. The reason was that real lease recovery required a whole 
bunch of timeouts to fit together nicely, or else you could have endless race 
conditions and a very long recovery from region server failures. HBASE-8449 
tried to fix that by refactoring these retries and timeouts a little, and 
HBASE-8670 is trying to backport that refactoring.

This issue, on the other hand, is a simple change that lets users who care 
about their data enable mandatory lease recovery through a config parameter, 
though it also requires them to run with appropriate timeouts.

I agree that this is not strictly required for 0.94, but without it we may 
have data loss.

 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.10

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is truly recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8813) Fix time b/w recoverLease invocations from HBASE 8449

2013-06-26 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8813:
---

 Summary: Fix time b/w recoverLease invocations from HBASE 8449
 Key: HBASE-8813
 URL: https://issues.apache.org/jira/browse/HBASE-8813
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.95.1, 0.98.0
Reporter: Varun Sharma


The time b/w recover lease attempts is conservative but is still not correct. 
It does not factor in Datanode heartbeat time intervals.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-06-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13693584#comment-13693584
 ] 

Varun Sharma commented on HBASE-8599:
-

Attached a v2 for trunk which passes on my local machine. If this passes 
jenkins, will attach a patch for 0.94 along the same lines.

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-06-25 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: 8599-trunk-v2.patch

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.10

 Attachments: 8599-0.94.patch, 8599-trunk.patch, 8599-trunk-v2.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
   if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
     // If we are here, then we advance to the next WAL without any cleaning
     // and close existing WAL
     continue;
   }
   // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once, then the older
 logs are not cleaned out and persist in the ZooKeeper node, since we simply
 call continue and skip the subsequent logPositionAndCleanOldLogs call; if it
 is called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8401) Backport HBASE-8284 Allow String Offset(s) in ColumnPaginationFilter for bookmark based pagination

2013-06-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690833#comment-13690833
 ] 

Varun Sharma commented on HBASE-8401:
-

Going back to your comment about backward compatibility being broken - the 
serialization code looks like this (the readFields signature is the unchanged 
line from the surrounding Writable implementation):

   public void readFields(DataInput in) throws IOException
   {
     this.limit = in.readInt();
     this.offset = in.readInt();
+    this.columnOffset = Bytes.readByteArray(in);
+    if (this.columnOffset.length == 0) {
+      this.columnOffset = null;
+    }
   }

   public void write(DataOutput out) throws IOException
   {
     out.writeInt(this.limit);
     out.writeInt(this.offset);
+    Bytes.writeByteArray(out, this.columnOffset);
   }

Here is my understanding of how it impacts compatibility, and it may be wrong:
1) Client has the patch but the region server does not
In this case, the client would marshal extra data, which will be discarded by 
the region server.

2) Client does not have the patch but the region server does
In this case, the region server would read no additional data and assume the 
columnOffset to be null.

I am okay with introducing a new Filter class for this or, alternatively, 
introducing a coprocessor - I am not certain of the performance implications, 
but we have been running the Filter implementation in production for a while. 
As far as utility goes, looking at the other JIRA (HBASE-8284), there was 
interest in implementing this functionality.
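
For reference, this is the kind of usage the backport would enable, assuming 
the ColumnPaginationFilter(int, byte[]) constructor added by HBASE-8284 (the 
row and column names are made up):

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Bookmark-style pagination: instead of an integer offset, resume the
    // scan of a wide row from the last column already seen by the client.
    public class ColumnOffsetPagination {
      public static Get nextPage(byte[] row, byte[] lastColumnSeen, int pageSize) {
        Get get = new Get(row);
        // Returns up to pageSize columns starting at (or after) lastColumnSeen.
        get.setFilter(new ColumnPaginationFilter(pageSize, lastColumnSeen));
        return get;
      }

      public static void main(String[] args) {
        Get page = nextPage(Bytes.toBytes("user123"),
            Bytes.toBytes("follower:00042"), 50);
        System.out.println(page);
      }
    }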


 Backport HBASE-8284 Allow String Offset(s) in ColumnPaginationFilter for 
 bookmark based pagination
 

 Key: HBASE-8401
 URL: https://issues.apache.org/jira/browse/HBASE-8401
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Varun Sharma

 This issue is for discussion of whether or not backport HBASE-8284.  It has 
 been applied to trunk and 0.95.  A patch for 0.94 is over on hbase-8284

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8626) RowMutations fail when Delete and Put on same columnFamily/column/row

2013-05-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667367#comment-13667367
 ] 

Varun Sharma commented on HBASE-8626:
-

I think this is a manifestation of 
https://issues.apache.org/jira/browse/HBASE-2256

This is a known issue, and I don't think we can fix it since the delete and 
the put get the same timestamp. To get around this, the client needs to 
manually specify a later timestamp for the put(s) than for the deletes.
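
A sketch of that client-side workaround with the 0.94-style API (the family, 
qualifier, and value here are placeholders):

    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RowMutations;
    import org.apache.hadoop.hbase.util.Bytes;

    // Give the Put a strictly later timestamp than the Delete so the Delete
    // does not mask it (otherwise both get the same server-assigned timestamp
    // inside the atomic RowMutations).
    public class DeleteThenPutWorkaround {
      public static RowMutations build(byte[] row) throws java.io.IOException {
        long now = System.currentTimeMillis();

        Delete delete = new Delete(row);
        delete.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("q"), now);

        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), now + 1,
            Bytes.toBytes("new-value"));

        RowMutations rm = new RowMutations(row);
        rm.add(delete);
        rm.add(put);
        return rm;
      }
    }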

 RowMutations fail when Delete and Put on same columnFamily/column/row
 -

 Key: HBASE-8626
 URL: https://issues.apache.org/jira/browse/HBASE-8626
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.7, 0.95.0
 Environment: Ubuntu 12.04, HBase 0.94.7
Reporter: Vinod
 Fix For: 0.94.7, 0.95.1

 Attachments: TestRowMutations.java, tests_for_row_mutations1.patch


 When RowMutations have a Delete followed by Put to same column family or 
 columns or rows, only the Delete is happening while the Put is ignored so 
 atomicity of RowMutations is broken for such cases.
 Attached is a unit test where the following tests are failing:
 - testDeleteCFThenPutInSameCF: Delete a column family and then Put to same 
 column family.
 - testDeleteColumnThenPutSameColumn: Delete a column and then Put to same 
 column.
 - testDeleteRowThenPutSameRow: Delete a row and then Put to same row

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8626) RowMutations fail when Delete and Put on same columnFamily/column/row

2013-05-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667368#comment-13667368
 ] 

Varun Sharma commented on HBASE-8626:
-

Alternatively, one thing we could do is have HBase give a higher timestamp to 
all ops other than Delete ops in RowMutations, and give the Delete ops a lower 
timestamp. That would be one way to fix this.

 RowMutations fail when Delete and Put on same columnFamily/column/row
 -

 Key: HBASE-8626
 URL: https://issues.apache.org/jira/browse/HBASE-8626
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.7, 0.95.0
 Environment: Ubuntu 12.04, HBase 0.94.7
Reporter: Vinod
 Fix For: 0.94.7, 0.95.1

 Attachments: TestRowMutations.java, tests_for_row_mutations1.patch


 When RowMutations have a Delete followed by Put to same column family or 
 columns or rows, only the Delete is happening while the Put is ignored so 
 atomicity of RowMutations is broken for such cases.
 Attached is a unit test where the following tests are failing:
 - testDeleteCFThenPutInSameCF: Delete a column family and then Put to same 
 column family.
 - testDeleteColumnThenPutSameColumn: Delete a column and then Put to same 
 column.
 - testDeleteRowThenPutSameRow: Delete a row and then Put to same row

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8626) RowMutations fail when Delete and Put on same columnFamily/column/row

2013-05-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13667372#comment-13667372
 ] 

Varun Sharma commented on HBASE-8626:
-

I actually meant that we do this only for transactions which contain a mix of 
deletes and puts with overlaps like this one.

Another way to fix this would be to put the responsibility on the client to 
break out the mutations, and possibly add some documentation...

 RowMutations fail when Delete and Put on same columnFamily/column/row
 -

 Key: HBASE-8626
 URL: https://issues.apache.org/jira/browse/HBASE-8626
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.7, 0.95.0
 Environment: Ubuntu 12.04, HBase 0.94.7
Reporter: Vinod
 Fix For: 0.94.7, 0.95.1

 Attachments: TestRowMutations.java, tests_for_row_mutations1.patch


 When RowMutations have a Delete followed by Put to same column family or 
 columns or rows, only the Delete is happening while the Put is ignored so 
 atomicity of RowMutations is broken for such cases.
 Attached is a unit test where the following tests are failing:
 - testDeleteCFThenPutInSameCF: Delete a column family and then Put to same 
 column family.
 - testDeleteColumnThenPutSameColumn: Delete a column and then Put to same 
 column.
 - testDeleteRowThenPutSameRow: Delete a row and then Put to same row

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8599:
---

 Summary: HLogs in ZK are not cleaned up when replication lag is 
minimal
 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.94.7
Reporter: Varun Sharma


On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
on source), we found HLogs accumulating and not being cleaned up as new WAL(s) 
are rolled.

Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
the current WAL is not being written to any more - as suggested by 
currentWALBeingWrittenTo being false. However, when lags are small, we may hit 
the following block first and continue onto the next WAL without clearing the 
old WAL(s)...

ReplicationSource::run() {
  if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = false)) {
    // If we are here, then we advance to the next WAL without any cleaning
    // and close existing WAL
    continue;
  }
  // Ship some edits and call logPositionAndCleanOldLogs
}

If we hit readAllEntriesToReplicateOrNextFile(false) only once, then older 
logs are not cleaned out and persist in the zookeeper node, since we simply call 
continue and skip the subsequent logPositionAndCleanOldLogs call - if it's 
called more than once, we do end up clearing the old logs.
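
For illustration, a self-contained toy model of the loop above (stub methods, not 
the actual ReplicationSource code) showing where the proposed cleanup call would go:

public class ReplicationLoopSketch {
  private boolean active = true;

  public void run() {
    while (active) {
      if (readAllEntriesToReplicateOrNextFile(false)) {
        // Proposed fix: record the position and clean old logs from ZK before
        // advancing to the next WAL, instead of skipping straight past the cleanup.
        logPositionAndCleanOldLogs();
        continue;
      }
      shipEdits();
      logPositionAndCleanOldLogs();
    }
  }

  // Stubs standing in for the real ReplicationSource internals.
  private boolean readAllEntriesToReplicateOrNextFile(boolean currentWALisBeingWrittenTo) { return false; }
  private void shipEdits() { active = false; }
  private void logPositionAndCleanOldLogs() { }
}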






--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: 8599-0.94.patch

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.94.7
Reporter: Varun Sharma
 Attachments: 8599-0.94.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13665445#comment-13665445
 ] 

Varun Sharma commented on HBASE-8599:
-

Attached a patch for 0.94 to clean out logs from ZK whenever we close a WAL in 
ReplicationSource and choose to advance to the next WAL.

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.94.7
Reporter: Varun Sharma
 Attachments: 8599-0.94.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Affects Version/s: 0.98.0

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
 Attachments: 8599-0.94.patch, 8599-trunk.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: 8599-trunk.patch

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
 Attachments: 8599-0.94.patch, 8599-trunk.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Fix Version/s: 0.94.9
   0.98.0

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
 Fix For: 0.98.0, 0.94.9

 Attachments: 8599-0.94.patch, 8599-trunk.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: (was: 8599-trunk.patch)

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
 Fix For: 0.98.0, 0.94.9

 Attachments: 8599-0.94.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Attachment: 8599-trunk.patch

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.98.0, 0.94.7
Reporter: Varun Sharma
 Fix For: 0.98.0, 0.94.9

 Attachments: 8599-0.94.patch, 8599-trunk.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8599) HLogs in ZK are not cleaned up when replication lag is minimal

2013-05-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8599:


Assignee: Varun Sharma
  Status: Patch Available  (was: Open)

 HLogs in ZK are not cleaned up when replication lag is minimal
 --

 Key: HBASE-8599
 URL: https://issues.apache.org/jira/browse/HBASE-8599
 Project: HBase
  Issue Type: Improvement
  Components: Replication
Affects Versions: 0.94.7, 0.98.0
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.98.0, 0.94.9

 Attachments: 8599-0.94.patch, 8599-trunk.patch


 On a cluster with very low replication lag (as measured by ageOfLastShippedOp 
 on source), we found HLogs accumulating and not being cleaned up as new 
 WAL(s) are rolled.
 Each time, we call logPositionAndCleanOldLogs() to clean older logs whenever 
 the current WAL is not being written to any more - as suggested by 
 currentWALBeingWrittenTo being false. However, when lags are small, we may 
 hit the following block first and continue onto the next WAL without clearing 
 the old WAL(s)...
 ReplicationSource::run() {
 if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo = 
 false)) {
 // If we are here, then we advance to the next WAL without any 
 cleaning
 // and close existing WAL
 continue;
 }
 // Ship some edits and call logPositionAndCleanOldLogs
 }
 If we hit readAllEntriesToReplicateOrNextFile(false) only once - then older 
 logs are not cleaned out and persist in the zookeeper node since we simply 
 call continue and skip the subsequent logPositionAndCleanOldLogs call - if 
 its called more than once, we do end up clearing the old logs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8554) Scan should seek to startRow within a region

2013-05-15 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8554:
---

 Summary: Scan should seek to startRow within a region
 Key: HBASE-8554
 URL: https://issues.apache.org/jira/browse/HBASE-8554
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma


Currently Scan.startRow() is only used for determining which Region to look up 
the row in, but we do not seek within the region to the start row. Since it's 
not uncommon to run with large-sized regions (> 5G) these days, this is suboptimal.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8554) Scan should seek to startRow within a region

2013-05-15 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13658669#comment-13658669
 ] 

Varun Sharma commented on HBASE-8554:
-

I think this is tricky since we only want to seek for the first region and not 
the later ones, but I still think this is doable by comparing the current 
region's startKey with the Scan's startRow.
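
A small sketch of that comparison (not actual HBase code; class and method names 
are illustrative):

import org.apache.hadoop.hbase.util.Bytes;

public class StartRowSeekSketch {
  // Seek to startRow only when it falls strictly inside the current region;
  // for every region after the first one, the region start key is already at
  // or past the scan's startRow, so no extra seek is needed.
  static boolean shouldSeekToStartRow(byte[] regionStartKey, byte[] scanStartRow) {
    return scanStartRow != null && scanStartRow.length > 0
        && Bytes.compareTo(scanStartRow, regionStartKey) > 0;
  }
}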

 Scan should seek to startRow within a region
 

 Key: HBASE-8554
 URL: https://issues.apache.org/jira/browse/HBASE-8554
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma

 Currently Scan.startRow() is only used for determining which Region to look 
 up the row into but we do not seek within the region to the start row. Since 
 its not uncommon to run with large sized regions these days 5G, this is 
 suboptimal.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8554) Scan should seek to startRow within a region

2013-05-15 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13658720#comment-13658720
 ] 

Varun Sharma commented on HBASE-8554:
-

Agreed. Closing for now - I followed the response from Anoop and raised this 
issue... Thanks !

 Scan should seek to startRow within a region
 

 Key: HBASE-8554
 URL: https://issues.apache.org/jira/browse/HBASE-8554
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma

 Currently Scan.startRow() is only used for determining which Region to look 
 up the row into but we do not seek within the region to the start row. Since 
 its not uncommon to run with large sized regions these days 5G, this is 
 suboptimal.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HBASE-8554) Scan should seek to startRow within a region

2013-05-15 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma resolved HBASE-8554.
-

Resolution: Fixed

 Scan should seek to startRow within a region
 

 Key: HBASE-8554
 URL: https://issues.apache.org/jira/browse/HBASE-8554
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma

 Currently Scan.startRow() is only used for determining which Region to look 
 up the row into but we do not seek within the region to the start row. Since 
 its not uncommon to run with large sized regions these days 5G, this is 
 suboptimal.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (HBASE-8554) Scan should seek to startRow within a region

2013-05-15 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma reopened HBASE-8554:
-


 Scan should seek to startRow within a region
 

 Key: HBASE-8554
 URL: https://issues.apache.org/jira/browse/HBASE-8554
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma

 Currently Scan.startRow() is only used for determining which Region to look 
 up the row into but we do not seek within the region to the start row. Since 
 its not uncommon to run with large sized regions these days 5G, this is 
 suboptimal.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HBASE-8554) Scan should seek to startRow within a region

2013-05-15 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma resolved HBASE-8554.
-

Resolution: Invalid

 Scan should seek to startRow within a region
 

 Key: HBASE-8554
 URL: https://issues.apache.org/jira/browse/HBASE-8554
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma

 Currently Scan.startRow() is only used for determining which Region to look 
 up the row into but we do not seek within the region to the start row. Since 
 its not uncommon to run with large sized regions these days 5G, this is 
 suboptimal.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8362) Possible MultiGet optimization

2013-05-03 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13648615#comment-13648615
 ] 

Varun Sharma commented on HBASE-8362:
-

Is this being actively worked on ?

Thanks
Varun

 Possible MultiGet optimization
 --

 Key: HBASE-8362
 URL: https://issues.apache.org/jira/browse/HBASE-8362
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl

 Currently MultiGets are executed on a RegionServer in a single thread in a 
 loop that handles each Get separately (opening a scanner, seeking, etc).
 It seems we could optimize this (per region at least) by opening a single 
 scanner and issuing a reseek for each Get that was requested.
 I have not tested this yet and no patch, but I would like to solicit feedback 
 on this idea.
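
 A rough sketch of that idea against 0.94-era server-side APIs (the helper is 
 hypothetical; a real implementation would also have to honour per-Get filters 
 and versions, and verify the returned cells actually belong to the requested row):

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.Comparator;
 import java.util.List;
 import org.apache.hadoop.hbase.KeyValue;
 import org.apache.hadoop.hbase.client.Get;
 import org.apache.hadoop.hbase.client.Scan;
 import org.apache.hadoop.hbase.regionserver.HRegion;
 import org.apache.hadoop.hbase.regionserver.RegionScanner;
 import org.apache.hadoop.hbase.util.Bytes;

 public class MultiGetReseekSketch {
   // Sort the Gets for one region and reuse a single scanner, reseeking to each
   // row instead of opening a new scanner per Get.
   static List<List<KeyValue>> multiGet(HRegion region, List<Get> gets) throws IOException {
     List<Get> sorted = new ArrayList<Get>(gets);
     Collections.sort(sorted, new Comparator<Get>() {
       public int compare(Get a, Get b) { return Bytes.compareTo(a.getRow(), b.getRow()); }
     });
     List<List<KeyValue>> results = new ArrayList<List<KeyValue>>();
     RegionScanner scanner = region.getScanner(new Scan(sorted.get(0)));
     try {
       for (Get get : sorted) {
         scanner.reseek(get.getRow());        // jump forward to this row
         List<KeyValue> row = new ArrayList<KeyValue>();
         scanner.next(row);                   // read cells at the new position
         results.add(row);                    // caller must still check the row key
       }
     } finally {
       scanner.close();
     }
     return results;
   }
 }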

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-05-03 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13648616#comment-13648616
 ] 

Varun Sharma commented on HBASE-8434:
-

Friendly ping...

 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is truly recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-05-02 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13647915#comment-13647915
 ] 

Varun Sharma commented on HBASE-8389:
-

No, it does not. We primarily fix it by fixing it in HDFS since that's where the 
core of the problem is (slow recovery).

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do,  initiate_recovery --> commit_block --> 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
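
 A hedged sketch of that retry schedule against the public HDFS client API 
 (configuredTimeoutMs and the WAL path are supplied by the caller; error handling 
 and an overall deadline are omitted, and HDFS is assumed to be the default filesystem):

 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hdfs.DistributedFileSystem;

 public class LeaseRecoverySketch {
   // One quick retry to get past the moot first attempt, then retries spaced by
   // a timeout large enough for a real block recovery to complete.
   static void recoverLeaseWithBackoff(Configuration conf, Path walPath,
       long configuredTimeoutMs) throws IOException, InterruptedException {
     DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
     if (dfs.recoverLease(walPath)) return;   // true means the file is already closed
     Thread.sleep(1000);
     while (!dfs.recoverLease(walPath)) {
       Thread.sleep(configuredTimeoutMs);
     }
   }
 }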
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-04-29 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644806#comment-13644806
 ] 

Varun Sharma commented on HBASE-8434:
-

8389 is pretty much closed now and 8449 is for how to get around this issue in 
trunk. So, after 8389, we need to decide if we will ever do a real lease 
recovery in 0.94.

 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is truly recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8401) Backport HBASE-8284 Allow String Offset(s) in ColumnPaginationFilter for bookmark based pagination

2013-04-27 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8401:


Fix Version/s: 0.94.8

 Backport HBASE-8284 Allow String Offset(s) in ColumnPaginationFilter for 
 bookmark based pagination
 

 Key: HBASE-8401
 URL: https://issues.apache.org/jira/browse/HBASE-8401
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Varun Sharma
 Fix For: 0.94.8


 This issue is for discussion of whether or not to backport hbase-8284.  It has 
 been applied to trunk and 0.95.  A patch for 0.94 is over on hbase-8284

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643136#comment-13643136
 ] 

Varun Sharma commented on HBASE-8389:
-

[~saint@gmail.com]
I can do a small write up that folks can refer to.

[~nkeywal]
One point regarding the low setting though. It's good for fast MTTR requirements 
such as online clusters but it does not work well if you pound a small cluster 
with mapreduce jobs. The write timeouts start kicking in on datanodes - we saw 
this on a small cluster. So it has to be taken with a pinch of salt.

I think 4 seconds might be too tight, because we have the following sequence -
1) recoverLease called
2) The primary node heartbeats (this can be 3 seconds in the worst case)
3) There are multiple timeouts during recovery at the primary datanode:
a) dfs.socket.timeout kicks in when we suspend the processes using kill 
-STOP - there is only 1 retry
b) ipc.client.connect.timeout is the troublemaker - on old hadoop versions 
it is hardcoded at 20 seconds. On some versions, the # of retries is hardcoded 
at 45. This can be triggered by firewalling a host using iptables to drop all 
incoming/outgoing TCP packets. Another issue here is that between the timeouts 
there is a 1 second hardcoded sleep :) - I just fixed it in HADOOP 9503. If we 
make sure that all the dfs.socket.timeout and ipc client settings are the same 
in hbase-site.xml and hdfs-site.xml. Then, we can

The retry rate should be no faster than 3a and 3b - or lease recoveries will 
accumulate for 900 seconds in trunk. To get around this problem, we would want 
to make sure that hbase-site.xml has the same settings as hdfs-site.xml. And we 
calculate the recovery interval from those settings. Otherwise, we can leave a 
release note saying that this number should be max(dfs.socket.timeout, 
ipc.client.connect.max.retries.on.timeouts * ipc.client.connect.timeout, 
ipc.client.connect.max.retries).

The advantage of having HDFS 4721 is that at some point the data node will be 
recognized as stale - maybe a little later than hdfs recovery. Once that 
happens, recoveries typically occur within 2 seconds.
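
As a sketch, the release-note formula above could be computed from the shared 
settings roughly like this (property names as listed in this comment; the default 
values shown are only illustrative):

import org.apache.hadoop.conf.Configuration;

public class RecoverLeaseRetryIntervalSketch {
  static long retryIntervalMs(Configuration conf) {
    long dfsSocketTimeout = conf.getLong("dfs.socket.timeout", 60000L);
    long ipcConnectTimeout = conf.getLong("ipc.client.connect.timeout", 20000L);
    long retriesOnTimeouts = conf.getLong("ipc.client.connect.max.retries.on.timeouts", 45L);
    long connectRetries = conf.getLong("ipc.client.connect.max.retries", 10L);
    // The last term is taken verbatim from the comment above; in practice it
    // would presumably also be multiplied by the per-retry connect timeout/sleep.
    return Math.max(dfsSocketTimeout,
        Math.max(retriesOnTimeouts * ipcConnectTimeout, connectRetries));
  }
}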

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes  10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do,  initiate_recovery -- commit_block -- 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either 

[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-26 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13643226#comment-13643226
 ] 

Varun Sharma commented on HBASE-8389:
-

Sorry about that...

If we make sure that all the dfs.socket.timeout and ipc client settings are the 
same in hbase-site.xml and hdfs-site.xml, then we can do a custom calculation 
of the recoverLease retry interval inside HBase. But basically HBase needs to 
know in some way how the timeouts are set up underneath.

Thanks
Varun

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes  10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do,  initiate_recovery -- commit_block -- 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8449) Refactor recoverLease retries and pauses informed by findings over in hbase-8389

2013-04-26 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8449:


Summary: Refactor recoverLease retries and pauses informed by findings over 
in hbase-8389  (was: Refactor recoverLease retries and pauses informed by 
findings over in hbase-8354)

 Refactor recoverLease retries and pauses informed by findings over in 
 hbase-8389
 

 Key: HBASE-8449
 URL: https://issues.apache.org/jira/browse/HBASE-8449
 Project: HBase
  Issue Type: Bug
  Components: Filesystem Integration
Affects Versions: 0.94.7, 0.95.0
Reporter: stack
Priority: Critical
 Fix For: 0.95.1


 HBASE-8354 is an interesting issue that roams near and far.  This issue is 
 about making use of the findings handily summarized on the end of hbase-8354 
 which have it that trunk needs refactor around how it does its recoverLease 
 handling (and that the patch committed against HBASE-8354 is not what we want 
 going forward).
 This issue is about making a patch that adds a lag between recoverLease 
 invocations where the lag is related to dfs timeouts -- the hdfs-side dfs 
 timeout -- and optionally makes use of the isFileClosed API if it is 
 available (a facility that is not yet committed to a branch near you and 
 unlikely to be within your locality with a good while to come).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8434) Allow users to trade lease recovery for MTTR

2013-04-25 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8434:
---

 Summary: Allow users to trade lease recovery for MTTR
 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8


Please see discussion in HBase 8389.

For environments where lease recovery time can be bounded on the HDFS side 
through tight timeouts, provide a toggle for users who want the WAL splitting 
to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8434) Allow users to trade lease recovery for MTTR

2013-04-25 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8434:


Attachment: 8434.patch

 Allow users to trade lease recovery for MTTR
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8434) Allow users to trade lease recovery for MTTR

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641498#comment-13641498
 ] 

Varun Sharma commented on HBASE-8434:
-

[~enis]

Are you -1 on 8389 as well, then? It disables lease recovery by undoing 7878. 
If yes, can you chime in on HBASE 8389 too. This patch gives an option to 
enable it, if we choose to do 8389...

 Allow users to trade lease recovery for MTTR
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641499#comment-13641499
 ] 

Varun Sharma commented on HBASE-8389:
-

As I said above, there is no bug in HDFS - for a moment I thought that patch v5 
only increased the timeout from 1s to 4s but it also reverts to old behaviour 
of not enforcing lease recovery so that we can reduce MTTR. If we choose to 
enforce lease recovery, then if this timeout is significantly lower than the 
time it takes to recover the lease (if we can't recover within 4s), our MTTR 
will be poor.

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes  10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do,  initiate_recovery -- commit_block -- 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641542#comment-13641542
 ] 

Varun Sharma commented on HBASE-8389:
-

Hi Nicholas,

Firstly I configure the HDFS cluster in the following way:

dfs.socket.timeout = 3sec
dfs.socket.write.timeout = 5sec
ipc.client.connect.timeout = 1sec
ipc.client.connect.max.retries.on.timeouts = 2 (hence total 3 retries)

The connect timeout is low since connecting should really be very fast unless 
something major is wrong. Our clusters are housed within the same AZ on amazon 
EC2 and it is very rare to see these timeouts even getting hit on EC2, which is 
known for poor I/O performance. For the most part, I see these timeouts kick in 
during failures. Note that these timeouts are only used for avoiding bad datanodes 
and not for marking nodes as dead/stale, so I think these timeouts are okay for 
quick failovers - we already have high timeouts for dead node 
detection/zookeeper session (10's of seconds).

stale node timeout = 20 seconds
dead node timeout = 10 minutes
ZooKeeper session timeout = 30 seconds

HDFS is hadoop 2.0 with HDFS 3703, HDFS 3912 and HDFS 4721. The approach is the 
following:

a) A node is failed artificially by either
  1) Using iptables to allow only ssh traffic and drop all other traffic, or
  2) Suspending the processes

b) Even though we configure stale detection to be faster than hbase detection, 
let's assume that does not play out. The node is not marked stale.

c) Lease recovery attempt # 1
   i) We choose a good primary node for recovery - since it's likely that the 
bad node has the worst possible heartbeat (HDFS 4721)
   ii) But we point it to recover from all 3 nodes since we are considering the 
worst case where no node is marked stale
   iii) The primary tries to reconcile the block with all 3 nodes and hits 
either
a) dfs.socket.timeout = 3 seconds - if process is suspended
b) ipc.connect.timeout X ipc.connect.retries which is 3 * 1 second + 3 
* 1 second sleep = 6 seconds - if we firewall the host using iptables

d) If we use a value of 4 seconds, the first recovery attempt does not finish 
in time and we initiate lease recovery #2
   i) Either a rinse and repeat of c) happens
   ii) Or the node is now stale and the block is instantly recovered from the 
remaining two replicas

I think we could either adjust the 4 second timeout to be, say, 8 seconds and 
mostly be able to get the first attempt to succeed, or otherwise just wait 
for stale node detection and then we will have a fairly quick block recovery 
due to HDFS 4721.

I will try to test these values tomorrow, by rebooting some nodes...
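
For reference, the timeout settings listed above expressed as a Configuration 
(values in milliseconds, exactly as quoted; the point is to keep the same values 
in sync between hbase-site.xml and hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TightTimeoutsSketch {
  static Configuration tightTimeouts() {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("dfs.socket.timeout", 3000);                       // 3 sec
    conf.setInt("dfs.socket.write.timeout", 5000);                 // 5 sec
    conf.setInt("ipc.client.connect.timeout", 1000);               // 1 sec
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2);  // 3 attempts total
    return conf;
  }
}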


 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes  10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do,  initiate_recovery -- commit_block -- 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to 

[jira] [Commented] (HBASE-8434) Allow users to trade lease recovery for MTTR

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641546#comment-13641546
 ] 

Varun Sharma commented on HBASE-8434:
-

Okay - but then we need synchronous lease recovery for 0.94 - in HBASE 8389 - 
the patch v5 makes us skip lease recovery. I think either we revert that patch 
or we provide an option to force lease recovery.

As I said above, this one is 0.94-specific and not for trunk, and depends on 
what will come out of 8389 - if we choose to ditch lease recovery and revert 
to the same behaviour as before HBASE 8354, then we should have this...

 Allow users to trade lease recovery for MTTR
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641804#comment-13641804
 ] 

Varun Sharma commented on HBASE-8389:
-

Okay. I am just not 100 % sure if we want to mark the datanode as stale if the 
region server crashes alone...

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8434) Allow a condition for enabling hbase 8354 to support lease recovery

2013-04-25 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8434:


Summary: Allow a condition for enabling hbase 8354 to support lease 
recovery  (was: Allow users to trade lease recovery for MTTR)

 Allow a condition for enabling hbase 8354 to support lease recovery
 ---

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8434) Allow a condition for enabling hbase 8354 to support lease recovery

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641851#comment-13641851
 ] 

Varun Sharma commented on HBASE-8434:
-

I just changed the description and put it in simple words - the patch basically 
boils down to this question:

Do we want to be able to conditionally enable HBase 7878/8354 for 0.94 if we 
choose to move forward with patch v5 in HBase 8389 ?

That makes it easier to decide since we already have a lot of discussion in 
7878 and 8389.

Thanks
Varun

 Allow a condition for enabling hbase 8354 to support lease recovery
 ---

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-04-25 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8434:


Summary: Allow enabling hbase 8354 to support real lease recovery  (was: 
Allow a condition for enabling hbase 8354 to support lease recovery)

 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8434) Allow enabling hbase 8354 to support real lease recovery

2013-04-25 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8434:


Description: 
Please see discussion in HBase 8389.

For environments where lease recovery time can be bounded on the HDFS side 
through tight timeouts, provide a toggle for users who want the WAL splitting 
to continue only after the lease is truly recovered and the file is closed.

  was:
Please see discussion in HBase 8389.

For environments where lease recovery time can be bounded on the HDFS side 
through tight timeouts, provide a toggle for users who want the WAL splitting 
to continue only after the lease is recovered and the file is closed.


 Allow enabling hbase 8354 to support real lease recovery
 

 Key: HBASE-8434
 URL: https://issues.apache.org/jira/browse/HBASE-8434
 Project: HBase
  Issue Type: Improvement
  Components: MTTR
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8434.patch


 Please see discussion in HBase 8389.
 For environments where lease recovery time can be bounded on the HDFS side 
 through tight timeouts, provide a toggle for users who want the WAL splitting 
 to continue only after the lease is truly recovered and the file is closed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642099#comment-13642099
 ] 

Varun Sharma commented on HBASE-8389:
-

How long is the recovery time - the 1st recovery will always be moot unless you 
have HDFS 4721.

Can you grep the namenode logs for 8428898362502069151 and paste them here ?

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642101#comment-13642101
 ] 

Varun Sharma commented on HBASE-8389:
-

One more question - did you apply these timeouts to both your HDFS datanodes 
and your accumulo DFS clients (with a restart)? If yes, I expect the second 
recovery to succeed, hence you should recover in 2 attempts = 120 seconds

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642175#comment-13642175
 ] 

Varun Sharma commented on HBASE-8389:
-

Okay, I am done with testing for hbase using the above configuration - tight 
timeouts and stale node patches. I am using patch v5 and HBASE 8434 on top, to 
force lease recovery and not skip it.

The testing is done by doing kill -STOP server_process; kill -STOP 
datanode_process. Since I am forcing lease recovery, I am applying HBASE 8434 
on top, which basically means calling recoverLease every 5 seconds until it 
returns true. I have not found a single case where it takes more than 30-40 
seconds for recovery. The HDFS runs with the 3703, 3912 and 4721 patches, so at 
some point the recovery succeeds within 1 second after a node is marked 
stale.

So, I am able to consistently get log splitting finished within the 1st minute 
and either by the 2nd or 3rd minute all regions are back online. I have tried 
this a sufficient number of times to convince me that it works.
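
For reference, the polling described above amounts to something like the sketch 
below (a rough illustration, not the patch itself; the timing is added only to 
measure how long a forced recovery takes, and the 5 second pause matches the 
test setup):

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class ForcedRecoveryTimer {
    // Poll recoverLease() every 5 seconds until it returns true, and report
    // how long the forced recovery took (the 30-40 seconds quoted above).
    static long timeRecoveryMs(DistributedFileSystem dfs, Path wal)
        throws IOException, InterruptedException {
      long start = System.currentTimeMillis();
      while (!dfs.recoverLease(wal)) {
        Thread.sleep(5000L);
      }
      return System.currentTimeMillis() - start;
    }
  }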

Varun

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-25 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13642181#comment-13642181
 ] 

Varun Sharma commented on HBASE-8389:
-

[~ecn]

bq. Files recover as expected.

Can you elaborate - how many recovery attempts were needed for success, and how 
long between retries?

Varun

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-24 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13640785#comment-13640785
 ] 

Varun Sharma commented on HBASE-8389:
-

Okay, I did some testing with v5 and the MTTR was pretty good - 2-3 minutes - 
log splitting took 20-30 seconds, and basically around 1 minute of this time 
was spent replaying the edits from recovered_edits. This was, however, with the 
stale node patches (3703 and 3912) for HDFS, and with a tight 
dfs.socket.timeout=30. I mostly suspended the Datanode and the region server 
packages at the same time. I also ran a test where I used iptables to firewall 
against all traffic to the host except ssh traffic.

However, the weird thing was that I also tried to reproduce the failure scenario 
above, which is setting the timeout at 1 second, and I could not. I looked into 
the NN logs and this is what happened. Lease recovery was called the 1st time 
and a block recovery was initiated with the dead datanode (no HDFS 4721). Lease 
recovery was called the 2nd time and it returned true almost every time I ran 
these tests.

This is something that I did not see the last time around. The logs I attached 
above show that a lease recovery is called once by one SplitLogWorker, 
followed by 25 calls by another worker, followed by another 25 and eventually 
hundreds of calls the 3rd time. The 25 calls make sense since each split worker 
has a task level timeout of 25 seconds and we do recoverLease every second. 
Also there are 3 resubmissions, so the last worker is trying to get back the 
lease. I wonder if I hit a race condition which I can no longer reproduce, 
where one worker had the lease and did not give it up and subsequent workers 
just failed to recover the lease. In that case, 8354 is not the culprit, but I 
still prefer the more relaxed timeout in this JIRA.

Also, I am now a little confused about lease recovery. It seems that lease 
recovery can be separate from block recovery. Basically, recoverLease is 
called the first time and we enqueue a block recovery (which is never going to 
happen since we try to hit the dead datanode that is not heartbeating). However, 
the 2nd call still returns true, which confuses me since the block is still not 
finalized.

I wonder if lease recovery means anything other than flipping something at the 
namenode saying who has the lease to the file. But it's quite possible that the 
underlying block/file has not truly been recovered.
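
One way to make that distinction explicit - a sketch only, assuming the 
isFileClosed() call that newer Hadoop releases expose (it is not available on 
many of the older releases discussed in this thread) - is to check both the 
recoverLease() return value and whether the NN considers the file closed:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  public class RecoveryStateCheck {
    // recoverLease() flipping the lease holder at the NN is not the same as the
    // last block being finalized; where the API exists, isFileClosed() reports
    // whether the commitBlockSynchronization step has actually completed.
    static boolean trulyRecovered(DistributedFileSystem dfs, Path wal)
        throws IOException {
      boolean leaseReleased = dfs.recoverLease(wal);
      boolean blockFinalized = dfs.isFileClosed(wal);  // newer Hadoop only
      return leaseReleased && blockFinalized;
    }
  }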

[~ecn]
Do you see something similar in your namenode logs when you kill: lease recovery 
initiated but no real block recovery/commitBlockSynchronization messages (with 
both regionserver + datanode killed)? When we kill region server + datanode, we 
basically kill the primary, i.e. the first datanode which holds the block - this 
is the same datanode which would be chosen for block recovery...

Thanks
Varun




 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until 

[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-24 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13640925#comment-13640925
 ] 

Varun Sharma commented on HBASE-8389:
-

Thanks [~ecn]

This is something that I am not seeing in the tests I ran yesterday. The 2nd 
call almost always succeeds and returns null. Also, when the master calls 
recoverLease, are there corresponding commitBlockSynchronization messages in the 
NN log?

You may find this helpful - 
http://www.cyberciti.biz/tips/linux-iptables-4-block-all-incoming-traffic-but-allow-ssh.html



 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-24 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641101#comment-13641101
 ] 

Varun Sharma commented on HBASE-8389:
-

This is Hadoop 2.0.0-alpha CDH 4.2 - namenode logs - this is all there is for 
this block.
LOG LINE FOR BLOCK CREATION (NAMENODE)

2013-04-24 05:40:30,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocateBlock: 
/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238.
 BP-889095791-10.171.1.40-1366491606582 
blk_-2482251885029951704_11942{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.168.12.138:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

LOG LINES FOR RECOVERY INITIATION (NAMENODE)

2013-04-24 06:14:43,623 INFO BlockStateChange: BLOCK* 
blk_-2482251885029951704_11942{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.168.12.138:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW]]} recovery started, 
primary=10.170.15.97:50010
2013-04-24 06:14:43,623 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.internalReleaseLease: File 
/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238
 has not been closed. Lease recovery is in progress. RecoveryId = 12012 for 
block blk_-2482251885029951704_11942{blockUCState=UNDER_RECOVERY, 
primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.168.12.138:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

Note that the primary index is 0 - which is the datanode I killed. This was 
chosen as the primary DN for lease recovery. Obviously it will not work since 
the node is dead. But recoverLease nevertheless returned true for the next call. 
Now I am not sure if that is expected behaviour, since the real block recovery 
never happened.


 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we 

[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-24 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641106#comment-13641106
 ] 

Varun Sharma commented on HBASE-8389:
-

[~ecn]
Thanks Eric for checking (I guess we should have a similar data loss checker 
for hbase)

I wonder if you could look at your NN logs and see if you do get 
commitBlockSynchronization() log messages when the recoverLease method is 
called. I am trying to figure out why the block is not getting recovered while 
recoverLease is still returning true. These show up like this:

2013-04-24 16:38:26,254 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
commitBlockSynchronization(lastblock=BP-889095791-10.171.1.40-1366491606582:blk_-2482251885029951704_11942,
 newgenerationstamp=12012, newlength=7044280, newtargets=[10.170.15.97:50010], 
closeFile=true, deleteBlock=false)

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-24 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641294#comment-13641294
 ] 

Varun Sharma commented on HBASE-8389:
-

There is no second call - because the 2nd call returns true - I am following up 
on this in HDFS 4721

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-24 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13641467#comment-13641467
 ] 

Varun Sharma commented on HBASE-8389:
-

Alright, so it seems I have been stupid in running the recent tests. The lease 
recovery is correct in hadoop. I forgot what the v5 patch exactly does: it 
reverts to the old behaviour - I kept searching the namenode logs for multiple 
lease recoveries :)

HDFS timeouts (for region server and HDFS) - socket timeout = 3 seconds, socket 
write timeout = 5 seconds and ipc connect retries = 0 (timeout is hardcoded at 
20 seconds which is way too high)

I am summarizing each case:
1) After this patch,
When we split a log, we will do the following:
  a) Call recoverLease, which will enqueue a block recovery to the dead 
datanode, so a noop
  b) sleep 4 seconds
  c) Break the loop and access the file irrespective of whether recovery 
happened
  d) Sometimes fail but eventually get through

Note that lease recovery has not happened. If hbase finds a zero size hlog at 
any of the datanodes (the size is typically zero at the namenode since the file 
is not closed yet), it will error out and unassign the task, and some other 
region server will pick up the split task (a minimal sketch of this check is 
shown after case 3 below). From the hbase console, I am always seeing non zero 
edits being split - so we are reading data. I am not sure if accumulo does 
similar checks for zero sized WALs, but [~ecn] will know better.

Since lease recovery has not happened, we risk data loss, but it again depends 
on what kind of data loss accumulo sees - whether entire WAL(s) are lost or 
portions of WAL(s). If it's entire WAL(s), maybe the zero sized check in HBase 
saves it from data loss. But if portions of WALs are being lost in accumulo when 
the recoverLease return value is not checked, then we can have data loss after 
the v5 patch. Again, I will let [~ecn] speak on that.

The good news though is that I am seeing pretty good MTTR in this case. It's 
typically 2-3 minutes, and WAL splitting accounts for maybe 30-40 seconds. But 
note that I am running with HDFS 3912 and 3703, and that my HDFS timeouts are 
configured to fail fast.

2) Before this patch but after 8354

We have the issue where lease recoveries pile up on the namenode faster than 
they can be served (one every second); the side effect is that each later 
recovery preempts the earlier one. Basically, with HDFS it is simply not 
possible to get lease recovery within 4 seconds unless we use some of the stale 
node patches and really tighten all the HDFS timeouts and retries. So recoveries 
never finish in one second, and they keep piling up and preempting earlier 
recoveries. Eventually we wait for 300 seconds, hbase.lease.recovery.timeout, 
after which we just open the file, and mostly the last recovery has succeeded 
by then.

MTTR is not good in this case - at least 6 minutes for log splitting. One 
possibility could have been to reduce the 300 seconds to maybe 20 seconds.

3) One can have the best of both worlds - a good MTTR and no/little data loss - 
by opening files only after real lease recovery has happened, to avoid data 
corruption. For that, one would need to tune their HDFS timeouts (the connect + 
socket timeouts) to be low, so that lease recoveries can happen within 5-10 
seconds. I think that for such cases we should have a parameter saying whether 
we want to force lease recovery before splitting - I am going to raise a JIRA to 
discuss that configuration. Overall, if we had an isClosed() API life would be 
so much easier, but a large number of hadoop releases do not have it yet. I 
think this is more of a power user configuration, but it probably makes sense to 
have one.
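
As mentioned in case 1, here is a minimal sketch of the zero-length WAL guard 
(illustrative only, not the actual HBase split code; the class and method names 
are invented):

  import java.io.IOException;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ZeroLengthWalGuard {
    // If the WAL still reports zero length (the NN never learned the real
    // length because the file was not closed), fail the split task so that it
    // is resubmitted later instead of silently splitting an empty file.
    static void checkWalNotEmpty(FileSystem fs, Path wal) throws IOException {
      FileStatus status = fs.getFileStatus(wal);
      if (status.getLen() == 0) {
        throw new IOException("WAL " + wal + " reports zero length; "
            + "lease/block recovery has probably not completed yet");
      }
    }
  }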

Thanks !

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery 

[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-23 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13639205#comment-13639205
 ] 

Varun Sharma commented on HBASE-8389:
-

Thanks a lot for chiming in, Eric.

It's great to know that Accumulo also uses a single WAL block - so indeed, 7878 
is needed for HBase.

I may have used a strong word, DDoS, here - it does not mean the NN experiences 
high CPU, high load or high network traffic. It's just that it enters a vicious 
cycle of block recoveries (on 2.0.0-alpha)

Varun

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block -->
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-23 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13639206#comment-13639206
 ] 

Varun Sharma commented on HBASE-8389:
-

FYI,

I am going to test the v5 of this patch today and report back... Thanks !

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8389) HBASE-8354 forces Namenode into loop with lease recovery requests

2013-04-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8389:


Summary: HBASE-8354 forces Namenode into loop with lease recovery requests  
(was: HBASE-8354 DDoSes Namenode with lease recovery requests)

 HBASE-8354 forces Namenode into loop with lease recovery requests
 -

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-7295) Contention in HBaseClient.getConnection

2013-04-23 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-7295:


Attachment: volatile_output.txt
synchronized_output.txt
TestVolatile.java
TestSynchronized.java

 Contention in HBaseClient.getConnection
 ---

 Key: HBASE-7295
 URL: https://issues.apache.org/jira/browse/HBASE-7295
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.3
Reporter: Varun Sharma
Assignee: Varun Sharma
 Attachments: 7295-0.94.txt, 7295-0.94-v2.txt, 7295-0.94-v3.txt, 
 7295-0.94-v4.txt, 7295-0.94-v5.txt, 7295-trunk.txt, 7295-trunk.txt, 
 7295-trunk-v2.txt, 7295-trunk-v3.txt, 7295-trunk-v3.txt, 7295-trunk-v4.txt, 
 synchronized_output.txt, TestSynchronized.java, TestVolatile.java, 
 volatile_output.txt


 HBaseClient.getConnection() synchronizes on the connections object. We found 
 severe contention on a thrift gateway which was fanning out roughly 3000+ 
 calls per second to hbase region servers. The thrift gateway had 2000+ 
 threads for handling incoming connections. Threads were blocked on the 
 synchronized block - we set ipc.pool.size to 200. Since we are using
 RoundRobin/ThreadLocal pool only - its not necessary to synchronize on 
 connections - it might lead to cases where we might go slightly over the 
 ipc.max.pool.size() but the additional connections would timeout after 
 maxIdleTime - underlying PoolMap connections object is thread safe.
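 For illustration only (this is not the actual HBaseClient code), a minimal sketch of the idea described above: replace the coarse synchronized getConnection() with a lock-free lookup on a concurrent map, accepting that we may briefly exceed ipc.pool.size.

   import java.util.concurrent.ConcurrentHashMap;

   // Illustration only, not the real HBaseClient: a lock-free connection lookup
   // on a concurrent map instead of a coarse synchronized block. Two racing
   // threads may briefly create one connection too many; the loser is handed
   // the winner's connection and the extra one is left to expire after
   // maxIdleTime, mirroring the trade-off described in this issue.
   class ConnectionCache<K, C> {
     interface Factory<K, C> { C create(K key); }

     private final ConcurrentHashMap<K, C> connections = new ConcurrentHashMap<K, C>();

     C getConnection(K key, Factory<K, C> factory) {
       C conn = connections.get(key);              // fast path, no lock taken
       if (conn == null) {
         C created = factory.create(key);
         C existing = connections.putIfAbsent(key, created);
         conn = (existing != null) ? existing : created;
       }
       return conn;
     }
   }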

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7295) Contention in HBaseClient.getConnection

2013-04-23 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13639262#comment-13639262
 ] 

Varun Sharma commented on HBASE-7295:
-

I attached two files comparing performance of volatile and synchronized 
simulating a behaviour similar to HBaseClient using a PoolMap<String, String>.

The tests basically run 1000 threads which issue 10,000 calls each, and each call 
performs an operation on the PoolMap 30 times - so 300 million operations in total.

Results
1) Synchronized - 90 seconds
2) Volatile - 16 seconds

As you can see, volatile is almost an order of magnitude faster.

That said, it seems like we can do this kind of locking almost 3 million times 
per second on a simple computer which is, for most cases, enough.
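For reference, a minimal sketch of the kind of comparison described above. This is not the attached TestSynchronized.java/TestVolatile.java, and the thread and operation counts are scaled down.

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.CountDownLatch;

  // Minimal sketch of the comparison: many threads doing repeated lookups on a
  // shared map, once guarded by a synchronized block and once through a plain
  // read of a volatile reference.
  public class LockComparison {
    static final ConcurrentHashMap<String, String> syncMap = new ConcurrentHashMap<String, String>();
    static volatile ConcurrentHashMap<String, String> volMap = new ConcurrentHashMap<String, String>();

    public static void main(String[] args) throws InterruptedException {
      syncMap.put("key", "value");
      volMap.put("key", "value");
      System.out.println("synchronized: " + run(true) + " ms");
      System.out.println("volatile:     " + run(false) + " ms");
    }

    static long run(final boolean useSync) throws InterruptedException {
      final int threads = 100;            // scaled down from the 1000 in the test
      final int opsPerThread = 1000000;
      final CountDownLatch done = new CountDownLatch(threads);
      long start = System.currentTimeMillis();
      for (int t = 0; t < threads; t++) {
        new Thread(new Runnable() {
          public void run() {
            for (int i = 0; i < opsPerThread; i++) {
              if (useSync) {
                synchronized (syncMap) { syncMap.get("key"); }  // coarse lock
              } else {
                volMap.get("key");                              // volatile read only
              }
            }
            done.countDown();
          }
        }).start();
      }
      done.await();
      return System.currentTimeMillis() - start;
    }
  }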

 Contention in HBaseClient.getConnection
 ---

 Key: HBASE-7295
 URL: https://issues.apache.org/jira/browse/HBASE-7295
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.3
Reporter: Varun Sharma
Assignee: Varun Sharma
 Attachments: 7295-0.94.txt, 7295-0.94-v2.txt, 7295-0.94-v3.txt, 
 7295-0.94-v4.txt, 7295-0.94-v5.txt, 7295-trunk.txt, 7295-trunk.txt, 
 7295-trunk-v2.txt, 7295-trunk-v3.txt, 7295-trunk-v3.txt, 7295-trunk-v4.txt, 
 synchronized_output.txt, TestSynchronized.java, TestVolatile.java, 
 volatile_output.txt


 HBaseClient.getConnection() synchronizes on the connections object. We found 
 severe contention on a thrift gateway which was fanning out roughly 3000+ 
 calls per second to hbase region servers. The thrift gateway had 2000+ 
 threads for handling incoming connections. Threads were blocked on the 
 synchronized block - we set ipc.pool.size to 200. Since we are using
 RoundRobin/ThreadLocal pool only - its not necessary to synchronize on 
 connections - it might lead to cases where we might go slightly over the 
 ipc.max.pool.size() but the additional connections would timeout after 
 maxIdleTime - underlying PoolMap connections object is thread safe.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-22 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637792#comment-13637792
 ] 

Varun Sharma commented on HBASE-8389:
-

Hi Ted,

+1 on patch for 0.94

This basically retains the old behaviour prior to hbase 7878. For people who 
run clusters with tight hdfs timeouts and know that lease/block recovery for 
them shall occur within x seconds, they can make use of the configurable retry 
interval to wait for the real recovery to happen.

I feel that HBase 7878 does not cause data loss for hbase but I am not sure. My 
theory is that for HBase, the size of 1 WAL is < the size of one HDFS block before it 
is rolled. Hence each rolled WAL has one block, and the one on which the lease is 
being held contains 1 block under_recovery/under_construction - that file has a 
size=0 in the namenode until the lease recovery is complete - this is because 
there are no finalized blocks for the WAL. So, when we do the following, try to 
replay the WAL without lease recovery being complete, we get a file of size 0 
from the namenode.

From the region server logs, it seems that we do not take size 0 for the file 
as the truth but instead treat it as a failure until we get a WAL file size > 0.

However, if the size of the WAL is > the HDFS block size then it is possible that 
some HDFS blocks have been finalized and we get a non-zero file size because of the 
finalized blocks. In that case we could end up throwing away the last block 
belonging to the WAL. Maybe that is why this was observed for accumulo but not 
for HBase.

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-trunk-v1.txt, 8389-trunk-v2.patch, 
 nn1.log, nn.log, sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8284) Allow String Offset(s) in ColumnPaginationFilter for bookmark based pagination

2013-04-22 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13638468#comment-13638468
 ] 

Varun Sharma commented on HBASE-8284:
-

Friendly ping...

Thanks!

 Allow String Offset(s) in ColumnPaginationFilter for bookmark based pagination
 --

 Key: HBASE-8284
 URL: https://issues.apache.org/jira/browse/HBASE-8284
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.6.1
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor
 Fix For: 0.98.0, 0.94.8, 0.95.1

 Attachments: 8284-0.94.txt, 8284-0.94-v2.txt, 8284-0.95.txt, 
 8284-trunk.txt


 Attaching from email to HBase user mailing list:
 I am thinking of adding a string offset to ColumnPaginationFilter. There are 
 two reasons:
 1) For deep pagination, you can seek using SEEK_NEXT_USING_HINT.
 2) For correctness reasons, this approach is better if the list of columns is 
 mutating. Let's say you get the 1st 50 columns using the current approach. In the 
 mean time some columns are inserted amongst the 1st 50 columns. Now you 
 request the 2nd set of 50 columns. Chances are that you will have duplicates 
 amongst the 2 sets (1st 50 and 2nd 50). If instead you used the last column 
 of the 1st 50 as a string offset for getting the 2nd set of columns, the 
 chances of getting dups are significantly lower.
 This becomes important for user facing interactive applications. Particularly 
 where consistency etc. are not as important since those are best effort 
 services. But showing duplicates across pages is pretty bad.
 Please let me know if this makes sense and is feasible. Basically, I would 
 like a string offset passed to ColumnPaginationFilter as an alternative 
 constructor. If the string offset is supplied, then, I would like to seek to 
 either the column supplied or if the column is deleted, seek to the column 
 just greater than the supplied column.
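 For illustration, a hedged sketch of how the bookmark-based pagination described above could look, assuming the proposed constructor form ColumnPaginationFilter(int limit, byte[] columnOffset) is added; at the time of this comment it is only a proposal, and the table, family and row names below are made up.

   import java.io.IOException;

   import org.apache.hadoop.hbase.client.Get;
   import org.apache.hadoop.hbase.client.HTable;
   import org.apache.hadoop.hbase.client.Result;
   import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;

   // Sketch of bookmark-based column pagination as proposed above. Whether the
   // offset column itself is included or skipped depends on the final semantics
   // chosen for the filter.
   public class ColumnPager {
     static Result nextPage(HTable table, byte[] row, byte[] family,
         byte[] bookmarkQualifier, int pageSize) throws IOException {
       Get get = new Get(row);
       get.addFamily(family);
       // Seek to the bookmark qualifier instead of counting columns from the
       // start, so columns inserted in the meantime cannot shift the page and
       // produce duplicates across pages.
       get.setFilter(new ColumnPaginationFilter(pageSize, bookmarkQualifier));
       return table.get(get);
     }
   }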

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-22 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13638492#comment-13638492
 ] 

Varun Sharma commented on HBASE-8389:
-

My feeling is that this will not truly work because we will get stuck.

We call recoverLease - lease recovery gets added to primary datanode which is 
already dead. Now, we keep calling isClosed() but the file never closes since 
the lease recovery does not really start (unless we have something like HDFS 
4721).

Eventually, I suspect there is a timeout for how long HLog tasks can be 
outstanding.
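
For illustration, a sketch of the recoverLease()/isFileClosed() interaction being discussed, with an overall deadline so a recovery that never starts cannot hang the split task forever. The helper and the timeout handling are hypothetical, and it assumes an HDFS client that exposes isFileClosed().

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  // Sketch of the interaction being discussed: re-issue recoverLease() if the
  // file does not close within a bounded window, and poll isFileClosed() under
  // an overall deadline so a recovery that never started cannot hang us forever.
  class LeaseRecoveryPoller {
    static boolean waitForClose(DistributedFileSystem dfs, Path wal,
        long reissueIntervalMs, long deadlineMs)
        throws IOException, InterruptedException {
      long start = System.currentTimeMillis();
      long lastIssued = 0L;
      while (System.currentTimeMillis() - start < deadlineMs) {
        if (System.currentTimeMillis() - lastIssued >= reissueIntervalMs) {
          if (dfs.recoverLease(wal)) {
            return true;             // NameNode says the lease is already recovered
          }
          lastIssued = System.currentTimeMillis();
        }
        if (dfs.isFileClosed(wal)) {
          return true;               // block recovery finished and the file closed
        }
        Thread.sleep(1000);
      }
      return false;                  // caller decides whether to fail the split task
    }
  }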

Varun

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-22 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13638551#comment-13638551
 ] 

Varun Sharma commented on HBASE-8389:
-

I think the isFileClosed() loop basically assumes that recoverLease() has started 
the recovery. But it is quite possible that recoverLease() has never started 
the recovery. So we need to be a little more careful with the interactions b/w 
recoverLease and isFileClosed().

For now we can perhaps stick to v5. I think the behaviour there would be, for a 
cluster run with default settings:
a) recoverLease every 4 seconds
b) HLog split timeout expires
c) Let the task bounce back and forth b/w region servers

The only way to fix this is to configure the HDFS cluster in a way such that 
lease recovery finishes within 4 seconds (like dfs.socket.timeout=3000 and 
connect timeouts to be low enough). I am going to test out some of these 
combinations today on 0.94
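
For illustration, the kind of tightened client configuration described above, expressed with the Hadoop Configuration API. The dfs.socket.timeout and dfs.socket.write.timeout values are the ones mentioned in this thread; the ipc.client.connect.* keys are an assumption and should be checked against the Hadoop version in use.

  import org.apache.hadoop.conf.Configuration;

  // Illustration of tightened client-side timeouts so lease/block recovery can
  // finish within a few seconds; only the dfs.* keys/values come from this thread.
  public class TightRecoveryConf {
    public static Configuration tighten(Configuration conf) {
      conf.setInt("dfs.socket.timeout", 3000);          // DN socket read timeout
      conf.setInt("dfs.socket.write.timeout", 5000);    // DN socket write timeout
      conf.setInt("ipc.client.connect.timeout", 3000);            // assumed key
      conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2); // assumed key
      return conf;
    }
  }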

Thanks
Varun

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Bug
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8389) HBase 8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8389:
---

 Summary: HBase 8354 DDoSes Namenode with lease recovery requests
 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
 Environment: We ran hbase 0.94.3 patched with 8354 and observed too 
many outstanding lease recoveries because of the short retry interval of 1 
second between lease recoveries.

The namenode gets into the following loop:
1) Receives lease recovery request and initiates recovery choosing a primary 
datanode every second
2) A lease recovery is successful and the namenode tries to commit the block 
under recovery as finalized - this takes < 10 seconds in our environment since 
we run with tight HDFS socket timeouts.
3) At step 2), there is a more recent recovery enqueued because of the 
aggressive retries. This causes the committed block to get preempted and we 
enter a vicious cycle

So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery

This loop is paused after 300 seconds which is the 
hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
detection timeout is 20 seconds.

Note that before the patch, we do not call recoverLease so aggressively - also 
it seems that the HDFS namenode is pretty dumb in that it keeps initiating new 
recoveries for every call. Before the patch, we call recoverLease, assume that 
the block was recovered, try to get the file, it has zero length since its 
under recovery, we fail the task and retry until we get a non zero length. So 
things just work.

Fixes:
1) Expecting recovery to occur within 1 second is too aggressive. We need to 
have a more generous timeout. The timeout needs to be configurable since 
typically, the recovery takes as much time as the DFS timeouts. The primary 
datanode doing the recovery tries to reconcile the blocks and hits the timeouts 
when it tries to contact the dead node. So the recovery is as fast as the HDFS 
timeouts.

2) We have another issue I report in HDFS 4721. The Namenode chooses the stale 
datanode to perform the recovery (since its still alive). Hence the first 
recovery request is bound to fail. So if we want a tight MTTR, we either need 
something like HDFS 4721 or we need something like this

  recoverLease(...)
  sleep(1000)
  recoverLease(...)
  sleep(configuredTimeout)
  recoverLease(...)
  sleep(configuredTimeout)

Where configuredTimeout should be large enough to let the recovery happen but 
the first timeout is short so that we get past the moot recovery in step #1.
 

Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8389) HBase 8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8389:


Attachment: nn.log

 HBase 8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
 Environment: We ran hbase 0.94.3 patched with 8354 and observed too 
 many outstanding lease recoveries because of the short retry interval of 1 
 second between lease recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: nn.log




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBase 8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637606#comment-13637606
 ] 

Varun Sharma commented on HBASE-8389:
-

Attached the Namenode logs showing a large number of recoveries in progress...

1) nn.log - showing a huge number of initiated recoveries
2) nn1.log - showing a huge number of block finalization/commit failures

We ran hbase with an increased sleep period b/w recoveries - 25 seconds and the 
time to recovery came down substantially.

 HBase 8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
 Environment: We ran hbase 0.94.3 patched with 8354 and observed too 
 many outstanding lease recoveries because of the short retry interval of 1 
 second between lease recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: nn1.log, nn.log




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8389) HBase 8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8389:


Attachment: nn1.log

 HBase 8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
 Environment: We ran hbase 0.94.3 patched with 8354 and observed too 
 many outstanding lease recoveries because of the short retry interval of 1 
 second between lease recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: nn1.log, nn.log




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8389:


Attachment: sample.patch

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
 Environment: We ran hbase 0.94.3 patched with 8354 and observed too 
 many outstanding lease recoveries because of the short retry interval of 1 
 second between lease recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: nn1.log, nn.log, sample.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637610#comment-13637610
 ] 

Varun Sharma commented on HBASE-8389:
-

Attached sample patch which works in our setup. We run with 
dfs.socket.timeout=3000 and dfs.socket.write.timeout=5000 - so recovery 
typically takes < 20 seconds since there is 1 WAL and we recover only 1 block.

If you closely observe the logs - the first useful recovery starts @49:05 since 
the first recovery chooses the dead datanode as the primary DN to do the 
recovery and first commit block is around 49:08 - hence, the recovery is 
finished within 3 seconds - this is the same as dfs.socket.timeout which is 3 
seconds (the primary DN times out on the dead DN while trying to reconcile the 
replicas).

I believe that if we do not pick stale replicas (> 20 seconds without a heartbeat) as 
primary DN(s), and if, when we choose a non-stale replica as the primary DN, we do 
not reconcile blocks against the stale replica, then we can get the lease recovery 
to finish under 1 second. Currently that is not the case.

Varun

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
 Environment: We ran hbase 0.94.3 patched with 8354 and observed too 
 many outstanding lease recoveries because of the short retry interval of 1 
 second between lease recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: nn1.log, nn.log, sample.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8389:


Attachment: 8389-trunk-v2.patch

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8389-trunk-v1.txt, 8389-trunk-v2.patch, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637645#comment-13637645
 ] 

Varun Sharma commented on HBASE-8389:
-

Ted, thanks for the patch - I just attached a v2 with the corrected comments.

Basically, there are two things here:
a) The DDoS is independent of whether the namenode chose the stale data node as 
the primary DN to do the recovery. All it needs is a slower recovery time than 
the retry interval. Because then recoveries pile up faster than they actually 
complete. As a result, any recovery that succeeds gets preempted by a recovery 
that starts later. So the retry interval needs to be at least as big as the underlying 
HDFS timeout.
b) The very first recoverLease call is always a no-op. In fact every third call 
is a no-op since the NN chooses DN1 -> DN2 -> DN3 -> DN1 in a cyclic order to do 
the recoveries. Note that DN1 is the dead datanode here.

Currently, I think it will take 900 seconds for the cluster to recover if it 
accepts write traffic across all region servers.

Varun


 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
 Fix For: 0.94.8

 Attachments: 8389-trunk-v1.txt, 8389-trunk-v2.patch, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637665#comment-13637665
 ] 

Varun Sharma commented on HBASE-8389:
-

I believe the test failed because I increased the default retry interval in 
patch v2.

I think before HBASE 8354 - we essentially have the following situation:

1) Split task picked up by region server - recover lease called
2) Ignore return value
3) Try to read file and get a file length=0 and sometimes try to grab the 0 
length file from a DN
4) Mostly fail because of the 0 length file or because the DN has not finalized 
the block and it's under recovery 
5) Task unassigned and bounces back and forth b/w multiple region servers (I see 
multiple region servers holding the same task sometimes)

This process is equally bad - multi minute recovery (not sure exactly how long)

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-trunk-v1.txt, 8389-trunk-v2.patch, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do, initiate_recovery --> commit_block --> commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since its under recovery, we fail the task and retry until we 
 get a non zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637688#comment-13637688
 ] 

Varun Sharma commented on HBASE-8389:
-

Hi Ted,

Seems like lease recovery is the real thorn when it comes to recovery for 
HBase. The stale node detection patches work very well for splitting the 
finalized WAL but not the WAL currently being written to. I basically see a 
very long time to recovery because lease recovery always takes a long time on 
HDFS with the stock timeouts.

I am trying out a patch for HDFS 4721 which basically avoids, during block 
recovery, all those datanodes that have not heartbeated for 20 seconds. 
That seems to enable me to recover the block within 1 second. With that fix, we 
can survive loss of a single datanode and recover the last WAL within 2-3 
seconds.

I have not heard from the HDFS community on it yet, but I think once we lose a 
datanode, it should not be chosen as the primary datanode for lease recovery 
or for block reconciliation...
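
As a rough illustration of that idea (this is not the HDFS 4721 patch; the Replica 
type and heartbeat accessor below are stand-ins assumed for brevity), primary 
selection for block recovery would simply skip replicas whose last heartbeat is 
older than the staleness threshold:

    import java.util.List;

    public class PrimarySelection {
      static final long STALE_INTERVAL_MS = 20_000;  // the 20s stale-node timeout mentioned above

      interface Replica {               // simplified stand-in for a datanode descriptor
        boolean isAlive();
        long lastHeartbeatMs();
      }

      /** Pick the first live, non-stale replica as the primary for block recovery. */
      static Replica choosePrimary(List<Replica> replicas, long nowMs) {
        for (Replica r : replicas) {
          if (r.isAlive() && nowMs - r.lastHeartbeatMs() < STALE_INTERVAL_MS) {
            return r;
          }
        }
        // fall back to the old behaviour only if every replica looks stale
        return replicas.isEmpty() ? null : replicas.get(0);
      }
    }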

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-trunk-v1.txt, 8389-trunk-v2.patch, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do: initiate_recovery -> commit_block -> 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since it is under recovery, we fail the task and retry until we 
 get a non-zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8389) HBASE-8354 DDoSes Namenode with lease recovery requests

2013-04-21 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13637701#comment-13637701
 ] 

Varun Sharma commented on HBASE-8389:
-

Cool - I attached a rough patch with some logs to show how it allows lease 
recovery within 1-2 seconds. Now waiting for input from the HDFS community.

Do we know what the status was w.r.t. lease recovery before HBASE-8354? We 
call recoverLease and access the file - mostly we error out, and then do we 
retry or do we unassign the ZK task for the split?

Varun

 HBASE-8354 DDoSes Namenode with lease recovery requests
 ---

 Key: HBASE-8389
 URL: https://issues.apache.org/jira/browse/HBASE-8389
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Critical
 Fix For: 0.94.8

 Attachments: 8389-trunk-v1.txt, 8389-trunk-v2.patch, nn1.log, nn.log, 
 sample.patch


 We ran hbase 0.94.3 patched with 8354 and observed too many outstanding lease 
 recoveries because of the short retry interval of 1 second between lease 
 recoveries.
 The namenode gets into the following loop:
 1) Receives lease recovery request and initiates recovery choosing a primary 
 datanode every second
 2) A lease recovery is successful and the namenode tries to commit the block 
 under recovery as finalized - this takes < 10 seconds in our environment 
 since we run with tight HDFS socket timeouts.
 3) At step 2), there is a more recent recovery enqueued because of the 
 aggressive retries. This causes the committed block to get preempted and we 
 enter a vicious cycle
 So we do: initiate_recovery -> commit_block -> 
 commit_preempted_by_another_recovery
 This loop is paused after 300 seconds which is the 
 hbase.lease.recovery.timeout. Hence the MTTR we are observing is 5 minutes 
 which is terrible. Our ZK session timeout is 30 seconds and HDFS stale node 
 detection timeout is 20 seconds.
 Note that before the patch, we do not call recoverLease so aggressively - 
 also it seems that the HDFS namenode is pretty dumb in that it keeps 
 initiating new recoveries for every call. Before the patch, we call 
 recoverLease, assume that the block was recovered, try to get the file, it 
 has zero length since it is under recovery, we fail the task and retry until we 
 get a non-zero length. So things just work.
 Fixes:
 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
 have a more generous timeout. The timeout needs to be configurable since 
 typically, the recovery takes as much time as the DFS timeouts. The primary 
 datanode doing the recovery tries to reconcile the blocks and hits the 
 timeouts when it tries to contact the dead node. So the recovery is as fast 
 as the HDFS timeouts.
 2) We have another issue I report in HDFS 4721. The Namenode chooses the 
 stale datanode to perform the recovery (since its still alive). Hence the 
 first recovery request is bound to fail. So if we want a tight MTTR, we 
 either need something like HDFS 4721 or we need something like this
   recoverLease(...)
   sleep(1000)
   recoverLease(...)
   sleep(configuredTimeout)
   recoverLease(...)
   sleep(configuredTimeout)
 Where configuredTimeout should be large enough to let the recovery happen but 
 the first timeout is short so that we get past the moot recovery in step #1.
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8362) Possible MultiGet optimization

2013-04-18 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634884#comment-13634884
 ] 

Varun Sharma commented on HBASE-8362:
-

Or add a new API and retain the older API for the exotic 1%?

 Possible MultiGet optimization
 --

 Key: HBASE-8362
 URL: https://issues.apache.org/jira/browse/HBASE-8362
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl

 Currently MultiGets are executed on a RegionServer in a single thread in a 
 loop that handles each Get separately (opening a scanner, seeking, etc).
 It seems we could optimize this (per region at least) by opening a single 
 scanner and issuing a reseek for each Get that was requested.
 I have not tested this yet and no patch, but I would like to solicit feedback 
 on this idea.
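
A rough Java sketch of that single-scanner-plus-reseek idea, with RowScanner as a 
simplified stand-in assumed for illustration (not an actual HBase interface):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class MultiGetViaReseek {

      interface RowScanner {                      // stand-in for the region-side scanner
        boolean reseek(byte[] row) throws IOException;
        List<byte[]> next() throws IOException;   // cells of the current row
      }

      /** Serve a batch of Gets with one scanner: sort the rows, then reseek forward to each. */
      static List<List<byte[]>> multiGet(RowScanner scanner, List<byte[]> rows)
          throws IOException {
        List<byte[]> sorted = new ArrayList<>(rows);
        // reseek only moves forward, so the requested rows must be visited in order
        // (HBase actually compares rows as unsigned bytes; a plain compare is used here for brevity)
        sorted.sort((a, b) -> Arrays.compare(a, b));
        List<List<byte[]>> results = new ArrayList<>();
        for (byte[] row : sorted) {
          scanner.reseek(row);                    // skip ahead instead of opening a new scanner
          results.add(scanner.next());
        }
        return results;
      }
    }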

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8362) Possible MultiGet optimization

2013-04-17 Thread Varun Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13634518#comment-13634518
 ] 

Varun Sharma commented on HBASE-8362:
-

I also think this would be better.

Apologies, but I did not completely understand the reseek comment - does that 
mean we could seek to a row beyond the value of the seek hint?

Also, I wonder if an implementation via a filter could be possible. It might 
bring this technique of sorting the input and reseeking to scans as well 
(wherever it makes sense), unless we already have something for scans that does 
this. I think FuzzyRowFilter might be doing something similar to this (it seems 
to do more).

I am really looking forward to testing an implementation on a setup we have (6 
region servers running 0.94, no blooms, block cache @ 64k, with multi-gets of 50 
rows).

 Possible MultiGet optimization
 --

 Key: HBASE-8362
 URL: https://issues.apache.org/jira/browse/HBASE-8362
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl

 Currently MultiGets are executed on a RegionServer in a single thread in a 
 loop that handles each Get separately (opening a scanner, seeking, etc).
 It seems we could optimize this (per region at least) by opening a single 
 scanner and issuing a reseek for each Get that was requested.
 I have not tested this yet and no patch, but I would like to solicit feedback 
 on this idea.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-04-17 Thread Varun Sharma (JIRA)
Varun Sharma created HBASE-8370:
---

 Summary: Report data block cache hit rates apart from aggregate 
cache hit rates
 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Priority: Minor


Attaching from mail to d...@hbase.apache.org

I am wondering whether the HBase cachingHitRatio metric that the region server 
UI shows can give me a breakdown by data blocks. I always see this number to 
be very high, and that could be exaggerated by the fact that each lookup hits the 
index blocks and bloom filter blocks in the block cache before retrieving the 
data block. This could be artificially bloating up the cache hit ratio.

Assuming the above is correct, do we already have a (perhaps more obscure) cache 
hit ratio for data blocks alone? If not, my sense is that it would be 
pretty valuable to add one.
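
For concreteness, a minimal sketch of what such a metric could look like (the 
counter and category names below are illustrative assumptions, not HBase's 
existing metrics classes): keep hit/miss counters per block category and expose 
a data-block-only ratio alongside the aggregate one.

    import java.util.EnumMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class BlockCacheStats {
      enum BlockCategory { DATA, INDEX, BLOOM }

      private final EnumMap<BlockCategory, AtomicLong> hits = new EnumMap<>(BlockCategory.class);
      private final EnumMap<BlockCategory, AtomicLong> misses = new EnumMap<>(BlockCategory.class);

      public BlockCacheStats() {
        for (BlockCategory c : BlockCategory.values()) {
          hits.put(c, new AtomicLong());
          misses.put(c, new AtomicLong());
        }
      }

      void hit(BlockCategory c)  { hits.get(c).incrementAndGet(); }
      void miss(BlockCategory c) { misses.get(c).incrementAndGet(); }

      /** Hit ratio over data blocks only, so index/bloom lookups cannot inflate it. */
      double dataBlockHitRatio() {
        long h = hits.get(BlockCategory.DATA).get();
        long m = misses.get(BlockCategory.DATA).get();
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
      }

      /** Aggregate ratio across all block categories (what the current number resembles). */
      double aggregateHitRatio() {
        long h = 0, m = 0;
        for (BlockCategory c : BlockCategory.values()) {
          h += hits.get(c).get();
          m += misses.get(c).get();
        }
        return (h + m) == 0 ? 0.0 : (double) h / (h + m);
      }
    }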

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-8370) Report data block cache hit rates apart from aggregate cache hit rates

2013-04-17 Thread Varun Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Sharma updated HBASE-8370:


Assignee: Varun Sharma

 Report data block cache hit rates apart from aggregate cache hit rates
 --

 Key: HBASE-8370
 URL: https://issues.apache.org/jira/browse/HBASE-8370
 Project: HBase
  Issue Type: Improvement
Reporter: Varun Sharma
Assignee: Varun Sharma
Priority: Minor

 Attaching from mail to d...@hbase.apache.org
 I am wondering whether the HBase cachingHitRatio metric that the region 
 server UI shows can give me a breakdown by data blocks. I always see this 
 number to be very high, and that could be exaggerated by the fact that each 
 lookup hits the index blocks and bloom filter blocks in the block cache 
 before retrieving the data block. This could be artificially bloating up the 
 cache hit ratio.
 Assuming the above is correct, do we already have a (perhaps more obscure) 
 cache hit ratio for data blocks alone? If not, my sense is that it would be 
 pretty valuable to add one.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

