date:20120111


[ 
https://issues.apache.org/jira/browse/HBASE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183969#comment-13183969
 ] 

Hudson commented on HBASE-5152:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5152  Region is on service before completing initialization when 
doing rollback of split,
   it will affect read correctness (Chunhui)

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java


 Region is on service before completing initialization when doing rollback of 
 split, it will affect read correctness 
 

 Key: HBASE-5152
 URL: https://issues.apache.org/jira/browse/HBASE-5152
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.92.0, 0.94.0

 Attachments: 5152-v2.txt, hbase-5152.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5052) The path where a dynamically loaded coprocessor jar is copied on the local file system depends on the region name (and implicitly, the start key)


[ 
https://issues.apache.org/jira/browse/HBASE-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183971#comment-13183971
 ] 

Hudson commented on HBASE-5052:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5052 The path where a dynamically loaded coprocessor jar is copied on 
the local file system depends on the region name (and implicitly, the start key)

stack : 
Files : 
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/RegionCoprocessorHost.java


 The path where a dynamically loaded coprocessor jar is copied on the local 
 file system depends on the region name (and implicitly, the start key)
 -

 Key: HBASE-5052
 URL: https://issues.apache.org/jira/browse/HBASE-5052
 Project: HBase
  Issue Type: Bug
  Components: coprocessors
Affects Versions: 0.92.0
Reporter: Andrei Dragomir
Assignee: Andrei Dragomir
 Fix For: 0.92.0

 Attachments: HBASE-5052.patch


 When loading a coprocessor from hdfs, the jar file gets copied to a path on 
 the local filesystem, which depends on the region name, and the region start 
 key. The name is cleaned, but not enough, so when you have filesystem 
 unfriendly characters (/?:, etc), the coprocessor is not loaded, and an error 
 is thrown
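
 One way to keep the local path free of awkward characters is to map everything outside a conservative whitelist to an underscore. The sketch below only illustrates that idea; it is not the attached patch, and the class and method names are hypothetical.

```java
// Hypothetical helper, not the HBASE-5052 patch: replace every character
// that is not a letter, digit, dot, dash or underscore with '_'.
final class LocalJarNames {
    private LocalJarNames() {}

    static String sanitize(String regionName) {
        return regionName.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args) {
        // A made-up region name whose start key contains '/', '?' and ':'.
        System.out.println(sanitize("t1,a/b?c:d,1325635015491"));
    }
}
```

 Note that a mapping like this can make two different region names collide, so a real fix may prefer a name that does not depend on the start key at all.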

[jira] [Commented] (HBASE-5121) MajorCompaction may affect scan's correctness


[ 
https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183972#comment-13183972
 ] 

Hudson commented on HBASE-5121:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5121 MajorCompaction may affect scan's correctness (chunhui shen and 
Lars H)

larsh : 
Files : 
* /hbase/trunk/CHANGES.txt
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestScanner.java


 MajorCompaction may affect scan's correctness
 -

 Key: HBASE-5121
 URL: https://issues.apache.org/jira/browse/HBASE-5121
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.90.4
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.94.0, 0.92.1

 Attachments: 5121-0.92.txt, 5121-suggest.txt, 
 5121-trunk-combined.txt, 5121.90, hbase-5121-testcase.patch, 
 hbase-5121.patch, hbase-5121v2.patch


 In our test, each row has KeyValues in two families.
 We found an infrequent problem with scan's next() when a major 
 compaction happens concurrently.
 In two consecutive client calls to scan.next():
 1. The first next() returns a result where family A is null.
 2. The second next() returns a result where family B is null.
 Both results are for the same row.
 With more families, I think the scenario would be even stranger...
 The reason is that storescanner.peek() changes after a major 
 compaction if there are delete-type KeyValues.
 Because of this change, the PriorityQueue<KeyValueScanner> in RegionScanner's 
 heap is no longer guaranteed to be sorted.
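
 The effect described above can be reproduced in miniature with a plain java.util.PriorityQueue: if the value an entry is ordered by changes while the entry is still queued, the queue is not told and may hand out the wrong entry. The sketch below is an analogy only; FakeScanner and StaleHeapDemo are made-up names, not HBase's KeyValueHeap.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class StaleHeapDemo {
    // Stand-in for a scanner: the heap orders entries by 'head' (think peek()).
    static final class FakeScanner {
        final String name;
        int head;
        FakeScanner(String name, int head) { this.name = name; this.head = head; }
    }

    static PriorityQueue<FakeScanner> newHeap() {
        return new PriorityQueue<>(Comparator.comparingInt((FakeScanner s) -> s.head));
    }

    // Returns the name of the scanner the heap hands out after A's head changed.
    // With reinsert=false the change happens while A is still queued.
    static String pollAfterChange(boolean reinsert) {
        PriorityQueue<FakeScanner> heap = newHeap();
        FakeScanner a = new FakeScanner("A", 1);
        FakeScanner b = new FakeScanner("B", 2);
        heap.add(a);
        heap.add(b);
        if (reinsert) {
            heap.remove(a);   // take it out, change it, put it back
            a.head = 5;
            heap.add(a);
        } else {
            a.head = 5;       // the heap's internal order is now stale
        }
        return heap.poll().name;
    }
}
```

 Without re-insertion the queue still returns A, although B now has the smaller head; removing and re-adding the changed entry restores the order.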

[jira] [Commented] (HBASE-5141) Memory leak in MonitoredRPCHandlerImpl


[ 
https://issues.apache.org/jira/browse/HBASE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183970#comment-13183970
 ] 

Hudson commented on HBASE-5141:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5141 Memory leak in MonitoredRPCHandlerImpl -- REDO
HBASE-5141 Memory leak in MonitoredRPCHandlerImpl -- REVERT. OVER-COMMITTED.  
REVERTING ALL SO CAN REDO COMMIT
HBASE-5141 Memory leak in MonitoredRPCHandlerImpl

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/monitoring/MonitoredRPCHandlerImpl.java

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/monitoring/MonitoredRPCHandlerImpl.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/monitoring/MonitoredRPCHandlerImpl.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java


 Memory leak in MonitoredRPCHandlerImpl
 --

 Key: HBASE-5141
 URL: https://issues.apache.org/jira/browse/HBASE-5141
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Priority: Blocker
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-5141-v2.patch, HBASE-5141.patch, Screen Shot 
 2012-01-06 at 3.03.09 PM.png


 I got a pretty reliable way of OOME'ing my region servers. Using a big 
 payload (64MB in my case), a default heap and default number of handlers, 
 it doesn't take long before all the MonitoredRPCHandlerImpl instances hold 
 on to a 64MB reference, and once a compaction kicks in it kills everything.
 The issue is that even after the RPC call is done, the packet still lives in 
 MonitoredRPCHandlerImpl.
 Will attach a screen shot of jprofiler's analysis in a moment.
 This is a blocker for 0.92.0, anyone using a high number of handlers and 
 bigish values will kill themselves.
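
 The general shape of this kind of leak, and of the usual remedy, can be sketched as follows. This is an illustration under the assumption that the status object keeps a field pointing at the request; RpcStatusSketch and its methods are hypothetical names, not the committed change.

```java
// Hypothetical status object: it keeps a reference to the request while the
// call runs and must drop it when the call completes.
final class RpcStatusSketch {
    private Object packet;            // may point at a very large request buffer
    private String state = "WAITING";

    void setRpcPacket(Object packet) {
        this.packet = packet;
        this.state = "RUNNING";
    }

    void markComplete() {
        this.state = "COMPLETE";
        this.packet = null;           // release the payload so it can be collected
    }

    boolean holdsPacket() { return packet != null; }

    String state() { return state; }
}
```

 If markComplete() did not clear the field, every idle handler would pin its last payload in memory until the next call replaced it.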

[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error


[ 
https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183973#comment-13183973
 ] 

Hudson commented on HBASE-5041:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5041  Major compaction on non existing table does not throw error 
(Shrijeet)

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java


 Major compaction on non existing table does not throw error 
 

 Key: HBASE-5041
 URL: https://issues.apache.org/jira/browse/HBASE-5041
 Project: HBase
  Issue Type: Bug
  Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal
Assignee: Shrijeet Paliwal
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 
 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 
 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 
 0003-HBASE-5041-Throw-error-if-table-does-not-exist.0.90.patch


 The following will not complain even if fubar does not exist:
 {code}
 echo major_compact 'fubar' | $HBASE_HOME/bin/hbase shell
 {code}
 The downside of this defect is that a major compaction may be silently 
 skipped because of a typo by Ops.

[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface


[ 
https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183974#comment-13183974
 ] 

Hudson commented on HBASE-5134:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5134 Remove getRegionServerWithoutRetries and 
getRegionServerWithRetries from HConnection Interface

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HBaseConfiguration.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/ConnectionUtils.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnection.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/MetaScanner.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/ServerCallable.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/ExecRPCInvoker.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/HConnectionTestingUtility.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFilesSplitRecovery.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestCatalogJanitor.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java


 Remove getRegionServerWithoutRetries and getRegionServerWithRetries from 
 HConnection Interface
 --

 Key: HBASE-5134
 URL: https://issues.apache.org/jira/browse/HBASE-5134
 Project: HBase
  Issue Type: Improvement
Reporter: stack
Assignee: stack
 Fix For: 0.94.0

 Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 
 5134-v6.txt, 5134-v6.txt


 It's broken having these meta methods in HConnection.  They take 
 ServerCallables, which themselves inevitably have HConnections.   It makes for 
 a tangle in the model and frustrates being able to do mocked implementations 
 of HConnection.  These methods better belong in something like 
 HConnectionManager, or elsewhere altogether.

[jira] [Commented] (HBASE-5172) HTableInterface should extend java.io.Closeable


[ 
https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183976#comment-13183976
 ] 

Hudson commented on HBASE-5172:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5172 HTableInterface should extend java.io.Closeable

stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HTableInterface.java


 HTableInterface should extend java.io.Closeable
 ---

 Key: HBASE-5172
 URL: https://issues.apache.org/jira/browse/HBASE-5172
 Project: HBase
  Issue Type: Bug
Reporter: Zhihong Yu
Assignee: stack
 Fix For: 0.94.0

 Attachments: 5172.txt


 Ioan Eugen Stan found this issue.

[jira] [Commented] (HBASE-5173) Commit hbase-4480 findHangingTest.sh script under dev-support


[ 
https://issues.apache.org/jira/browse/HBASE-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183979#comment-13183979
 ] 

Hudson commented on HBASE-5173:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5173 Commit hbase-4480 findHangingTest.sh script under dev-support

stack : 
Files : 
* /hbase/trunk/dev-support/findHangingTest.sh


 Commit hbase-4480 findHangingTest.sh script under dev-support
 -

 Key: HBASE-5173
 URL: https://issues.apache.org/jira/browse/HBASE-5173
 Project: HBase
  Issue Type: Task
Reporter: stack
 Fix For: 0.94.0

 Attachments: 5173.txt


 See hbase-4480 for the script from Ted

[jira] [Commented] (HBASE-5088) A concurrency issue on SoftValueSortedMap


[ 
https://issues.apache.org/jira/browse/HBASE-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183980#comment-13183980
 ] 

Hudson commented on HBASE-5088:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5088  addendum
HBASE-5088 A concurrency issue on SoftValueSortedMap (Jieshan Bean and Lars H)

larsh : 
Files : 
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/SoftValueSortedMap.java

larsh : 
Files : 
* /hbase/trunk/CHANGES.txt
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/SoftValueSortedMap.java


 A concurrency issue on SoftValueSortedMap
 -

 Key: HBASE-5088
 URL: https://issues.apache.org/jira/browse/HBASE-5088
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4, 0.94.0
Reporter: Jieshan Bean
Assignee: Lars Hofhansl
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5088-0.90.txt, 5088-0.92-trunk-addendum.txt, 
 5088-final3.txt, HBase-5088-90.patch, HBase-5088-trunk.patch, 
 HBase5088-90-replaceSoftValueSortedMap.patch, 
 HBase5088-90-replaceTreeMap.patch, HBase5088-trunk-replaceTreeMap.patch, 
 HBase5088Reproduce.java, PerformanceTestResults.png


 SoftValueSortedMap is backed by a TreeMap. All the methods in this class are 
 synchronized. If we use these methods to add/delete elements, it's OK.
 But HConnectionManager#getCachedLocation uses headMap to get a view 
 from SoftValueSortedMap#internalMap. Once we operate 
 on this view map (e.g. add/delete) in other threads, a concurrency issue may 
 occur.
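
 The point about headMap can be seen with a plain TreeMap: the returned SortedMap is a live view of the backing map, so handing it out of a synchronized method does not protect later reads. The sketch below shows the risky pattern next to one possible alternative; HeadMapViewDemo is a made-up class, not SoftValueSortedMap itself.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class HeadMapViewDemo {
    private final TreeMap<String, String> internalMap = new TreeMap<>();

    synchronized void put(String key, String value) {
        internalMap.put(key, value);
    }

    // Risky pattern: the view escapes the lock, yet it still reads the
    // TreeMap that other threads modify through put().
    synchronized SortedMap<String, String> headMapView(String toKey) {
        return internalMap.headMap(toKey);
    }

    // Safer pattern: finish the lookup while holding the lock and return
    // a plain value instead of a view.
    synchronized String lastValueBefore(String toKey) {
        SortedMap<String, String> head = internalMap.headMap(toKey);
        return head.isEmpty() ? null : head.get(head.lastKey());
    }
}
```

 A view obtained before a later put() sees the new entry, which shows it is backed by the same unsynchronized structure.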

[jira] [Commented] (HBASE-4480) Testing script to simplify local testing


[ 
https://issues.apache.org/jira/browse/HBASE-4480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183977#comment-13183977
 ] 

Hudson commented on HBASE-4480:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5173 Commit hbase-4480 findHangingTest.sh script under dev-support


 Testing script to simplify local testing
 

 Key: HBASE-4480
 URL: https://issues.apache.org/jira/browse/HBASE-4480
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.4
Reporter: Jesse Yates
Priority: Minor
  Labels: test
 Fix For: 0.94.0

 Attachments: HBASE-4480.patch, HBASE-4480_v2.patch, 
 HBASE-4480_v3.patch, HBASE-4480_v4.patch, findHangingTest.sh, 
 runtest-no-npe-check.sh, runtest.sh, runtest2.sh


 As mentioned by http://search-hadoop.com/m/r2Ab624ES3e and 
 http://search-hadoop.com/m/cZjDH1ykGIA it would be nice if we could have a 
 script that would handle more of the finer points of running/checking our 
 test suite.
 This script should:
 (1) Allow people to determine which tests are hanging/taking a long time to 
 run
 (2) Allow rerunning of particular tests to make sure it wasn't an artifact of 
 running the whole suite that caused the failure
 (3) Allow people to specify to run just unit tests or also integration tests 
 (essentially wrapping calls to 'maven test' and 'maven verify').
 This script should just be a convenience script - running tests directly from 
 maven should not be impacted.

[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException


[ 
https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183978#comment-13183978
 ] 

Hudson commented on HBASE-5137:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5137 MasterFileSystem.splitLog() should abort even if 
waitOnSafeMode() throws IOException (Ram & Ted)

ramkrishna : 
Files : 
* /hbase/trunk/CHANGES.txt
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java


 MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws 
 IOException
 

 Key: HBASE-5137
 URL: https://issues.apache.org/jira/browse/HBASE-5137
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
 Fix For: 0.92.0, 0.90.6

 Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch


 I am not sure if this bug was already raised in JIRA.
 In our test cluster we had a scenario where the RS had gone down and 
 ServerShutdownHandler started splitLog.
 But as HDFS was down, the waitOnSafeMode check threw an IOException.
 {code}
 try {
   // If FS is in safe mode, just wait till out of it.
   FSUtils.waitOnSafeMode(conf,
     conf.getInt(HConstants.THREAD_WAKE_FREQUENCY, 1000));
   splitter.splitLog();
 } catch (OrphanHLogAfterSplitException e) {
 {code}
 We catch the exception
 {code}
 } catch (IOException e) {
   checkFileSystem();
   LOG.error("Failed splitting " + logDir.toString(), e);
 }
 {code}
 So the HLog split itself did not happen. We encountered about 4 regions that 
 had recently been split on the crashed RS and were lost.
 Can we abort the Master in such scenarios? Please suggest.
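
 What the reporter asks for could look roughly like the following. It is a sketch only: the three interfaces are hypothetical stand-ins for the file system check, the log splitter and the master, and nothing here is taken from the attached patches.

```java
import java.io.IOException;

final class SplitLogSketch {
    // Hypothetical stand-ins so the example is self-contained.
    interface FileSystemCheck { void waitOnSafeMode() throws IOException; }
    interface LogSplitter { void splitLog() throws IOException; }
    interface Abortable { void abort(String why, Throwable cause); }

    // Returns true if the split ran. On IOException it asks the master to
    // abort instead of logging and carrying on as if the logs had been split.
    static boolean splitOrAbort(FileSystemCheck fs, LogSplitter splitter, Abortable master) {
        try {
            fs.waitOnSafeMode();
            splitter.splitLog();
            return true;
        } catch (IOException e) {
            master.abort("Failed splitting logs", e);
            return false;
        }
    }
}
```

 The design choice is to treat an unsplit log as fatal, because continuing would let regions be reassigned without their edits being replayed.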

[jira] [Commented] (HBASE-3949) Add Master link to RegionServer pages


[ 
https://issues.apache.org/jira/browse/HBASE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183975#comment-13183975
 ] 

Hudson commented on HBASE-3949:
---

Integrated in HBase-TRUNK #2617 (See 
[https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-3949. Add Master link to RegionServer pages. Contributed by Gregory 
Chanan.

todd : 
Files : 
* 
/hbase/trunk/src/main/jamon/org/apache/hbase/tmpl/regionserver/RSStatusTmpl.jamon
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSStatusServlet.java


 Add Master link to RegionServer pages
 ---

 Key: HBASE-3949
 URL: https://issues.apache.org/jira/browse/HBASE-3949
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.90.3, 0.92.0
Reporter: Lars George
Assignee: Gregory Chanan
Priority: Minor
  Labels: noob
 Fix For: 0.94.0


 Use the ZK info where the master is to add a UI link on the top of each 
 RegionServer page. Currently you cannot navigate directly to the Master UI 
 once you are on a RS page.
 Not sure if the info port is exposed OTTOMH, but we could either use the RS 
 local config setting for that or add it to ZK to enable lookup.

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-11 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183989#comment-13183989
]

Hadoop QA commented on HBASE-5120:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510165/HBASE-5120_4.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified
tests.
Please justify why no new tests are needed for this
patch.
Also please list what manual steps were performed to
verify this patch.

-1 javadoc. The javadoc tool appears to have generated -147 warning
messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

-1 findbugs. The patch appears to introduce 79 new Findbugs (version
1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
org.apache.hadoop.hbase.io.hfile.TestLruBlockCache
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/726//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/726//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/726//console

This message is automatically generated.

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
Fix For: 0.94.0, 0.92.1

Attachments: HBASE-5120.patch, HBASE-5120_1.patch,
HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch

Here is what J-D described here:
https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
I think I will retract from my statement that it used to be extremely racy
and caused more troubles than it fixed, on my first test I got a stuck
region in transition instead of being able to recover. The timeout was set to
2 minutes to be sure I hit it.
First the region gets closed
{quote}
2012-01-04 00:16:25,811 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
sv4r5s38,62023,1325635980913 for region
test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
{quote}
2 minutes later it times out:
{quote}
2012-01-04 00:18:30,026 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
state=PENDING_CLOSE, ts=1325636185810, server=null
2012-01-04 00:18:30,026 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_CLOSE for too long, running forced unassign again on
region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,027 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
(offlining)
{quote}
100ms later the master finally gets the event:
{quote}
2012-01-04 00:18:30,129 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Handling
transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913,
region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
2012-01-04 00:18:30,129 DEBUG
org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
event for 1a4b111bcc228043e89f59c4c3f6a791
2012-01-04 00:18:30,129 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so
deleting ZK node and removing from regions in transition, skipping assignment
of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db03587 Deleting existing unassigned node for
1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db03587 Successfully deleted unassigned node for
region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
{quote}
At this point everything is fine, the region was

[jira] [Updated] (HBASE-5153) HConnection re-creation in HTable after HConnection abort

2012-01-11 Thread Jieshan Bean (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jieshan Bean updated HBASE-5153:


Attachment: HBASE-5153-V3.patch

 HConnection re-creation in HTable after HConnection abort
 -

 Key: HBASE-5153
 URL: https://issues.apache.org/jira/browse/HBASE-5153
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: Jieshan Bean
Assignee: Jieshan Bean
 Fix For: 0.90.6

 Attachments: HBASE-5153-V2.patch, HBASE-5153-V3.patch, 
 HBASE-5153.patch


 HBASE-4893 is related to this issue. From that issue we know that if multiple 
 threads share the same connection, once this connection is aborted in one 
 thread, the other threads will get a 
 "HConnectionManager$HConnectionImplementation@18fb1f7 closed" exception.
 It solves the problem that a stale connection can't be removed. But the original 
 HTable instance can no longer be used. The connection in HTable should be 
 recreated.
 Actually, there are two approaches to solve this:
 1. In user code, once an IOE is caught, close the connection and re-create the 
 HTable instance. We can use this as a workaround.
 2. On the HBase client side, catch this exception and re-create the connection.
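
 The first approach (the user-side workaround) might be sketched as below. Table and TableFactory are hypothetical stand-ins used to keep the example self-contained; they are not the HBase client API, and a real caller would decide which IOExceptions justify a rebuild.

```java
import java.io.Closeable;
import java.io.IOException;

final class RecreateOnFailure {
    // Hypothetical stand-ins for a table handle and the code that builds one.
    interface Table extends Closeable { String get(String row) throws IOException; }
    interface TableFactory { Table create() throws IOException; }

    private final TableFactory factory;
    private Table table;

    RecreateOnFailure(TableFactory factory) throws IOException {
        this.factory = factory;
        this.table = factory.create();
    }

    // On IOException: close the old handle, build a fresh one, retry once.
    String get(String row) throws IOException {
        try {
            return table.get(row);
        } catch (IOException first) {
            try {
                table.close();
            } catch (IOException ignored) {
                // the old handle is already unusable; nothing more to do
            }
            table = factory.create();
            return table.get(row);
        }
    }
}
```

 Retrying only once keeps a persistent failure visible to the caller instead of looping forever.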

[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.

[
https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184031#comment-13184031
]

jirapos...@reviews.apache.org commented on HBASE-5128:
--

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3435/
---

(Updated 2012-01-11 12:46:37.524636)

Review request for hbase, Todd Lipcon, Ted Yu, Michael Stack, and Jean-Daniel
Cryans.

Changes
---

Fixed bug link. Added JD.

JD -- the code that is similar to merging is

- #handleOverlapGroup
- inMeta && !inHdfs && isDeployed (in another rev I've added an unassign and
believe I still have the disable/delete problem).

Summary
---

I'm posting a preliminary version that I'm currently testing on real clusters.
The tests are flaky on the 0.90 branch (so there is something async that I
didn't synchronize properly), and there are a few more TODOs I want to knock
out before this is ready for full review to be considered for committing. It's
got some problems I need some advice figuring out.

Problem 1:

In the unit tests, I have a few cases where I fabricate new regions and try to
force the overlapping regions to be closed. For some of these, I cannot delete
a table after it is repaired without causing subsequent tests to fail. I think
this is due to a few things:

1) The disable table handler uses in-memory assignment manager state while
delete uses the assignment information in META.
2) Currently I'm using the sneaky closeRegion that purposely doesn't go through
the master and in turn doesn't modify in-memory state, so disable uses
out-of-date in-memory region assignments. If I use the unassign method instead,
it sends RIT transitions to the master, which ends up attempting to assign the
region again, causing timing/transient states.

What is a good way to clear the HMaster's assignment manager's assignment data
for particular regions or to force it to re-read from META? (without modifying
the 0.90 HBase it is meant to repair).

Problem 2:

Sometimes tests fail reporting HOLE_IN_REGION_CHAIN and
SERVER_DOES_NOT_MATCH_META. This means the old and new regions are confused
with each other and basically something is still happening asynchronously. I
think this is because the new region is being assigned and is still
transitioning. Sound about right? To make the unit test deterministic, should
hbck wait for these to settle, or should just the unit test wait?
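
If the waiting ends up on the unit-test side, it could be a small polling
helper along these lines. This is only a sketch: Waiter and waitFor are
hypothetical names, not part of hbck or of the HBase test utilities, and the
condition passed in (for example "hbck reports no errors") is left to the
caller.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; not part of HBase. */
final class Waiter {
  private Waiter() {}

  /**
   * Polls the condition until it returns true or timeoutMs has elapsed.
   * Returns true if the condition was met, false on timeout or interrupt.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    final long start = System.nanoTime();
    final long timeoutNanos = timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - start >= timeoutNanos) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        // Preserve the interrupt for the caller and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A test would call waitFor with a generous timeout after each repair and fail
if it returns false, so a region that is still transitioning delays the test
instead of producing a spurious error.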

This addresses bug HBASE-5128.
https://issues.apache.org/jira/browse/HBASE-5128

Diffs
-

src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 6d3401d
src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java a3d8b8b
src/main/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRepair.java 29e8bb2
src/main/java/org/apache/hadoop/hbase/util/hbck/TableIntegrityErrorHandler.java PRE-CREATION
src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java a640d57
src/test/java/org/apache/hadoop/hbase/util/hbck/HbckTestingUtil.java dbb97f8
src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildBase.java 3e8729d
src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildHole.java 11a1151
src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildOverlap.java 4a09ce2

Diff: https://reviews.apache.org/r/3435/diff

Testing
---

All unit tests pass sometimes. Some fail sometimes (generally the cases that
fabricate new regions).

Not ready for commit.

Thanks,

jmhsieh

[uber hbck] Enable hbck to automatically repair table integrity problems as
well as region consistency problems while online.
-

Key: HBASE-5128
URL: https://issues.apache.org/jira/browse/HBASE-5128
Project: HBase
Issue Type: New Feature
Components: hbck
Affects Versions: 0.92.0, 0.90.5
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh

The current (0.90.5, 0.92.0rc2) versions of hbck detect most region
consistency and table integrity invariant violations. However, with '-fix' they
can only automatically repair region consistency cases having to do with
deployment problems. This updated version should be able to handle all cases
(including a new orphan regiondir case). When complete, it will likely deprecate
the OfflineMetaRepair tool and subsume several open META-hole related issues.
Here's the approach (from the comment at the top of the new version of the
file).
{code}
/**
* HBaseFsck (hbck) is a tool for checking and repairing region consistency and
* table integrity.
*
* Region consistency checks

[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort

[
https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184033#comment-13184033
]

Hadoop QA commented on HBASE-5153:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510179/HBASE-5153-V3.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/727//console

This message is automatically generated.

HConnection re-creation in HTable after HConnection abort
-

Key: HBASE-5153
URL: https://issues.apache.org/jira/browse/HBASE-5153
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.4
Reporter: Jieshan Bean
Assignee: Jieshan Bean
Fix For: 0.90.6

Attachments: HBASE-5153-V2.patch, HBASE-5153-V3.patch,
HBASE-5153.patch

HBASE-4893 is related to this issue. From that issue we know that if multiple
threads share the same connection, once the connection is aborted in one
thread, the other threads will get a
HConnectionManager$HConnectionImplementation@18fb1f7 closed exception.
That fix solved the problem of stale connections not being removed, but the
original HTable instance cannot continue to be used. The connection in HTable
should be recreated.
There are two approaches to solve this:
1. In user code, once an IOE is caught, close the connection and re-create the
HTable instance. We can use this as a workaround.
2. On the HBase client side, catch this exception and re-create the connection.
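
Approach 1 could look roughly like the sketch below. It deliberately does not
use the HBase client classes: TableHandle, TableHandleFactory and
RecreatingTable are hypothetical stand-ins, and the factory is assumed to open
a fresh connection each time it is called. It only illustrates the
"catch the IOE, close, re-create, retry once" pattern.

```java
import java.io.IOException;

/** Hypothetical stand-in for a table handle such as HTable; not an HBase API. */
interface TableHandle {
  String get(String row) throws IOException;

  /** No-op by default so that simple handles can be written as lambdas. */
  default void close() throws IOException {}
}

/** Hypothetical factory, assumed to build a handle on a fresh connection. */
interface TableHandleFactory {
  TableHandle create() throws IOException;
}

final class RecreatingTable {
  private final TableHandleFactory factory;
  private TableHandle current;

  RecreatingTable(TableHandleFactory factory) throws IOException {
    this.factory = factory;
    this.current = factory.create();
  }

  /** Tries once; on IOException closes the stale handle, re-creates it and retries once. */
  synchronized String get(String row) throws IOException {
    try {
      return current.get(row);
    } catch (IOException first) {
      try {
        current.close();
      } catch (IOException ignored) {
        // The handle is already unusable; closing is best effort.
      }
      current = factory.create();
      return current.get(row); // a second failure propagates to the caller
    }
  }
}
```

Retrying only once keeps a persistent outage from turning into an endless
re-creation loop; a second IOException is reported to the caller as usual.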


[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort


[ 
https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184092#comment-13184092
 ] 

Zhihong Yu commented on HBASE-5153:
---

@Jieshan:
Can you prepare a patch for trunk ?


[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)


 [ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5163:
--

Summary: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on 
Jenkins or hadoop QA (The directory is already locked.)  (was: 
TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
hadoop QA on trunk (The directory is already locked.))


[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)


[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184113#comment-13184113
 ] 

Zhihong Yu commented on HBASE-5163:
---

Integrated to TRUNK.

Thanks for the patch, N.

Thanks for the review, Stack.

 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or 
 hadoop QA (The directory is already locked.)
 --

 Key: HBASE-5163
 URL: https://issues.apache.org/jira/browse/HBASE-5163
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5163.patch


 The stack is typically:
 {noformat}
 error message=Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked. type=java.io.IOException
 java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
   at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
   at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
 // ...
 {noformat}
 It can be reproduced without parallelization or without executing the other 
 tests in the class. It seems to fail about 5% of the time.
 This comes from the naming policy for the directories in 
 MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
 in the cluster, and does not take into account previous starts/stops:
 {noformat}
 for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
   if (manageDfsDirs) {
     File dir1 = new File(data_dir, "data"+(2*i+1));
     File dir2 = new File(data_dir, "data"+(2*i+2));
     dir1.mkdirs();
     dir2.mkdirs();
     // [...]
 {noformat}
 This means that if we want to stop/start a datanode, we should always stop 
 the last one; if not, the names will conflict. This test exhibits the behavior:
 {noformat}
   @Test
   public void testMiniDFSCluster_startDataNode() throws Exception {
     assertTrue( dfsCluster.getDataNodes().size() == 2 );
     // Works, as we kill the last datanode, we can now start a datanode
     dfsCluster.stopDataNode(1);
     dfsCluster
       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
     // Fails, as it's not the last datanode, the directory will conflict on
     //  creation
     dfsCluster.stopDataNode(0);
     try {
       dfsCluster
         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
       fail("There should be an exception because the directory already exists");
     } catch (IOException e) {
       assertTrue( e.getMessage().contains("The directory is already locked."));
       LOG.info("Expected (!) exception caught " + e.getMessage());
     }
     // Works, as we kill the last datanode, we can now restart 2 datanodes
     // This makes us back with 2 nodes
     dfsCluster.stopDataNode(0);
     dfsCluster
       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
   }
 {noformat}
 And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
 because when we do
 {noformat}
 DatanodeInfo[] pipeline = getPipeline(log);
 assertTrue(pipeline.length == fs.getDefaultReplication());
 {noformat}
 and then kill the datanodes in the pipeline, we will have:
  - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a 
 new datanode that will reuse the available 2's directory.
  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new 
 datanode, it fails because it wants to use the same directory as the still 
 alive '2'.
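
 The two cases can be checked with a small simulation of the naming policy.
 This is a sketch, not MiniDFSCluster code: DataDirNaming, dirsFor and
 restartCollides are hypothetical names, and it assumes a cluster of three
 datanodes where the node started at 0-based position i holds data(2*i+1) and
 data(2*i+2), as in the loop quoted above. Datanodes 1, 2 and 3 in the text
 correspond to positions 0, 1 and 2.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the directory naming described above; not Hadoop code. */
final class DataDirNaming {
  private DataDirNaming() {}

  /** Directory names used by the datanode started at 0-based position i. */
  static List<String> dirsFor(int i) {
    List<String> dirs = new ArrayList<>();
    dirs.add("data" + (2 * i + 1));
    dirs.add("data" + (2 * i + 2));
    return dirs;
  }

  /**
   * Returns true if starting one more datanode would pick a directory that is
   * still locked by a live node. liveStartPositions holds the positions at
   * which the currently live nodes were originally started.
   */
  static boolean restartCollides(List<Integer> liveStartPositions) {
    Set<String> locked = new HashSet<>();
    for (int pos : liveStartPositions) {
      locked.addAll(dirsFor(pos));
    }
    // The new node is numbered from the count of live nodes only.
    int curDatanodesNum = liveStartPositions.size();
    for (String dir : dirsFor(curDatanodesNum)) {
      if (locked.contains(dir)) {
        return true;
      }
    }
    return false;
  }
}
```

 With only position 2 alive (datanodes 1 and 2 killed) the new node takes
 data3 and data4, which are free. With only position 1 alive (datanodes 1 and
 3 killed) the new node asks for the same data3 and data4 that the surviving
 node still holds, which matches the locked data3 in the stack above.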
 There are two

[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184120#comment-13184120
 ] 

Zhihong Yu commented on HBASE-5179:
---

{code}
+  private final Set<ServerName> processingDeadServers = new HashSet<ServerName>();
{code}
The field name above sounds like a method name. How about naming it 
deadServersUnderProcessing? Related method names should be changed as well.

{code}
+   * Called on startup. Figures whether a fresh cluster start of we are joining
{code}
should read 'start or we are'.

For ServerManager.java and DeadServer.java:
{code}
+  public Set<ServerName> getProcessingDeadServers() {
+    return this.deadservers.cloneProcessingDeadServers();
+  }
{code}
The method should be called cloneDeadServersUnderProcessing().

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server per step 3.
 However, during step 4 (assigning regions), ServerShutdownHandler may still 
 be splitting the log, so data loss may result.
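
 One way to picture the guard discussed in the review is a set of dead servers
 that are still being processed, consulted before failover assigns anything.
 The sketch below is illustrative only: DeadServerTracker and its method names
 are hypothetical and server names are plain strings; it is not the code in
 hbase-5179.patch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker of dead servers whose logs are still being split. */
final class DeadServerTracker {
  private final Set<String> underProcessing = new HashSet<>();

  /** Called when shutdown handling (including log splitting) starts for a server. */
  synchronized void startProcessing(String serverName) {
    underProcessing.add(serverName);
  }

  /** Called once log splitting for the server has completed. */
  synchronized void finishProcessing(String serverName) {
    underProcessing.remove(serverName);
  }

  /** Failover code would assign a dead server's regions only when this returns true. */
  synchronized boolean safeToAssignRegionsOf(String serverName) {
    return !underProcessing.contains(serverName);
  }

  /** Returns a copy so callers can iterate without holding the lock. */
  synchronized Set<String> cloneUnderProcessing() {
    return new HashSet<>(underProcessing);
  }
}
```

 In the scenario above, step 4 would skip RegionserverA while it is in the set
 and leave its regions to ServerShutdownHandler, which assigns them after the
 log split.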


[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Status: Patch Available  (was: Open)


[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)


[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184147#comment-13184147
 ] 

Hudson commented on HBASE-5163:
---

Integrated in HBase-TRUNK #2618 (See 
[https://builds.apache.org/job/HBase-TRUNK/2618/])
HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on 
Jenkins or hadoop QA (The directory is already locked.) (N Keywal)

tedyu : 
Files : 
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-11 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184155#comment-13184155
]

Hadoop QA commented on HBASE-5179:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510164/hbase-5179.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 javadoc. The javadoc tool appears to have generated -147 warning
messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

-1 findbugs. The patch appears to introduce 78 new Findbugs (version
1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.mapreduce.TestImportTsv
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/728//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/728//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/728//console

This message is automatically generated.


[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted


[ 
https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184164#comment-13184164
 ] 

ramkrishna.s.vasudevan commented on HBASE-5155:
---

I could not upload the patch today as some test cases are still failing.  Will 
upload it tomorrow.

 ServerShutDownHandler And Disable/Delete should not happen parallely leading 
 to recreation of regions that were deleted
 ---

 Key: HBASE-5155
 URL: https://issues.apache.org/jira/browse/HBASE-5155
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.4
Reporter: ramkrishna.s.vasudevan
Priority: Blocker

 ServerShutdownHandler and the disable/delete table handlers race.  This is 
 not an issue due to TM.
 - A regionserver goes down.  In our cluster the regionserver holds a lot of 
 regions.
 - A region R1 has two daughters, D1 and D2.
 - The ServerShutdownHandler gets called, scans META and gets all the 
 user regions.
 - In parallel, a table is disabled. (No problem in this step.)
 - Delete table is done.
 - The table and its regions are deleted, including R1, D1 and D2 (so META 
 is cleaned).
 - Now ServerShutdownHandler starts processDeadRegion:
 {code}
  if (hri.isOffline() && hri.isSplit()) {
    LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
      "; checking daughter presence");
    fixupDaughters(result, assignmentManager, catalogTracker);
 {code}
 As part of fixupDaughters, since the daughters D1 and D2 are missing for R1, 
 {code}
 if (isDaughterMissing(catalogTracker, daughter)) {
   LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
   MetaEditor.addDaughter(catalogTracker, daughter, null);
   // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
   // there then something wonky about the split -- things will keep going
   // but could be missing references to parent region.
   // And assign it.
   assignmentManager.assign(daughter, true);
 {code}
 we call assign on the daughters.  
 After this we continue with the code below.
 {code}
 if (processDeadRegion(e.getKey(), e.getValue(),
 this.services.getAssignmentManager(),
 this.server.getCatalogTracker())) {
   this.services.getAssignmentManager().assign(e.getKey(), true);
 {code}
 When the SSH scanned META it still had R1, D1 and D2.
 So as part of the above code, D1 and D2, which were assigned by 
 fixupDaughters, are assigned again by 
 {code}
 this.services.getAssignmentManager().assign(e.getKey(), true);
 {code}
 This leads to a ZooKeeper bad-version error that kills the master.
 More importantly, the regions that were deleted are re-created, which 
 I think is more critical.
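
 The re-creation can be reproduced in miniature by modelling META as a set of
 region names. Everything in this sketch is hypothetical (StaleScanDemo,
 naiveFixup, guardedFixup); in particular the guarded variant is just one
 conceivable check and is not taken from any patch on this issue.

```java
import java.util.Set;

/** Toy model of the race: META is a set of region names. Not HBase code. */
final class StaleScanDemo {
  private StaleScanDemo() {}

  /**
   * Mirrors the quoted fixup: the parent comes from the earlier scan, while
   * daughter presence is checked against live META, so missing daughters are
   * added back.
   */
  static void naiveFixup(Set<String> liveMeta, Set<String> scanned,
      String parent, String... daughters) {
    if (!scanned.contains(parent)) {
      return;
    }
    for (String daughter : daughters) {
      if (!liveMeta.contains(daughter)) {
        liveMeta.add(daughter); // re-creates a region the delete just removed
      }
    }
  }

  /** Assumed guard: skip the fixup when the parent is no longer in live META. */
  static void guardedFixup(Set<String> liveMeta, Set<String> scanned,
      String parent, String... daughters) {
    if (!liveMeta.contains(parent)) {
      return;
    }
    naiveFixup(liveMeta, scanned, parent, daughters);
  }
}
```

 If the scan saw R1, D1 and D2 and the table is then deleted, naiveFixup puts
 D1 and D2 back into the empty META, while guardedFixup leaves it empty.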


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-11 Thread ramkrishna.s.vasudevan (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184170#comment-13184170
 ] 

ramkrishna.s.vasudevan commented on HBASE-5179:
---

@Chunhui
Is this issue applicable for 0.90.6? If so can you prepare a patch for 0.90 
also?


[jira] [Updated] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)


 [ 
https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5115:
-

Attachment: 01_orange.svg
01_orange.png

 Change HBase color from purple to International Orange (Engineering)
 

 Key: HBASE-5115
 URL: https://issues.apache.org/jira/browse/HBASE-5115
 Project: HBase
  Issue Type: Task
Reporter: stack
Assignee: stack
 Attachments: 01_orange.png, 01_orange.svg


 See http://en.wikipedia.org/wiki/International_orange, in particular the bit 
 about the color of the Golden Gate Bridge.


[jira] [Assigned] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)

2012-01-11 Thread stack (Assigned) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reassigned HBASE-5115:


Assignee: stack

 Change HBase color from purple to International Orange (Engineering)
 

 Key: HBASE-5115
 URL: https://issues.apache.org/jira/browse/HBASE-5115
 Project: HBase
  Issue Type: Task
Reporter: stack
Assignee: stack
 Attachments: 01_orange.png, 01_orange.svg


 See http://en.wikipedia.org/wiki/International_orange  See the bit about the 
 color of the golden gate bridge.


[jira] [Commented] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)


[ 
https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184171#comment-13184171
 ] 

stack commented on HBASE-5115:
--

Here is the logo done in International Orange (Engineering).

 Change HBase color from purple to International Orange (Engineering)
 

 Key: HBASE-5115
 URL: https://issues.apache.org/jira/browse/HBASE-5115
 Project: HBase
  Issue Type: Task
Reporter: stack
Assignee: stack
 Attachments: 01_orange.png, 01_orange.svg, H_orange.png, H_orange.svg


 See http://en.wikipedia.org/wiki/International_orange  See the bit about the 
 color of the golden gate bridge.


[jira] [Updated] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)

2012-01-11 Thread ramkrishna.s.vasudevan (Commented) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5115:
-

Attachment: H_orange.svg
H_orange.png

 Change HBase color from purple to International Orange (Engineering)
 

 Key: HBASE-5115
 URL: https://issues.apache.org/jira/browse/HBASE-5115
 Project: HBase
  Issue Type: Task
Reporter: stack
Assignee: stack
 Attachments: 01_orange.png, 01_orange.svg, H_orange.png, H_orange.svg


 See http://en.wikipedia.org/wiki/International_orange  See the bit about the 
 color of the golden gate bridge.


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184181#comment-13184181
]

ramkrishna.s.vasudevan commented on HBASE-5179:
---

@Chunhui
Can you take a look at HBASE-4748? It is similar to this one, but there the data
loss involved META, which makes it more critical. It is quite rare
but still possible. Do you have any suggestions for that?

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Updated] (HBASE-3565) Add a metric to keep track of slow HLog appends


 [ 
https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-3565:
--

Status: Patch Available  (was: Open)

 Add a metric to keep track of slow HLog appends
 ---

 Key: HBASE-3565
 URL: https://issues.apache.org/jira/browse/HBASE-3565
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
  Labels: monitoring
 Fix For: 0.94.0

 Attachments: HBASE-3565.trunk.v1.patch


 Whenever an edit takes too long to be written to an HLog, HBase logs a 
 warning such as this one:
 {code}
 2011-02-23 20:03:14,703 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
 IPC Server handler 21 on 60020 took 15065ms appending an edit to hlog; 
 editcount=126050
 {code}
 I would like to have a counter incremented each time this happens and this 
 counter exposed via the metrics stuff in HBase so I can collect it in my 
 monitoring system.
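
 The request above amounts to counting appends whose latency exceeds a 
 threshold and exposing that count. A minimal Java sketch follows, assuming a 
 fixed threshold in milliseconds. SlowAppendCounter is a hypothetical class for 
 illustration; it is not the attached patch and does not use the HBase metrics 
 classes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counter for slow log appends; not an HBase class. */
class SlowAppendCounter {
  private final long thresholdMs;
  // Thread-safe count of appends that took longer than the threshold.
  private final AtomicLong slowAppends = new AtomicLong();

  SlowAppendCounter(long thresholdMs) {
    this.thresholdMs = thresholdMs;
  }

  /** Records one append; increments the counter only when it was slow. */
  void recordAppend(long elapsedMs) {
    if (elapsedMs > thresholdMs) {
      slowAppends.incrementAndGet();
    }
  }

  /** Value a monitoring system could poll. */
  long getSlowAppendCount() {
    return slowAppends.get();
  }
}
```

 An AtomicLong is used here because many handler threads may append 
 concurrently.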


[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-11 Thread ramkrishna.s.vasudevan (Commented) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Attachment: 5179-v2.txt

Chunhui's patch for TRUNK with minor renaming.

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 is assigning regions, ServerShutdownHandler may still be 
 splitting the log, so data loss may result.


[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler


[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184203#comment-13184203
 ] 

ramkrishna.s.vasudevan commented on HBASE-5120:
---

The latest patch is available.

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch


 Here is what J-D described here:
 https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
 I think I will retract from my statement that it used to be extremely racy 
 and caused more troubles than it fixed, on my first test I got a stuck 
 region in transition instead of being able to recover. The timeout was set to 
 2 minutes to be sure I hit it.
 First the region gets closed
 {quote}
 2012-01-04 00:16:25,811 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 sv4r5s38,62023,1325635980913 for region 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 {quote}
 2 minutes later it times out:
 {quote}
 2012-01-04 00:18:30,026 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636185810, server=null
 2012-01-04 00:18:30,026 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 2012-01-04 00:18:30,027 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 (offlining)
 {quote}
 100ms later the master finally gets the event:
 {quote}
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
 region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
 event for 1a4b111bcc228043e89f59c4c3f6a791
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
 deleting ZK node and removing from regions in transition, skipping assignment 
 of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Deleting existing unassigned node for 
 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
 region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
 {quote}
 At this point everything is fine, the region was processed as closed. But 
 wait, remember that line where it said it was going to force an unassign?
 {quote}
 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Creating unassigned node for 
 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
 2012-01-04 00:18:30,328 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 1a4b111bcc228043e89f59c4c3f6a791
 {quote}
 Now the master is confused, it recreated the RIT znode but the region doesn't 
 even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
 this is what's going on.
 The late ZK notification that the znode was deleted (but it got recreated 
 after):
 {quote}
 2012-01-04 00:19:33,285 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
 deleted.
 {quote}
 Then it prints this, and much later tries to unassign it again:
 {quote}
 2012-01-04 00:19:46,607 DEBUG 
 org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
 to clear regions in transition; 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636310328, server=null
 ...
 2012-01-04 00:20:39,623 DEBUG 
 org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
 to clear regions in transition;

[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-11 Thread Zhihong Yu (Issue Comment Edited) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205
 ] 

Zhihong Yu commented on HBASE-5120:
---

Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch



[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler


[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205
 ] 

Zhihong Yu edited comment on HBASE-5120 at 1/11/12 5:17 PM:


Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
{code}
+LOG.debug("The deletion of the CLOSED node for the region "
++ region.getEncodedName() + " returned " + deleteNode);
{code}

  was (Author: zhi...@ebaysf.com):
Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
  
 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch



[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort


[ 
https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184211#comment-13184211
 ] 

stack commented on HBASE-5153:
--

Patch looks good.  I like your addition of a specific Exception for closed 
state.

Does this have to be public, Jieshan?

{code}
getRegionServerWithRetries
{code}

Same for processBatch and getRegionLocation.

If they are public, they should be in HTableInterface, but they seem to be 
implementation methods rather than something that should be part of the public 
interface.

A style nit (not important, but if you are going to redo the patch you 
might want to address it) is that you do this in 
handleConnectionClosedException:

{code}
+if (ioe instanceof ConnectionClosedException) {
{code}

and the whole method deals with the case where the above is true.  I'd suggest 
that you might do:

{code}
if (!(ioe instanceof ConnectionClosedException)) return;
{code}

... then you save a whole indent and it's clear that the method is all about 
dealing with ConnectionClosedException.

Is it right including this in HTable?

{code}
getPauseTime
{code}

In trunk that is in a new ConnectionUtils class.  Maybe you have to do it for 
0.90?

I'm wondering if the class ConnectionClosedException needs to be public also?  
It's only used in this package, right?




 HConnection re-creation in HTable after HConnection abort
 -

 Key: HBASE-5153
 URL: https://issues.apache.org/jira/browse/HBASE-5153
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: Jieshan Bean
Assignee: Jieshan Bean
 Fix For: 0.90.6

 Attachments: HBASE-5153-V2.patch, HBASE-5153-V3.patch, 
 HBASE-5153.patch


 HBASE-4893 is related to this issue. From that issue we know that if multiple 
 threads share the same connection, once the connection is aborted in one 
 thread, the other threads will get a 
 HConnectionManager$HConnectionImplementation@18fb1f7 closed exception.
 That fix solves the problem of a stale connection that can't be removed, but 
 the original HTable instance can't continue to be used. The connection in 
 HTable should be recreated.
 Actually, there are two approaches to solve this:
 1. In user code, once an IOE is caught, close the connection and re-create the 
 HTable instance. We can use this as a workaround.
 2. On the HBase client side, catch this exception and re-create the connection.
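
 The first approach (the user-code workaround) could look roughly like the Java 
 sketch below. It assumes a single retry and uses a hypothetical Table 
 interface and factory in place of the real HTable and connection classes, so 
 ReconnectingTable and every name in it are illustrative only.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical wrapper that re-creates a table handle after an IOException. */
class ReconnectingTable {
  /** Stand-in for a table handle; not the HBase HTable API. */
  interface Table {
    String get(String row) throws IOException;
    void close() throws IOException;
  }

  private final Supplier<Table> factory;
  private Table current;

  ReconnectingTable(Supplier<Table> factory) {
    this.factory = factory;
    this.current = factory.get();
  }

  /** Tries once; on IOException closes the handle, re-creates it, retries once. */
  String get(String row) throws IOException {
    try {
      return current.get(row);
    } catch (IOException firstFailure) {
      try {
        current.close();
      } catch (IOException ignored) {
        // The old handle is already unusable; nothing more to do with it.
      }
      current = factory.get();
      return current.get(row);
    }
  }
}
```

 A real client would probably want to re-create the handle only for the 
 connection-closed case rather than for every IOException.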


[jira] [Commented] (HBASE-3565) Add a metric to keep track of slow HLog appends

2012-01-11 Thread ramkrishna.s.vasudevan (Updated) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184220#comment-13184220
]

Hadoop QA commented on HBASE-3565:
--

-1 overall. Here are the results of testing the latest attachment

http://issues.apache.org/jira/secure/attachment/12510132/HBASE-3565.trunk.v1.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 javadoc. The javadoc tool appears to have generated -147 warning
messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

-1 findbugs. The patch appears to introduce 78 new Findbugs (version
1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.replication.TestReplicationPeer
org.apache.hadoop.hbase.mapreduce.TestImportTsv
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/729//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/729//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/729//console

This message is automatically generated.

Add a metric to keep track of slow HLog appends
---

Key: HBASE-3565
URL: https://issues.apache.org/jira/browse/HBASE-3565
Project: HBase
Issue Type: Improvement
Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
Labels: monitoring
Fix For: 0.94.0

Attachments: HBASE-3565.trunk.v1.patch

Whenever an edit takes too long to be written to an HLog, HBase logs a
warning such as this one:
{code}
2011-02-23 20:03:14,703 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
IPC Server handler 21 on 60020 took 15065ms appending an edit to hlog;
editcount=126050
{code}
I would like to have a counter incremented each time this happens and this
counter exposed via the metrics stuff in HBase so I can collect it in my
monitoring system.

[jira] [Commented] (HBASE-5150) Fail in a thread may not fail a test, clean up log splitting test

2012-01-11 Thread Jimmy Xiang (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184221#comment-13184221
 ] 

Jimmy Xiang commented on HBASE-5150:


Those failed tests passed on my local box.

 Fail in a thread may not fail a test, clean up log splitting test
 -

 Key: HBASE-5150
 URL: https://issues.apache.org/jira/browse/HBASE-5150
 Project: HBase
  Issue Type: Test
Affects Versions: 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hbase-5150.txt, hbase_5150_v3.patch


 This is to clean up some tests for HBASE-5081.  Calling Assert.fail in a 
 separate thread will terminate that thread, but may not fail the test.
 We can use a Callable so that the error surfaces when we get the result. 
 Some documentation to explain the test would be helpful too.
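
 The Callable idea above can be sketched in plain Java as follows. The helper 
 submits the work to an executor and calls Future.get(), which wraps any 
 failure from the worker thread in an ExecutionException that the calling 
 (test) thread can unwrap and rethrow. ThreadFailureDemo and runAndPropagate 
 are hypothetical names, not taken from the attached patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper showing how a worker-thread failure reaches the caller. */
class ThreadFailureDemo {
  /** Runs the task on another thread and rethrows its failure in the caller. */
  static <T> T runAndPropagate(Callable<T> task) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<T> future = pool.submit(task);
      // get() blocks until the task finishes and reports any failure.
      return future.get();
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof Error) {
        throw (Error) cause; // e.g. an AssertionError from a failed assertion
      }
      if (cause instanceof Exception) {
        throw (Exception) cause;
      }
      throw e;
    } finally {
      pool.shutdownNow();
    }
  }
}
```

 With a bare Thread the AssertionError would only end the worker thread; here 
 it is rethrown in the thread that runs the test.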


[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler


 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Status: Patch Available  (was: Open)

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Assignee: ramkrishna.s.vasudevan
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch



[jira] [Assigned] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-11 Thread ramkrishna.s.vasudevan (Assigned) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan reassigned HBASE-5120:
-

Assignee: ramkrishna.s.vasudevan

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Assignee: ramkrishna.s.vasudevan
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch


 Here is what J-D described here:
 https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
 I think I will retract from my statement that it used to be extremely racy 
 and caused more trouble than it fixed; on my first test I got a stuck 
 region in transition instead of being able to recover. The timeout was set to 
 2 minutes to be sure I hit it.
 First the region gets closed:
 {quote}
 2012-01-04 00:16:25,811 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 sv4r5s38,62023,1325635980913 for region 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 {quote}
 2 minutes later it times out:
 {quote}
 2012-01-04 00:18:30,026 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636185810, server=null
 2012-01-04 00:18:30,026 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 2012-01-04 00:18:30,027 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 (offlining)
 {quote}
 100ms later the master finally gets the event:
 {quote}
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
 region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
 event for 1a4b111bcc228043e89f59c4c3f6a791
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
 deleting ZK node and removing from regions in transition, skipping assignment 
 of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Deleting existing unassigned node for 
 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
 region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
 {quote}
 At this point everything is fine, the region was processed as closed. But 
 wait, remember that line where it said it was going to force an unassign?
 {quote}
 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Creating unassigned node for 
 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
 2012-01-04 00:18:30,328 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 1a4b111bcc228043e89f59c4c3f6a791
 {quote}
 Now the master is confused: it recreated the RIT znode, but the region doesn't 
 even exist anymore. It even tries to shut it down but is blocked by NPEs. Here 
 is what happens next.
 The late ZK notification that the znode was deleted (although it had been 
 recreated in the meantime):
 {quote}
 2012-01-04 00:19:33,285 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
 deleted.
 {quote}
 Then it prints this, and much later tries to unassign it again:
 {quote}
 2012-01-04 00:19:46,607 DEBUG 
 org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
 to clear regions in transition; 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636310328, server=null
 ...
 2012-01-04 00:20:39,623 DEBUG 
 org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
 to clear regions in transition;
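
The interleaving in the logs above (the timeout monitor decides to force an unassign, the late CLOSED event removes the region from RIT, and only then does the forced unassign recreate the znode) can be sketched in a few lines of Java. This is only an illustration and is not taken from any of the attached patches: the class, the method names, and the map standing in for the master's regions-in-transition state are all hypothetical. The idea shown is to re-check, atomically, that the region is still PENDING_CLOSE at the moment the retry actually runs.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the master's regions-in-transition (RIT) bookkeeping.
final class RitRaceSketch {
  static final String PENDING_CLOSE = "PENDING_CLOSE";

  // Encoded region name -> simplified state string.
  private final ConcurrentHashMap<String, String> regionsInTransition =
      new ConcurrentHashMap<>();

  /** Master sent CLOSE to the region server: the region enters RIT. */
  void sendClose(String region) {
    regionsInTransition.put(region, PENDING_CLOSE);
  }

  /** Closed-region handling: the region is closed, so drop it from RIT. */
  void handleClosed(String region) {
    regionsInTransition.remove(region);
  }

  /**
   * Timeout path: run the forced unassign only if the region is still
   * PENDING_CLOSE. computeIfPresent on ConcurrentHashMap is atomic per key,
   * so handleClosed cannot slip in between the check and the action.
   * (A real remapping function should stay short; this is a sketch.)
   */
  boolean retryIfStillPending(String region, Runnable forcedUnassign) {
    boolean[] retried = {false};
    regionsInTransition.computeIfPresent(region, (name, state) -> {
      if (PENDING_CLOSE.equals(state)) {
        forcedUnassign.run();
        retried[0] = true;
      }
      return state;  // leave the state unchanged
    });
    return retried[0];
  }

  public static void main(String[] args) {
    RitRaceSketch master = new RitRaceSketch();
    String region = "1a4b111bcc228043e89f59c4c3f6a791";
    master.sendClose(region);
    master.handleClosed(region);  // the late CLOSED event arrives first
    boolean retried = master.retryIfStillPending(
        region, () -> System.out.println("forced unassign"));
    System.out.println("retried=" + retried);
  }
}
```

With the ordering used in main, the retry is skipped because the region has already left RIT, so no unassigned node would be recreated for a region that is already closed.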

[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-11 Thread ramkrishna.s.vasudevan (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Attachment: HBASE-5120_5.patch

Changed debug to error.

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch



[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-11 Thread ramkrishna.s.vasudevan (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5120:
--

Status: Open  (was: Patch Available)

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-11 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184227#comment-13184227
]

ramkrishna.s.vasudevan commented on HBASE-5179:
---

Patch looks good to me. I will try it out on the cluster tomorrow.

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Commented] (HBASE-5150) Fail in a thread may not fail a test, clean up log splitting test

2012-01-11 Thread Jimmy Xiang (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184234#comment-13184234
 ] 

Jimmy Xiang commented on HBASE-5150:


@Prakash and Ted, are you OK with this patch? I changed the 3-second wait time 
to 2 seconds.

 Fail in a thread may not fail a test, clean up log splitting test
 -

 Key: HBASE-5150
 URL: https://issues.apache.org/jira/browse/HBASE-5150
 Project: HBase
  Issue Type: Test
Affects Versions: 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
 Attachments: hbase-5150.txt, hbase_5150_v3.patch


 This is to clean up some tests for HBASE-5081.  Calling Assert.fail in a 
 separate thread will terminate that thread, but may not fail the test.
 We can use a Callable instead, so that the error surfaces when the test 
 thread gets the result. Some documentation explaining the test would be 
 helpful too.
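
A minimal sketch of the Callable idea, using only java.util.concurrent and no test framework. The class and method names are made up for illustration and this is not the code from the attached patch. An AssertionError thrown on the worker thread is captured by the Future and comes back to the calling thread wrapped in an ExecutionException, where a test can rethrow it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadFailureSketch {
  /**
   * Runs the check on a worker thread and reports whether an AssertionError
   * thrown there was delivered back to the calling thread.
   */
  static boolean failureReachesCaller(Callable<Void> check) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<Void> result = pool.submit(check);
      result.get();  // blocks; rethrows whatever the worker threw, wrapped
      return false;  // the worker finished without throwing
    } catch (ExecutionException e) {
      // getCause() is the original Throwable from the worker thread.
      return e.getCause() instanceof AssertionError;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    Callable<Void> failing = () -> {
      throw new AssertionError("simulated Assert.fail in a worker thread");
    };
    System.out.println(failureReachesCaller(failing));
  }
}
```

In a real test the catch block would rethrow e.getCause() instead of returning a boolean, so that the test itself fails.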

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Attachment: 5179-90.txt

Chunhui's patch rebased for 0.90

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 is assigning regions, ServerShutdownHandler may still be 
 splitting the log, so data loss may result.
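
One way to picture the ordering constraint behind steps 1 to 4 is a gate that the assignment path consults before touching the regions of a dead server. The sketch below is purely illustrative and is not what the attached patches do; the class name, the method names, and the server name in main are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical gate: tracks dead servers whose logs are still being split.
final class LogSplitGateSketch {
  private final Set<String> serversBeingSplit = ConcurrentHashMap.newKeySet();

  /** Called when shutdown handling starts for a dead server. */
  void shutdownHandlerStarted(String server) {
    serversBeingSplit.add(server);
  }

  /** Called once log splitting for that server has completed. */
  void logSplitFinished(String server) {
    serversBeingSplit.remove(server);
  }

  /** Failover path: regions may be assigned only after the split is done. */
  boolean safeToAssignRegionsOf(String server) {
    return !serversBeingSplit.contains(server);
  }

  public static void main(String[] args) {
    LogSplitGateSketch gate = new LogSplitGateSketch();
    String server = "RegionserverA";
    gate.shutdownHandlerStarted(server);
    System.out.println(gate.safeToAssignRegionsOf(server));  // still splitting
    gate.logSplitFinished(server);
    System.out.println(gate.safeToAssignRegionsOf(server));  // split complete
  }
}
```

The point is only that step 4 needs to observe the state of step 2 before assigning; how the master actually tracks that state is up to the patch.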


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184239#comment-13184239
]

Hadoop QA commented on HBASE-5179:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510206/5179-v2.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 javadoc. The javadoc tool appears to have generated -147 warning
messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

-1 findbugs. The patch appears to introduce 78 new Findbugs (version
1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.master.TestSplitLogManager
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
org.apache.hadoop.hbase.client.TestAdmin
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/730//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/730//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/730//console

This message is automatically generated.

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184240#comment-13184240
]

Hadoop QA commented on HBASE-5179:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510215/5179-90.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/732//console

This message is automatically generated.

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Zhihong Yu updated HBASE-5179:
--

Comment: was deleted

(was: -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510215/5179-90.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/732//console

This message is automatically generated.)

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184244#comment-13184244
 ] 

Zhihong Yu commented on HBASE-5179:
---

I ran the following on MacBook and they passed:
{code}
 1143  mt -Dtest=TestSplitLogManager
 1145  mt -Dtest=TestAdmin#testShouldCloseTheRegionBasedOnTheEncodedRegionName
{code}

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch




[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler


[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184251#comment-13184251
 ] 

Zhihong Yu commented on HBASE-5120:
---

+1 on patch v5.

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Assignee: ramkrishna.s.vasudevan
Priority: Blocker
 Fix For: 0.94.0, 0.92.1

 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
 HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch



[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler

[
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184258#comment-13184258
]

Hadoop QA commented on HBASE-5120:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510211/HBASE-5120_5.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 javadoc. The javadoc tool appears to have generated -147 warning
messages.

+1 javac. The applied patch does not increase the total number of javac
compiler warnings.

-1 findbugs. The patch appears to introduce 79 new Findbugs (version
1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of
release audit warnings.

Test results:
https://builds.apache.org/job/PreCommit-HBASE-Build/731//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-HBASE-Build/731//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/731//console

This message is automatically generated.

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Assignee: ramkrishna.s.vasudevan
Priority: Blocker
Fix For: 0.94.0, 0.92.1

Attachments: HBASE-5120.patch, HBASE-5120_1.patch,
HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch

[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol

[
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184264#comment-13184264
]

Zhihong Yu commented on HBASE-5139:
---

I am going to integrate patch v2 if there is no objection.

Compute (weighted) median using AggregateProtocol
-

Key: HBASE-5139
URL: https://issues.apache.org/jira/browse/HBASE-5139
Project: HBase
Issue Type: Sub-task
Reporter: Zhihong Yu
Assignee: Zhihong Yu
Attachments: 5139-v2.txt

Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights.
This task finds out the median value among the values of cf:cq1 (See
http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
This can be done in two passes.
The first pass utilizes AggregateProtocol where the following tuple is
returned from each region:
(partial-sum-of-values, partial-sum-of-weights)
The start rowkey (supplied by coprocessor framework) would be used to sort
the tuples. This way we can determine which region (called R) contains the
(weighted) median. partial-sum-of-weights can be 0 if an unweighted median is
sought.
The second pass involves scanning the table, beginning with startrow of
region R and computing partial (weighted) sum until the threshold of S/2 is
crossed. The (weighted) median is returned.
However, this approach wouldn't work if there is mutation in the underlying
table between pass one and pass two.
In that case, sequential scanning seems to be the solution which is slower
than the above approach.
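
The two passes described above can be sketched in plain Java. This is only an in-memory model under stated assumptions: the class and method names are made up for illustration, no HBase API is called, S is taken to be the total weight (the description does not define it), an unweighted median is modelled by giving every row weight 1, and the running sum is accumulated in row-key order.

```java
import java.util.List;

/** Hypothetical in-memory model of the two-pass idea; it does not call HBase. */
final class WeightedMedianSketch {

    /** One row: the value from cf:cq1 and its weight from cf:cq2 (1.0 if unweighted). */
    static final class Row {
        final double value;
        final double weight;
        Row(double value, double weight) { this.value = value; this.weight = weight; }
    }

    /** regions must be ordered by start row key, rows within each region by row key. */
    static double median(List<List<Row>> regions) {
        // Pass 1: one partial sum of weights per region (what each region would return).
        double[] partialWeights = new double[regions.size()];
        double total = 0;
        for (int r = 0; r < regions.size(); r++) {
            for (Row row : regions.get(r)) partialWeights[r] += row.weight;
            total += partialWeights[r];
        }
        if (total <= 0) throw new IllegalArgumentException("total weight must be positive");
        double half = total / 2;

        // Locate region R: the first region where the running weight reaches S/2.
        double before = 0;
        int regionR = 0;
        while (before + partialWeights[regionR] < half) {
            before += partialWeights[regionR];
            regionR++;
        }

        // Pass 2: scan only region R, starting from the weight accumulated before it.
        double running = before;
        for (Row row : regions.get(regionR)) {
            running += row.weight;
            if (running >= half) return row.value;
        }
        // Reached only if the data no longer matches what pass 1 saw.
        throw new IllegalStateException("table changed between the two passes");
    }
}
```

For two regions holding the rows (1,1), (2,1) and (3,1), (4,1), (5,1), pass 1 selects the second region and pass 2 returns 3. The final exception stands in for the mutation problem noted above: if rows change between the passes, the running sum may never reach S/2 inside region R.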

[jira] [Commented] (HBASE-4224) Need a flush by regionserver rather than by table option

2012-01-11 Thread Harsh J (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184281#comment-13184281
 ] 

Harsh J commented on HBASE-4224:


[Dropping by from the dev lists…, have not followed otherwise]

I'd certainly like reading flushAllRegions() over flushRegions(null). Can we 
not also have it as a utility function in HRServer instead of HRI/f, if 
changing the interface is much to be worried about?

 Need a flush by regionserver rather than by table option
 

 Key: HBASE-4224
 URL: https://issues.apache.org/jira/browse/HBASE-4224
 Project: HBase
  Issue Type: Bug
  Components: shell
Reporter: stack
Assignee: Akash Ashok
 Attachments: HBase-4224-v2.patch, HBase-4224.patch


 This evening I needed to clean out logs on the cluster.  Logs are kept per 
 regionserver.  To let go of logs, we need to have all edits emptied from 
 memory, but the only flush is by table or region.  We need to be able to 
 flush the regionserver.  Need to add this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation

2012-01-11 Thread Sujee Maniyam (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184284#comment-13184284
]

Sujee Maniyam commented on HBASE-4440:
--

so you are proposing that

1) whether we use presplit option or not, table has to be recreated for all
write-mode tests.

This changes the behavior for all write-tests. Currently table is only
created if it doesn't exist.

2) or pre-split should try to split the table without re-creating it.

add an option to presplit table to PerformanceEvaluation

Key: HBASE-4440
URL: https://issues.apache.org/jira/browse/HBASE-4440
Project: HBase
Issue Type: Improvement
Components: util
Reporter: Sujee Maniyam
Assignee: Sujee Maniyam
Priority: Minor
Labels: benchmark
Fix For: 0.94.0

Attachments: PerformanceEvaluation.java,
PerformanceEvaluation_HBASE_4440.patch,
PerformanceEvaluation_HBASE_4440_2.patch

PerformanceEvaluation is a quick way to 'benchmark' an HBase cluster. The
current 'write*' operations do not pre-split the table. Pre-splitting the
table will really boost the insert performance.
It would be nice to have an option to enable pre-splitting the table before the
inserts begin.
It would look something like:
(a) hbase ...PerformanceEvaluation --presplit=10 other options
(b) hbase ...PerformanceEvaluation --presplit other options
(b) will try to presplit the table on some default value (say number of
region servers)

[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184287#comment-13184287
 ] 

stack commented on HBASE-5179:
--

It's hard to do a test for this?

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 is assigning regions, ServerShutdownHandler may still be 
 splitting the log, so data loss is possible.
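
 One way to picture closing the window in steps 1 to 4 is for the master to remember which dead servers ServerShutdownHandler is still working on, and to leave their regions alone until log splitting has finished. The sketch below is a hypothetical model of that idea, not the attached patch; the class, the method names and the use of plain strings for server names are all assumptions.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker of dead servers whose logs may still be splitting. */
final class DeadServerTrackerSketch {
    // Thread-safe set, since failover and shutdown handling run concurrently.
    private final Set<String> underProcessing = ConcurrentHashMap.newKeySet();

    /** Called when shutdown handling (including log splitting) starts for a server. */
    void startProcessing(String serverName) { underProcessing.add(serverName); }

    /** Called only after log splitting for the server has completed. */
    void finishProcessing(String serverName) { underProcessing.remove(serverName); }

    /** Failover should skip assignment while this returns false. */
    boolean safeToAssignFrom(String serverName) {
        return !underProcessing.contains(serverName);
    }
}
```

 In this model the failover path would call safeToAssignFrom("RegionserverA") in step 4 and skip the assignment while shutdown handling is still in flight.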


[jira] [Commented] (HBASE-3565) Add metrics to keep track of slow HLog appends


[ 
https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184288#comment-13184288
 ] 

Zhihong Yu commented on HBASE-3565:
---

Integrated to TRUNK.

Thanks for the patch Mubarak.

Thanks for the review, Stack.

 Add metrics to keep track of slow HLog appends
 --

 Key: HBASE-3565
 URL: https://issues.apache.org/jira/browse/HBASE-3565
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
  Labels: monitoring
 Fix For: 0.94.0

 Attachments: HBASE-3565.trunk.v1.patch


 Whenever an edit takes too long to be written to an HLog, HBase logs a 
 warning such as this one:
 {code}
 2011-02-23 20:03:14,703 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
 IPC Server handler 21 on 60020 took 15065ms appending an edit to hlog; 
 editcount=126050
 {code}
 I would like to have a counter incremented each time this happens and this 
 counter exposed via the metrics stuff in HBase so I can collect it in my 
 monitoring system.
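
 A minimal sketch of the kind of counter being asked for, assuming a fixed slowness threshold in milliseconds. The class name, method names and threshold parameter are hypothetical and are not HBase's metrics API.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper; names and threshold are illustrative only. */
final class SlowAppendCounter {
    private final long thresholdMs;
    private final AtomicLong slowAppends = new AtomicLong();

    SlowAppendCounter(long thresholdMs) { this.thresholdMs = thresholdMs; }

    /** Call once per append with its elapsed time; returns true if it counted as slow. */
    boolean record(long elapsedMs) {
        if (elapsedMs <= thresholdMs) return false;
        slowAppends.incrementAndGet();
        return true;
    }

    /** Value a metrics reporter could poll and export to a monitoring system. */
    long getSlowAppendCount() { return slowAppends.get(); }
}
```

 The append path would call record(elapsedMs) at the point where the warning above is logged, so the counter and the log line stay in step.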


[jira] [Updated] (HBASE-3565) Add metrics to keep track of slow HLog appends

2012-01-11 Thread Zhihong Yu (Issue Comment Edited) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-3565:
--

Summary: Add metrics to keep track of slow HLog appends  (was: Add a metric 
to keep track of slow HLog appends)

 Add metrics to keep track of slow HLog appends
 --

 Key: HBASE-3565
 URL: https://issues.apache.org/jira/browse/HBASE-3565
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
  Labels: monitoring
 Fix For: 0.94.0

 Attachments: HBASE-3565.trunk.v1.patch



[jira] [Issue Comment Edited] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184286#comment-13184286
 ] 

Zhihong Yu edited comment on HBASE-5179 at 1/11/12 7:07 PM:


I agree with the spirit of this class.  Good stuff Chunhui.

This is an awkward name for a method, getDeadServersUnderProcessing.  Should it 
be getDeadServers?  Does it need to be a public method?  Seems fine that it be 
package private.

Is serversWithoutSplitLog a good name for a local variable?  Should it be 
deadServers with a comment saying that deadServers are processed by 
ServerShutdownHandler and it will be taking care of the log splitting?

Is this right -- for trunk?

{code}
-  } else if (!serverManager.isServerOnline(regionLocation.getServerName())) {
+  } else if (!onlineServers.contains(regionLocation.getHostname())) {
{code}
Online servers is keyed by a ServerName, not a hostname.

What is a deadServersUnderProcessing?  Does DeadServers keep a list of all 
servers that ever died?  Is that a good idea?  Shouldn't finish remove the item 
from deadservers rather than just from deadServersUnderProcessing?

Change the name of this method, cloneProcessingDeadServers.  Just call it 
getDeadServers?  That it's a clone is an internal implementation detail?






  was (Author: stack):
I agree with the spirit of this class.  Good stuff Chunhui.

This is awkward name for a method, getDeadServersUnderProcessing.  Should it be 
getDeadServers?  Does it need to be a public method?  Seems fine that it be 
package private.

Is serversWithoutSplitLog a good name for a local variable?  Should it be 
deadServers with a comment saying that deadServers are processed by 
servershutdownhandler and it will be taking care of the log splitting?

Is this right -- for trunk?

{code}
-  } else if 
(!serverManager.isServerOnline(regionLocation.getServerName())) {
+  } else if (!onlineServers.contains(regionLocation.getHostname())) {

Online servers is keyed by a ServerName, not a hostname.

What is a deadServersUnderProcessing?  Does DeadServers keep list of all 
servers that ever died?  Is that a good idea?  Shouldn't finish remove item 
from deadservers rather than just from deadServersUnderProcessing

Change  name of this method, cloneProcessingDeadServers.  Just call it 
getDeadServers?  That its a clone is an internal implementation detail?





  
 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch



[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation

2012-01-11 Thread Jean-Daniel Cryans (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184294#comment-13184294
 ] 

Jean-Daniel Cryans commented on HBASE-4440:
---

bq. whether we use presplit option or not, table has to be recreated for all 
write-mode tests.

No, it shouldn't be different from the default behavior of not recreating the 
table.

bq. or pre-split should try to split the table without re-creating it.

It should not.

Code speaks more than words, here's what I'm using for testing 0.92:

{code}
  private boolean checkTable(HBaseAdmin admin) throws IOException {
    HTableDescriptor tableDescriptor = getTableDescriptor();
    boolean tableExists = admin.tableExists(tableDescriptor.getName());
    if (!tableExists) {
      if (this.presplitRegions > 0) {
        byte[][] splits = getSplits();
        for (int i = 0; i < splits.length; i++) {
          LOG.debug(" split " + i + ": " + Bytes.toStringBinary(splits[i]));
        }
        admin.createTable(tableDescriptor, splits);
        LOG.info("Table created with " + this.presplitRegions + " splits");
      } else {
        admin.createTable(tableDescriptor);
        LOG.info("Table " + tableDescriptor + " created");
      }
    }
    return !tableExists;
  }
{code}

 add an option to presplit table to PerformanceEvaluation
 

 Key: HBASE-4440
 URL: https://issues.apache.org/jira/browse/HBASE-4440
 Project: HBase
  Issue Type: Improvement
  Components: util
Reporter: Sujee Maniyam
Assignee: Sujee Maniyam
Priority: Minor
  Labels: benchmark
 Fix For: 0.94.0

 Attachments: PerformanceEvaluation.java, 
 PerformanceEvaluation_HBASE_4440.patch, 
 PerformanceEvaluation_HBASE_4440_2.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184296#comment-13184296
 ] 

Zhihong Yu commented on HBASE-5179:
---

@Stack:
The following code is for 0.90 branch:
{code}
-  } else if (!serverManager.isServerOnline(regionLocation.getServerName())) {
+  } else if (!onlineServers.contains(regionLocation.getHostname())) {
{code}

I agree that serversWithoutSplitLog isn't a very good name. It holds both 
online servers and dead servers. How about naming it knownServers ?

ServerManager.java already has:
{code}
  public Set<ServerName> getDeadServers() {
    return this.deadservers.clone();
  }
{code}

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184303#comment-13184303
 ] 

Zhihong Yu commented on HBASE-5179:
---

TestRollingRestart fails in 0.90 with patch.

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch



[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation

2012-01-11 Thread Sujee Maniyam (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184309#comment-13184309
]

Sujee Maniyam commented on HBASE-4440:
--

I see. Looks good.
If the table exists and the presplit option is supplied, it will have no effect.
It might mislead the user into believing the pre-split option took effect, while
in fact it didn't.
Maybe a WARN would suffice to notify the user?

add an option to presplit table to PerformanceEvaluation

Attachments: PerformanceEvaluation.java,
PerformanceEvaluation_HBASE_4440.patch,
PerformanceEvaluation_HBASE_4440_2.patch

[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation

2012-01-11 Thread Jean-Daniel Cryans (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184311#comment-13184311
 ] 

Jean-Daniel Cryans commented on HBASE-4440:
---

We could show a WARN, but I don't think we would need more than that. In fact, 
we could always show a message when the table exists saying something like: 
"Using the existing ${tablename} which has ${X} regions." 

About the pre-splitting itself, it seems that it creates N+1 regions and the 
first one has the end key 00 so it never gets data. Not a biggie, but 
could be fixed in another jira.
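
Since createTable(tableDescriptor, splits) yields one more region than there are split keys, N regions need only N-1 keys. The sketch below shows one way to compute them. It is a hypothetical helper, not the attached patch: the class and method names are made up, and it assumes row keys are zero-padded 10-digit decimal strings spread evenly over [0, totalRows).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for computing evenly spaced split keys. */
final class PresplitKeysSketch {

    /** Returns regions-1 split keys that cut [0, totalRows) into equal ranges. */
    static byte[][] splitKeys(int regions, int totalRows) {
        if (regions < 1) throw new IllegalArgumentException("regions must be >= 1");
        byte[][] keys = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            long boundary = (long) i * totalRows / regions;
            // Zero-padded so that byte-wise ordering matches numeric ordering.
            keys[i - 1] = String.format("%010d", boundary).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }
}
```

Starting the loop at 1 rather than 0 is what avoids an empty leading region: no key is emitted at the very start of the key space.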

 add an option to presplit table to PerformanceEvaluation
 

 Key: HBASE-4440
 URL: https://issues.apache.org/jira/browse/HBASE-4440
 Project: HBase
  Issue Type: Improvement
  Components: util
Reporter: Sujee Maniyam
Assignee: Sujee Maniyam
Priority: Minor
  Labels: benchmark
 Fix For: 0.94.0

 Attachments: PerformanceEvaluation.java, 
 PerformanceEvaluation_HBASE_4440.patch, 
 PerformanceEvaluation_HBASE_4440_2.patch



[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol

[
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184320#comment-13184320
]

Zhihong Yu commented on HBASE-5139:
---

Integrated to TRUNK.

Compute (weighted) median using AggregateProtocol
-

Key: HBASE-5139
URL: https://issues.apache.org/jira/browse/HBASE-5139
Project: HBase
Issue Type: Sub-task
Reporter: Zhihong Yu
Assignee: Zhihong Yu
Attachments: 5139-v2.txt

[jira] [Commented] (HBASE-3565) Add metrics to keep track of slow HLog appends


[ 
https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184325#comment-13184325
 ] 

Hudson commented on HBASE-3565:
---

Integrated in HBase-TRUNK #2619 (See 
[https://builds.apache.org/job/HBase-TRUNK/2619/])
HBASE-3565 Add metrics to keep track of slow HLog appends (Mubarak)

tedyu : 
Files : 
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/RegionServerMetrics.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java


 Add metrics to keep track of slow HLog appends
 --

 Key: HBASE-3565
 URL: https://issues.apache.org/jira/browse/HBASE-3565
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
  Labels: monitoring
 Fix For: 0.94.0

 Attachments: HBASE-3565.trunk.v1.patch



[jira] [Updated] (HBASE-3565) Add metrics to keep track of slow HLog appends


 [ 
https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-3565:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Add metrics to keep track of slow HLog appends
 --

 Key: HBASE-3565
 URL: https://issues.apache.org/jira/browse/HBASE-3565
 Project: HBase
  Issue Type: Improvement
  Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
  Labels: monitoring
 Fix For: 0.94.0

 Attachments: HBASE-3565.trunk.v1.patch



[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol


[ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184367#comment-13184367
 ] 

Hudson commented on HBASE-5139:
---

Integrated in HBase-TRUNK-security #73 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/73/])
HBASE-5139 Compute (weighted) median using AggregateProtocol

tedyu : 
Files : 
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateProtocol.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/coprocessor/TestAggregateProtocol.java


 Compute (weighted) median using AggregateProtocol
 -

 Key: HBASE-5139
 URL: https://issues.apache.org/jira/browse/HBASE-5139
 Project: HBase
  Issue Type: Sub-task
Reporter: Zhihong Yu
Assignee: Zhihong Yu
 Attachments: 5139-v2.txt



[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)


[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184366#comment-13184366
 ] 

Hudson commented on HBASE-5163:
---

Integrated in HBase-TRUNK-security #73 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/73/])
HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on 
Jenkins or hadoop QA (The directory is already locked.) (N Keywal)

tedyu : 
Files : 
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java


 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or 
 hadoop QA (The directory is already locked.)
 --

 Key: HBASE-5163
 URL: https://issues.apache.org/jira/browse/HBASE-5163
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5163.patch


 The stack is typically:
 {noformat}
 error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException"
 java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
   at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
   at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
   at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
   at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
   at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
 // ...
 {noformat}
 It can be reproduced without parallelization or without executing the other 
 tests in the class. It seems to fail about 5% of the time.
 This comes from the naming policy for the directories in 
 MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
 in the cluster, and does not take into account previous starts/stops:
 {noformat}
 for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
   if (manageDfsDirs) {
     File dir1 = new File(data_dir, "data"+(2*i+1));
     File dir2 = new File(data_dir, "data"+(2*i+2));
     dir1.mkdirs();
     dir2.mkdirs();
   // [...]
 {noformat}
 This means that if we want to stop/start a datanode, we should always stop 
 the last one; otherwise the names will conflict. This test exhibits the behavior:
 {noformat}
   @Test
   public void testMiniDFSCluster_startDataNode() throws Exception {
     assertTrue( dfsCluster.getDataNodes().size() == 2 );
     // Works, as we kill the last datanode, we can now start a datanode
     dfsCluster.stopDataNode(1);
     dfsCluster
       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
     // Fails, as it's not the last datanode, the directory will conflict on
     //  creation
     dfsCluster.stopDataNode(0);
     try {
       dfsCluster
         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
       fail("There should be an exception because the directory already exists");
     } catch (IOException e) {
       assertTrue( e.getMessage().contains("The directory is already locked."));
       LOG.info("Expected (!) exception caught " + e.getMessage());
     }
     // Works, as we kill the last datanode, we can now restart 2 datanodes
     // This makes us back with 2 nodes
     dfsCluster.stopDataNode(0);
     dfsCluster
       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
   }
 {noformat}
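
 The naming conflict can also be reproduced without a cluster. The sketch below is a hypothetical in-memory model of the quoted naming policy: it does not use MiniDFSCluster, the class and method names are made up, and a set of integers stands in for the locked data directories.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the directory naming policy; not MiniDFSCluster itself. */
final class DataDirNamingSketch {
    // Directory numbers held by each live node, in start order.
    private final List<int[]> running = new ArrayList<>();
    private final Set<Integer> locked = new HashSet<>();

    /** Mirrors the quoted loop: slot i gets data(2*i+1) and data(2*i+2). */
    void startDataNodes(int count) {
        int cur = running.size();
        for (int i = cur; i < cur + count; i++) {
            int d1 = 2 * i + 1;
            int d2 = 2 * i + 2;
            if (locked.contains(d1) || locked.contains(d2)) {
                throw new IllegalStateException(
                    "a directory for slot " + i + " is already locked");
            }
            locked.add(d1);
            locked.add(d2);
            running.add(new int[] {d1, d2});
        }
    }

    /** Stops the node at the given index and releases its two directories. */
    void stopDataNode(int index) {
        int[] dirs = running.remove(index);
        locked.remove(dirs[0]);
        locked.remove(dirs[1]);
    }
}
```

 Starting two nodes, stopping node 0 and then starting one more makes the model throw: the new node is given slot 1, whose data3 and data4 are still held by the surviving node, which matches the data3 path in the stack trace.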
 And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
 because when we do
 {noformat}
 DatanodeInfo[] pipeline = getPipeline(log);
 assertTrue(pipeline.length == fs.getDefaultReplication());
 {noformat}
 and then kill the datanodes in the pipeline, we will have:
  - most of the time: pipeline = 1  2, so

[jira] [Commented] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry


[ 
https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184377#comment-13184377
 ] 

Zhihong Yu commented on HBASE-5136:
---

Can someone review the patch ?

Thanks

 Redundant MonitoredTask instances in case of distributed log splitting retry
 

 Key: HBASE-5136
 URL: https://issues.apache.org/jira/browse/HBASE-5136
 Project: HBase
  Issue Type: Task
Reporter: Zhihong Yu
Assignee: Zhihong Yu
 Attachments: 5136.txt


 In case of log splitting retry, the following code would be executed multiple 
 times:
 {code}
   public long splitLogDistributed(final List<Path> logDirs) throws IOException {
     MonitoredTask status = TaskMonitor.get().createStatus(
       "Doing distributed log split in " + logDirs);
 {code}
 leading to multiple MonitoredTask instances.
 Users may be confused by multiple distributed log splitting entries for the 
 same region server on the master UI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol

2012-01-11 Thread jirapos...@reviews.apache.org (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184393#comment-13184393
 ] 

Hudson commented on HBASE-5139:
---

Integrated in HBase-TRUNK #2620 (See 
[https://builds.apache.org/job/HBase-TRUNK/2620/])
HBASE-5139 Compute (weighted) median using AggregateProtocol

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateProtocol.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/coprocessor/TestAggregateProtocol.java


 Compute (weighted) median using AggregateProtocol
 -

 Key: HBASE-5139
 URL: https://issues.apache.org/jira/browse/HBASE-5139
 Project: HBase
  Issue Type: Sub-task
Reporter: Zhihong Yu
Assignee: Zhihong Yu
 Attachments: 5139-v2.txt


 Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. 
 This task finds out the median value among the values of cf:cq1 (See 
 http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
 This can be done in two passes.
 The first pass utilizes AggregateProtocol where the following tuple is 
 returned from each region:
 (partial-sum-of-values, partial-sum-of-weights)
 The start rowkey (supplied by coprocessor framework) would be used to sort 
 the tuples. This way we can determine which region (called R) contains the 
 (weighted) median. partial-sum-of-weights can be 0 if unweighted median is 
 sought
 The second pass involves scanning the table, beginning with startrow of 
 region R and computing partial (weighted) sum until the threshold of S/2 is 
 crossed. The (weighted) median is returned.
 However, this approach wouldn't work if the underlying table is mutated 
 between pass one and pass two.
 In that case, sequential scanning seems to be the solution, although it is 
 slower than the above approach.
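
 The two passes can be sketched in memory as follows. This is only an illustration under stated assumptions, not the AggregateProtocol API: each inner list stands in for one region, regions are listed in start-rowkey order, rows within them are assumed to be in ascending value order, and S is taken to be the total of the weights. The names WeightedMedianSketch, Row, partialWeightSums and weightedMedian are hypothetical.

```java
import java.util.List;

// Hypothetical in-memory model; lists stand in for regions of a table.
class WeightedMedianSketch {

    /** One row: a value (as in cf:cq1) and its weight (as in cf:cq2). */
    static final class Row {
        final double value;
        final double weight;

        Row(double value, double weight) {
            this.value = value;
            this.weight = weight;
        }
    }

    /** Pass one: every region reports the sum of its weights. */
    static double[] partialWeightSums(List<List<Row>> regions) {
        double[] sums = new double[regions.size()];
        for (int r = 0; r < regions.size(); r++) {
            for (Row row : regions.get(r)) {
                sums[r] += row.weight;
            }
        }
        return sums;
    }

    static double weightedMedian(List<List<Row>> regions) {
        double[] sums = partialWeightSums(regions);
        double total = 0;
        for (double s : sums) {
            total += s;
        }
        double half = total / 2; // the S/2 threshold (S assumed to be total weight)

        // Use the partial sums to find region R, the first region in which
        // the cumulative weight reaches the threshold.
        double before = 0;
        int target = 0;
        while (target < sums.length - 1 && before + sums[target] < half) {
            before += sums[target];
            target++;
        }

        // Pass two: scan from the start of region R until the threshold is crossed.
        double running = before;
        for (int r = target; r < regions.size(); r++) {
            for (Row row : regions.get(r)) {
                running += row.weight;
                if (running >= half) {
                    return row.value;
                }
            }
        }
        throw new IllegalStateException("no rows to aggregate");
    }
}
```

 The sketch also shows why a mutation between the passes is a problem: pass two trusts the partial sums gathered in pass one when it skips the regions before R.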


[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.


[ 
https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184400#comment-13184400
 ] 

jirapos...@reviews.apache.org commented on HBASE-5128:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3435/#review4317
---



src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
https://reviews.apache.org/r/3435/#comment9714

Should be 'to end key'.



src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
https://reviews.apache.org/r/3435/#comment9715

Should insert some text between newRegion and region.



src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
https://reviews.apache.org/r/3435/#comment9716

This should be outside the for loop.



src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
https://reviews.apache.org/r/3435/#comment9717

Space between  and 0.


- Ted


On 2012-01-11 12:46:37, jmhsieh wrote:
bq.  
bq.  ---
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3435/
bq.  ---
bq.  
bq.  (Updated 2012-01-11 12:46:37)
bq.  
bq.  
bq.  Review request for hbase, Todd Lipcon, Ted Yu, Michael Stack, and 
Jean-Daniel Cryans.
bq.  
bq.  
bq.  Summary
bq.  ---
bq.  
bq.  I'm posting a preliminary version that I'm currently testing on real 
clusters. The tests are flaky on the 0.90 branch (so there is something async 
that I didn't synchronize properly), and there are a few more TODO's I want to 
knock out before this is ready for full review to be considered for committing. 
It's got some problems I need some advice figuring out.
bq.  
bq.  Problem 1:
bq.  
bq.  In the unit tests, I have a few cases where I fabricate new regions and 
try to force the overlapping regions to be closed. For some of these, I cannot 
delete a table after it is repaired without causing subsequent tests to fail. I 
think this is due to a few things:
bq.  
bq.  1) The disable table handler uses in-memory assignment manager state while 
delete uses the assignment information in META.
bq.  2) Currently I'm using the sneaky closeRegion that purposely doesn't go 
through the master and in turn doesn't modify in-memory state – disable uses 
out-of-date in-memory region assignments. If I use the unassign method instead, 
it sends RIT transitions to the master, which ends up attempting to assign the 
region again, causing timing/transient states.
bq.  
bq.  What is a good way to clear the HMaster's assignment manager's assignment 
data for particular regions or to force it to re-read from META (without 
modifying the 0.90 HBase installations it is meant to repair)?
bq.  
bq.  Problem 2:
bq.  
bq.  Sometimes tests fail reporting HOLE_IN_REGION_CHAIN and 
SERVER_DOES_NOT_MATCH_META. This means the old and new regions are confused 
with each other and basically something is still happening asynchronously. I 
think this is because the new region is being assigned and is still 
transitioning. Sound about right? To make the unit test deterministic, should 
hbck wait for these to settle or should just the unit test wait?
bq.  
bq.  
bq.  This addresses bug HBASE-5128.
bq.  https://issues.apache.org/jira/browse/HBASE-5128
bq.  
bq.  
bq.  Diffs
bq.  -
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 6d3401d 
bq.    src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java a3d8b8b 
bq.    src/main/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRepair.java 29e8bb2 
bq.    src/main/java/org/apache/hadoop/hbase/util/hbck/TableIntegrityErrorHandler.java PRE-CREATION 
bq.    src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java a640d57 
bq.    src/test/java/org/apache/hadoop/hbase/util/hbck/HbckTestingUtil.java dbb97f8 
bq.    src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildBase.java 3e8729d 
bq.    src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildHole.java 11a1151 
bq.    src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildOverlap.java 4a09ce2 
bq.  
bq.  Diff: https://reviews.apache.org/r/3435/diff
bq.  
bq.  
bq.  Testing
bq.  ---
bq.  
bq.  All unit tests pass sometimes.  Some fail sometimes (generally the cases 
that fabricate new regions).  
bq.  
bq.  Not ready for commit.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  jmhsieh
bq.  
bq.



 [uber hbck] Enable hbck to automatically repair table integrity problems as 
 well as region consistency problems while online.
 -

 Key: HBASE-5128
 URL:

[jira] [Updated] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.


 [ 
https://issues.apache.org/jira/browse/HBASE-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5167:
-

  Resolution: Fixed
Assignee: Harsh J
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Committed trunk.  Thanks Harsh.

 We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing 
 that.
 --

 Key: HBASE-5167
 URL: https://issues.apache.org/jira/browse/HBASE-5167
 Project: HBase
  Issue Type: Improvement
  Components: scripts
Affects Versions: 0.92.0
Reporter: Harsh J
Assignee: Harsh J
Priority: Trivial
 Fix For: 0.94.0

 Attachments: HBASE-5167.patch


 HBASE-4209 changed the behavior of the scripts such that we no longer kill the 
 daemons right away. We should have also changed the message shown in the 
 logs.


[jira] [Commented] (HBASE-5168) Backport HBASE-5100 - Rollback of split could cause closed region to be opened again


[ 
https://issues.apache.org/jira/browse/HBASE-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184424#comment-13184424
 ] 

stack commented on HBASE-5168:
--

+1

 Backport HBASE-5100 - Rollback of split could cause closed region to be 
 opened again
 

 Key: HBASE-5168
 URL: https://issues.apache.org/jira/browse/HBASE-5168
 Project: HBase
  Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
 Attachments: HBASE-5100_0.90.patch


 Considering the importance of the defect, merging it to 0.90.6.


[jira] [Created] (HBASE-5180) [book] book.xml - fixed scanner example

2012-01-11 Thread Doug Meil (Created) (JIRA)

[book] book.xml - fixed scanner example
---

 Key: HBASE-5180
 URL: https://issues.apache.org/jira/browse/HBASE-5180
 Project: HBase
  Issue Type: Bug
Reporter: Doug Meil
Assignee: Doug Meil
 Attachments: book_HBASE_5180.xml.patch

book.xml - the scanner example wasn't closing the scanner!  that's bad practice.
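
The pattern the fix is about can be sketched with a stand-in class. This is an illustration only, assuming a scanner that is both iterable and closeable; FakeScanner and countRows are hypothetical names, and the patched book.xml example itself is not reproduced here.

```java
import java.io.Closeable;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in; it models only "iterate, then always close".
class ScannerCloseSketch {

    /** Stand-in for a scanner that holds resources until it is closed. */
    static final class FakeScanner implements Closeable, Iterable<String> {
        private final List<String> rows;
        boolean closed = false;

        FakeScanner(List<String> rows) {
            this.rows = rows;
        }

        @Override
        public Iterator<String> iterator() {
            return rows.iterator();
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Counts rows and closes the scanner even if row processing throws. */
    static int countRows(FakeScanner scanner) {
        int count = 0;
        try {
            for (String row : scanner) {
                if (row.isEmpty()) {
                    throw new IllegalStateException("empty row");
                }
                count++;
            }
        } finally {
            scanner.close(); // runs on both the normal and the exceptional path
        }
        return count;
    }
}
```

Putting close() in a finally block means the scanner is released whether the loop finishes or fails partway through.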


[jira] [Updated] (HBASE-5180) [book] book.xml - fixed scanner example


 [ 
https://issues.apache.org/jira/browse/HBASE-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Meil updated HBASE-5180:
-

Status: Patch Available  (was: Open)

 [book] book.xml - fixed scanner example
 ---

 Key: HBASE-5180
 URL: https://issues.apache.org/jira/browse/HBASE-5180
 Project: HBase
  Issue Type: Bug
Reporter: Doug Meil
Assignee: Doug Meil
 Attachments: book_HBASE_5180.xml.patch


 book.xml - the scanner example wasn't closing the scanner!  that's bad 
 practice.


[jira] [Updated] (HBASE-5180) [book] book.xml - fixed scanner example


 [ 
https://issues.apache.org/jira/browse/HBASE-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Meil updated HBASE-5180:
-

Description: book.xml - the scanner example wasn't closing the 
ResultScanner!  that's bad practice.  (was: book.xml - the scanner example 
wasn't closing the scanner!  that's bad practice.)

 [book] book.xml - fixed scanner example
 ---

 Key: HBASE-5180
 URL: https://issues.apache.org/jira/browse/HBASE-5180
 Project: HBase
  Issue Type: Bug
Reporter: Doug Meil
Assignee: Doug Meil
 Attachments: book_HBASE_5180.xml.patch


 book.xml - the scanner example wasn't closing the ResultScanner!  that's bad 
 practice.


[jira] [Updated] (HBASE-5180) [book] book.xml - fixed scanner example


 [ 
https://issues.apache.org/jira/browse/HBASE-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Meil updated HBASE-5180:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 [book] book.xml - fixed scanner example
 ---

 Key: HBASE-5180
 URL: https://issues.apache.org/jira/browse/HBASE-5180
 Project: HBase
  Issue Type: Bug
Reporter: Doug Meil
Assignee: Doug Meil
 Attachments: book_HBASE_5180.xml.patch


 book.xml - the scanner example wasn't closing the ResultScanner!  that's bad 
 practice.


[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction

[
https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Doug Meil updated HBASE-5129:
-

Assignee: Doug Meil

book is inconsistent regarding disabling - major compaction
---

Key: HBASE-5129
URL: https://issues.apache.org/jira/browse/HBASE-5129
Project: HBase
Issue Type: Bug
Components: documentation
Affects Versions: 0.90.1
Reporter: Mikael Sitruk
Assignee: Doug Meil
Priority: Minor

It seems that the book has some inconsistencies regarding the way to disable
major compactions.
According to the book in chapter 2.6.1.1. HBase Default Configuration:
hbase.hregion.majorcompaction - The time (in milliseconds) between 'major'
compactions of all HStoreFiles in a region. Default: 1 day. Set to 0 to
disable automated major compactions.
Default: 86400000
(http://hbase.apache.org/book.html#hbase_default_configurations)
According to the book at chapter 2.8.2.8. Managed Compactions:
A common administrative technique is to manage major compactions manually,
rather than letting HBase do it. By default,
HConstants.MAJOR_COMPACTION_PERIOD is one day and major compactions may kick
in when you least desire it - especially on a busy system. To turn off
automatic major compactions set the value to Long.MAX_VALUE.
According to the code in org.apache.hadoop.hbase.regionserver.Store.java, 0 is
the right answer.
(affects all documentation from 0.90.1)

[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction

[
https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Doug Meil updated HBASE-5129:
-

Attachment: configuration_HBASE_5129.xml.patch

book is inconsistent regarding disabling - major compaction
---

[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction

[
https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Doug Meil updated HBASE-5129:
-

Resolution: Fixed
Status: Resolved (was: Patch Available)

book is inconsistent regarding disabling - major compaction
---

[jira] [Commented] (HBASE-5129) book is inconsistent regarding disabling - major compaction

2012-01-11 Thread Doug Meil (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184437#comment-13184437
]

Doug Meil commented on HBASE-5129:
--

Thanks for the catch Mikael!

book is inconsistent regarding disabling - major compaction
---

[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction

[
https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Doug Meil updated HBASE-5129:
-

Status: Patch Available (was: Open)

book is inconsistent regarding disabling - major compaction
---

[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Attachment: (was: 5179-90.txt)

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server as of step 3.
 However, while step 4 (assigning regions) is in progress, 
 ServerShutdownHandler may still be splitting the log, so data loss may result.


[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Attachment: 5179-90.txt

New patch for 0.90
Now TestRollingRestart passes.

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server as of step 3.
 However, while step 4 (assigning regions) is in progress, 
 ServerShutdownHandler may still be splitting the log, so data loss may result.


[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Comment: was deleted

(was: TestRollingRestart fails in 0.90 with patch.)

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server as of step 3.
 However, while step 4 (assigning regions) is in progress, 
 ServerShutdownHandler may still be splitting the log, so data loss may result.


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184448#comment-13184448
 ] 

Zhihong Yu commented on HBASE-5179:
---

I think the reason Chunhui introduced a new Set for the dead servers being 
processed is that DeadServer is supposed to remember dead servers:
{code}
   * Set of known dead servers.  On znode expiration, servers are added here.
{code}
DeadServer.cleanPreviousInstance() is called by ServerManager.checkIsDead() 
when the server becomes live again.

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server as of step 3.
 However, while step 4 (assigning regions) is in progress, 
 ServerShutdownHandler may still be splitting the log, so data loss may result.


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184450#comment-13184450
]

Hadoop QA commented on HBASE-5179:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510261/5179-90.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/733//console

This message is automatically generated.

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Zhihong Yu updated HBASE-5179:
--

Comment: was deleted

(was: -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510261/5179-90.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/733//console

This message is automatically generated.)

Concurrent processing of processFaileOver and ServerShutdownHandler may
cause region is assigned before completing split log, it would cause data loss
---

[jira] [Created] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly

2012-01-11 Thread Scott Chen (Created) (JIRA)

TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
--

 Key: HBASE-5182
 URL: https://issues.apache.org/jira/browse/HBASE-5182
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Scott Chen
Priority: Minor


TBoundedThreadPoolServer does not take the configured threadKeepAliveTimeSec. 
It uses the default value instead.
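
The kind of bug described can be illustrated with the JDK's own thread pool. This is a minimal sketch, not the TBoundedThreadPoolServer code: the Options class, its field names other than threadKeepAliveTimeSec, the default of 60 seconds, and buildPool are all assumptions made for the example. The point is only that the configured value, rather than a constant default, has to be the one handed to the executor.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the actual server implementation.
class KeepAliveSketch {

    /** Assumed default, used only when nothing is configured. */
    static final int DEFAULT_KEEP_ALIVE_SEC = 60;

    /** Stand-in for the server options; values are illustrative. */
    static final class Options {
        int minWorkerThreads = 2;
        int maxWorkerThreads = 8;
        int threadKeepAliveTimeSec = DEFAULT_KEEP_ALIVE_SEC;
    }

    /** Builds the pool from the options, including the keep-alive time. */
    static ThreadPoolExecutor buildPool(Options options) {
        return new ThreadPoolExecutor(
            options.minWorkerThreads,
            options.maxWorkerThreads,
            // Pass the configured value here, not DEFAULT_KEEP_ALIVE_SEC.
            options.threadKeepAliveTimeSec,
            TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>());
    }
}
```

Reading the value back with getKeepAliveTime(TimeUnit.SECONDS) on the returned executor is a cheap way to check in a unit test that the setting was honored.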


[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss


 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5179:
--

Attachment: 5179-v3.txt

Patch v3 addresses Stack's comments

Some names are open to suggestion.

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, 5179-v3.txt, hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server as of step 3.
 However, while step 4 (assigning regions) is in progress, 
 ServerShutdownHandler may still be splitting the log, so data loss may result.


[jira] [Updated] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly

2012-01-11 Thread Scott Chen (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Chen updated HBASE-5182:
--

Attachment: hbase-5182.txt

 TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
 --

 Key: HBASE-5182
 URL: https://issues.apache.org/jira/browse/HBASE-5182
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Scott Chen
Priority: Minor
 Attachments: hbase-5182.txt


 TBoundedThreadPoolServer does not take the configured threadKeepAliveTimeSec. 
 It uses the default value instead.


[jira] [Commented] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly


[ 
https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184474#comment-13184474
 ] 

Zhihong Yu commented on HBASE-5182:
---

+1 on patch.

 TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
 --

 Key: HBASE-5182
 URL: https://issues.apache.org/jira/browse/HBASE-5182
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Scott Chen
Priority: Minor
 Attachments: hbase-5182.txt


 TBoundedThreadPoolServer does not take the configured threadKeepAliveTimeSec. 
 It uses the default value instead.


[jira] [Commented] (HBASE-5181) Improve error message when Master fail-over happens and ZK unassigned node contains stale znode(s)


[ 
https://issues.apache.org/jira/browse/HBASE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184472#comment-13184472
 ] 

Zhihong Yu commented on HBASE-5181:
---

Thanks for the suggestion, Mubarak.

Do you want to attach a patch ?

 Improve error message when Master fail-over happens and ZK unassigned node 
 contains stale znode(s)
 --

 Key: HBASE-5181
 URL: https://issues.apache.org/jira/browse/HBASE-5181
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0, 0.90.5
Reporter: Mubarak Seyed
Priority: Minor
  Labels: noob

 When master fail-over happens, if we have a number of RITs under 
 /hbase/unassigned and if we have stale znode(s) (encoded region names) under 
 /hbase/unassigned, we get
 {code}
 2011-12-30 10:27:35,623 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
 2011-12-30 10:27:36,002 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 1717 regions in transition
 2011-12-30 10:27:36,004 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
 java.lang.ArrayIndexOutOfBoundsException: -256
     at org.apache.hadoop.hbase.executor.RegionTransitionData.readFields(RegionTransitionData.java:148)
     at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:105)
     at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
     at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198)
     at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:743)
     at org.apache.hadoop.hbase.master.AssignmentManager.processRegionInTransition(AssignmentManager.java:262)
     at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:223)
     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:401)
     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}
 and the message gives no clue about which znode is stale, so it is unclear how 
 to clean up the stale znode(s) under /hbase/unassigned using zkCli.sh 
 (delete /hbase/unassigned/<bad region name>). It would be good to include the 
 bad region name in the IOException thrown from 
 RegionTransitionData.readFields().
 {code}
 @Override
 public void readFields(DataInput in) throws IOException {
   // the event type byte
   eventType = EventType.values()[in.readShort()];
   // the timestamp
   stamp = in.readLong();
   // the encoded name of the region being transitioned
   regionName = Bytes.readByteArray(in);
   // remaining fields are optional so prefixed with boolean
   // the name of the regionserver sending the data
   if (in.readBoolean()) {
     byte [] versionedBytes = Bytes.readByteArray(in);
     this.origin = ServerName.parseVersionedServerName(versionedBytes);
   }
   if (in.readBoolean()) {
     this.payload = Bytes.readByteArray(in);
   }
 }
 {code}
 If execution gets as far as reading regionName, we can include the regionName 
 in the IOException, together with an error message explaining how to clean up 
 the stale znode(s) under /hbase/unassigned.
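
The idea in the description above can be sketched in plain Java without any HBase dependency. This is only an illustration under stated assumptions, not the actual patch: the class TransitionDataSketch, its EventType values, the fromBytes(znode, data) signature, and the int-length framing of the region name are all hypothetical stand-ins (the real code uses Bytes.readByteArray). Note that in the trace above the failure happens while decoding the event type, before regionName has been read, so the sketch also carries the znode path supplied by the caller into the message.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Illustrative stand-in for RegionTransitionData; not the HBase class. */
class TransitionDataSketch {
  /** Hypothetical event types; the real EventType enum differs. */
  enum EventType { CLOSING, CLOSED, OPENING, OPENED }

  EventType eventType;
  long stamp;
  byte[] regionName;

  /** Reads the fields, validating the ordinal instead of indexing blindly. */
  void readFields(DataInput in) throws IOException {
    int ordinal = in.readShort();
    EventType[] types = EventType.values();
    if (ordinal < 0 || ordinal >= types.length) {
      throw new IOException("Unknown event type ordinal " + ordinal);
    }
    eventType = types[ordinal];
    stamp = in.readLong();
    // Simplified framing (assumption): int length followed by the name bytes.
    int len = in.readInt();
    if (len < 0) {
      throw new IOException("Negative region name length " + len);
    }
    byte[] name = new byte[len];
    in.readFully(name);
    regionName = name;
  }

  /**
   * Parses the data and, on any failure, rethrows an IOException that names
   * the znode and, if it was read, the region name.
   */
  static TransitionDataSketch fromBytes(String znode, byte[] data) throws IOException {
    TransitionDataSketch d = new TransitionDataSketch();
    try {
      d.readFields(new DataInputStream(new ByteArrayInputStream(data)));
      return d;
    } catch (IOException | RuntimeException e) {
      String region = d.regionName == null
          ? "<not read>"
          : new String(d.regionName, StandardCharsets.UTF_8);
      throw new IOException("Could not parse region transition data in znode "
          + znode + " (region name: " + region + "); the znode may be stale", e);
    }
  }
}
```

With a message of this shape, an operator would know which path under /hbase/unassigned to inspect before deleting anything with zkCli.sh.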


[jira] [Assigned] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly

2012-01-11 Thread Zhihong Yu (Assigned) (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu reassigned HBASE-5182:
-

Assignee: Scott Chen

 TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
 --

 Key: HBASE-5182
 URL: https://issues.apache.org/jira/browse/HBASE-5182
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Scott Chen
Assignee: Scott Chen
Priority: Minor
 Fix For: 0.94.0

 Attachments: hbase-5182.txt


 TBoundedThreadPoolServer does not take the configured threadKeepAliveTimeSec. 
 It uses the default value instead.
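
A minimal sketch of the intended behaviour, assuming a java.util.concurrent.ThreadPoolExecutor as the underlying pool: the keep-alive value read from configuration is the one passed to the executor, and the default is used only when nothing is configured. The class name KeepAliveSketch, the property key, the default of 60 seconds, and the pool sizes are hypothetical; the report does not say how the default came to be used, and the attached patch may look quite different.

```java
import java.util.Properties;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative only; not TBoundedThreadPoolServer. */
class KeepAliveSketch {
  // Hypothetical key and default, chosen for this example.
  static final String KEEP_ALIVE_KEY = "example.thrift.threadKeepAliveTimeSec";
  static final int DEFAULT_KEEP_ALIVE_SEC = 60;

  /** Builds a pool whose keep-alive time comes from the configuration. */
  static ThreadPoolExecutor createPool(Properties conf) {
    // Fall back to the default only when the key is absent.
    int keepAliveSec = Integer.parseInt(
        conf.getProperty(KEEP_ALIVE_KEY, String.valueOf(DEFAULT_KEEP_ALIVE_SEC)));
    return new ThreadPoolExecutor(
        2,                        // core pool size (arbitrary for the sketch)
        8,                        // maximum pool size (arbitrary for the sketch)
        keepAliveSec,             // the configured value, not a hard-coded default
        TimeUnit.SECONDS,
        new LinkedBlockingQueue<Runnable>());
  }
}
```

Calling getKeepAliveTime(TimeUnit.SECONDS) on the returned executor is a quick way to confirm that the configured value took effect.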


[jira] [Updated] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly


 [ 
https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5182:
--

Status: Patch Available  (was: Open)



[jira] [Commented] (HBASE-5181) Improve error message when Master fail-over happens and ZK unassigned node contains stale znode(s)

2012-01-11 Thread Mubarak Seyed (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184478#comment-13184478
 ] 

Mubarak Seyed commented on HBASE-5181:
--

Working on corporate approval to contribute this patch. Thanks.



[jira] [Updated] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly


 [ 
https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-5182:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk.  Thanks for the patch Scott.



[jira] [Commented] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.