[jira] [Commented] (HBASE-11394) Replication can have data loss if peer id contains hyphen "-"
[ https://issues.apache.org/jira/browse/HBASE-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041580#comment-14041580 ] Jieshan Bean commented on HBASE-11394: -- I have the same doubt. We have added a restriction on the peer-id name in our private version. Replication can have data loss if peer id contains hyphen "-" Key: HBASE-11394 URL: https://issues.apache.org/jira/browse/HBASE-11394 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Fix For: 0.99.0, 0.98.4 This is an extension to HBASE-8207. It seems that there is no check on the format of the peer id string (the short name for the replication peer). So if a peer id contains "-", it can silently cause data loss on server failure. I did not verify the claim via testing, though; this is purely from reading the code. -- This message was sent by Atlassian JIRA (v6.2#6252)
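As an illustration of the restriction mentioned above, a minimal validation sketch (the helper class and pattern are hypothetical, not the actual private-version code; ZK-based replication embeds peer ids into znode names, which is why a hyphen is ambiguous when the name is parsed back):
{code:java}
import java.util.regex.Pattern;

public final class PeerIdValidator {
  // Allow only letters and digits, so the peer id can never collide with
  // the "-" separators used when peer ids are embedded in znode names.
  private static final Pattern VALID_PEER_ID = Pattern.compile("[a-zA-Z0-9]+");

  public static void checkPeerId(String peerId) {
    if (peerId == null || !VALID_PEER_ID.matcher(peerId).matches()) {
      throw new IllegalArgumentException(
          "Invalid peer id '" + peerId + "': only letters and digits are allowed");
    }
  }

  private PeerIdValidator() {
  }
}
{code}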
[jira] [Commented] (HBASE-11344) Hide row keys and such from the web UIs
[ https://issues.apache.org/jira/browse/HBASE-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038214#comment-14038214 ] Jieshan Bean commented on HBASE-11344: -- +1 on this idea. We are facing the same security problem. Hide row keys and such from the web UIs --- Key: HBASE-11344 URL: https://issues.apache.org/jira/browse/HBASE-11344 Project: HBase Issue Type: Improvement Reporter: Devaraj Das Fix For: 0.99.0 The table details on the master UI list the start row keys of the regions. The row keys might contain sensitive data. We should hide them based on whether or not the accessing user has the required authorization to view the table. To start with, we could make the display of row keys and such depend on a configuration being true or false. If it is false, such potentially sensitive data is never displayed on the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
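A minimal sketch of the configuration-based approach described above, assuming a boolean property gates the display (the property name here is illustrative, not necessarily what the patch uses):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class KeyDisplay {
  // When the flag is false, potentially sensitive row keys are never rendered.
  public static String displayKey(Configuration conf, String startKey) {
    if (!conf.getBoolean("hbase.display.keys", true)) {
      return "<hidden>";
    }
    return startKey;
  }
}
{code}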
[jira] [Assigned] (HBASE-9081) Online split for a reserved empty region
[ https://issues.apache.org/jira/browse/HBASE-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-9081: --- Assignee: Jieshan Bean Online split for a reserved empty region - Key: HBASE-9081 URL: https://issues.apache.org/jira/browse/HBASE-9081 Project: HBase Issue Type: New Feature Components: master, regionserver Reporter: Jieshan Bean Assignee: Jieshan Bean We already have a region splitter tool, but it can only provide limited functions: 1. Create a table with a specified region number without giving any splits. 2. Roll-split on an existing region. We have the following user scenario: a table was created with splits like a, b, c, d, e, f, g, o. g~o is a reserved empty region that will be used only after some days, so we don't know its rowkey distribution currently; we will split it only when it gets used. Say we want to split g~o into 10 new regions, like g, g1, g2, ..., g9, o. I didn't find that a similar function already exists; please tell me if I am wrong. Hope to hear your ideas on this:) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
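The intermediate split points themselves can already be computed with the existing utility; what this issue asks for is applying them to a live region online. A sketch, assuming Bytes.split returns the requested number of evenly spaced keys plus the two boundaries (note the one-byte range g~o only spans 8 values, so a finer split such as 10 regions would need longer keys):
{code:java}
import org.apache.hadoop.hbase.util.Bytes;

public class SplitPoints {
  public static void main(String[] args) {
    // 7 intermediate keys plus the two boundaries: the boundaries of 8 new regions.
    byte[][] keys = Bytes.split(Bytes.toBytes("g"), Bytes.toBytes("o"), 7);
    for (byte[] key : keys) {
      System.out.println(Bytes.toStringBinary(key)); // g, h, i, ..., n, o
    }
  }
}
{code}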
[jira] [Updated] (HBASE-9081) Online split for a reserved empty region
[ https://issues.apache.org/jira/browse/HBASE-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-9081: Summary: Online split for a reserved empty region (was: Online split for a reserved empty region with splits = 1) Online split for a reserved empty region - Key: HBASE-9081 URL: https://issues.apache.org/jira/browse/HBASE-9081 Project: HBase Issue Type: New Feature Components: master, regionserver Reporter: Jieshan Bean We already have a region splitter tool, but it can only provide limited functions: 1. Create a table with a specified region number without giving any splits. 2. Roll-split on an existing region. We have the following user scenario: a table was created with splits like a, b, c, d, e, f, g, o. g~o is a reserved empty region that will be used only after some days, so we don't know its rowkey distribution currently; we will split it only when it gets used. Say we want to split g~o into 10 new regions, like g, g1, g2, ..., g9, o. I didn't find that a similar function already exists; please tell me if I am wrong. Hope to hear your ideas on this:) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-9081) Online split for a reserved empty region with splits = 1
[ https://issues.apache.org/jira/browse/HBASE-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-9081: Summary: Online split for a reserved empty region with splits = 1 (was: Online split for a reserved empty region with splits > 1) Online split for a reserved empty region with splits = 1 -- Key: HBASE-9081 URL: https://issues.apache.org/jira/browse/HBASE-9081 Project: HBase Issue Type: New Feature Components: master, regionserver Reporter: Jieshan Bean We already have a region splitter tool, but it can only provide limited functions: 1. Create a table with a specified region number without giving any splits. 2. Roll-split on an existing region. We have the following user scenario: a table was created with splits like a, b, c, d, e, f, g, o. g~o is a reserved empty region that will be used only after some days, so we don't know its rowkey distribution currently; we will split it only when it gets used. Say we want to split g~o into 10 new regions, like g, g1, g2, ..., g9, o. I didn't find that a similar function already exists; please tell me if I am wrong. Hope to hear your ideas on this:) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-9081) Online split for a reserved empty region with splits > 1
Jieshan Bean created HBASE-9081: --- Summary: Online split for a reserved empty region with splits > 1 Key: HBASE-9081 URL: https://issues.apache.org/jira/browse/HBASE-9081 Project: HBase Issue Type: New Feature Components: master, regionserver Reporter: Jieshan Bean We already have a region splitter tool, but it can only provide limited functions: 1. Create a table with a specified region number without giving any splits. 2. Roll-split on an existing region. We have the following user scenario: a table was created with splits like a, b, c, d, e, f, g, o. g~o is a reserved empty region that will be used only after some days, so we don't know its rowkey distribution currently; we will split it only when it gets used. Say we want to split g~o into 10 new regions, like g, g1, g2, ..., g9, o. I didn't find that a similar function already exists; please tell me if I am wrong. Hope to hear your ideas on this:) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8927) Use nano time instead of milli time everywhere
[ https://issues.apache.org/jira/browse/HBASE-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705568#comment-13705568 ] Jieshan Bean commented on HBASE-8927: - I think System.nanoTime() cannot be used as a timestamp; it does not guarantee wall-clock accuracy. Use nano time instead of milli time everywhere - Key: HBASE-8927 URL: https://issues.apache.org/jira/browse/HBASE-8927 Project: HBase Issue Type: Bug Reporter: stack Attachments: 8927.txt Fewer collisions, and we are paying the price of a long anyway, so we might as well fill it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
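To make the concern above concrete: System.nanoTime() has an arbitrary origin and is only specified for measuring elapsed time, so its raw value cannot serve as a wall-clock timestamp the way System.currentTimeMillis() can. A small demonstration:
{code:java}
public class NanoTimeDemo {
  public static void main(String[] args) {
    long wallClock = System.currentTimeMillis(); // milliseconds since the epoch
    long monotonic = System.nanoTime();          // nanoseconds since an arbitrary origin
    System.out.println("epoch millis: " + wallClock);
    // Bears no fixed relation to the epoch and may even be negative.
    System.out.println("nanoTime:     " + monotonic);
  }
}
{code}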
[jira] [Commented] (HBASE-8892) should pick files as old as possible also while hasReferences
[ https://issues.apache.org/jira/browse/HBASE-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701845#comment-13701845 ] Jieshan Bean commented on HBASE-8892: - Generally speaking, an older file is bigger than a newer one, and may include reference files at the beginning. So I don't think this change is reasonable, if I understand the code correctly:). [~xieliang007], any other reasons? should pick files as old as possible also while hasReferences --- Key: HBASE-8892 URL: https://issues.apache.org/jira/browse/HBASE-8892 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 0.94.9 Reporter: Liang Xie Assignee: Liang Xie Priority: Minor Attachments: HBase-8892-0.94.txt Currently, while hasReferences for compactSelection, if compactSelection.getFilesToCompact() has more than maxFilesToCompact files, we clear the files from the beginning. This differs from the normal ratio-based minor compaction policy, which tries to do compactSelection from older to newer files as far as possible.
{code}
} else if (compactSelection.getFilesToCompact().size() > this.maxFilesToCompact) {
  // all files included in this compaction, up to max
  int pastMax = compactSelection.getFilesToCompact().size() - this.maxFilesToCompact;
  compactSelection.getFilesToCompact().subList(0, pastMax).clear();
{code}
It makes the beginning files more difficult to pick up in a future minor compaction stage. IMHO, it should be like this:
{code}
compactSelection.getFilesToCompact()
    .subList(this.maxFilesToCompact, compactSelection.getFilesToCompact().size())
    .clear();
{code}
It's not a big issue, since it occurs only while hasReferences returns true. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
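A self-contained illustration of the two trimming strategies discussed above, with plain java.util lists standing in for the store file selection: clearing the head drops the oldest files, while the proposed tail-clearing keeps them.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrimDemo {
  public static void main(String[] args) {
    int maxFilesToCompact = 3;

    // Current behavior: drop from the beginning (the oldest files).
    List<String> head = new ArrayList<String>(Arrays.asList("f1", "f2", "f3", "f4", "f5"));
    int pastMax = head.size() - maxFilesToCompact;
    head.subList(0, pastMax).clear();
    System.out.println("clear head -> " + head); // [f3, f4, f5]

    // Proposed behavior: drop from the end (keep the oldest files).
    List<String> tail = new ArrayList<String>(Arrays.asList("f1", "f2", "f3", "f4", "f5"));
    tail.subList(maxFilesToCompact, tail.size()).clear();
    System.out.println("clear tail -> " + tail); // [f1, f2, f3]
  }
}
{code}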
[jira] [Commented] (HBASE-8772) Separate Replication from HBase RegionServer process
[ https://issues.apache.org/jira/browse/HBASE-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692886#comment-13692886 ] Jieshan Bean commented on HBASE-8772: - Agreed with J-D. Below are some key points to consider, in my understanding:
1. A separate process only for ReplicationSource? ReplicationSink could also be impacted by GC triggered by the RegionServer, although ReplicationSink is not a separate thread currently.
2. Need to introduce a new RPC interface? RegionInterface cannot be used any more.
3. The new process needs to track logs itself.
4. Queue failover is more complicated, since the RegionServer may have aborted while the replication process is still there, and vice versa. So each replication process should be registered in ZooKeeper and tracked by each RegionServer.
5. Support for security.
Separate Replication from HBase RegionServer process Key: HBASE-8772 URL: https://issues.apache.org/jira/browse/HBASE-8772 Project: HBase Issue Type: New Feature Components: regionserver, Replication Reporter: Sameer Vaishampayan Labels: performance Replication is separate functionality from managing regions and should be manageable as a separate service rather than rolled into the RegionServer. Load on the RegionServer, GC, etc. shouldn't affect the replication service. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8774) Add BatchSize and Filter to Thrift2
[ https://issues.apache.org/jira/browse/HBASE-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689107#comment-13689107 ] Jieshan Bean commented on HBASE-8774: - See HBASE-6073; it also concerns adding filters to Thrift2. One minor problem in the patch:
{code}
+boolean this_present_filterString = true && this.isSetFilterString();
+boolean that_present_filterString = true && that.isSetFilterString();
{code}
The "true &&" is redundant. In addition, I suggest adding a unit test. Anyway, it's a nice patch. Add BatchSize and Filter to Thrift2 --- Key: HBASE-8774 URL: https://issues.apache.org/jira/browse/HBASE-8774 Project: HBase Issue Type: New Feature Components: Thrift Affects Versions: 0.95.1 Reporter: Hamed Madani Attachments: HBASE_8774.patch The attached patch adds BatchSize and Filter support to Thrift2. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8562) Close readers after compaction
[ https://issues.apache.org/jira/browse/HBASE-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661258#comment-13661258 ] Jieshan Bean commented on HBASE-8562: - StoreFileScanners use a shared reader at the store file level, so they do not need to close the reader, right? Close readers after compaction -- Key: HBASE-8562 URL: https://issues.apache.org/jira/browse/HBASE-8562 Project: HBase Issue Type: Bug Reporter: Amitanand Aiyer Assignee: Amitanand Aiyer Priority: Trivial StoreFileScanners open readers to read the store file. However, these readers are not closed upon StoreFileScanner.close(). They should be closed at the end of the compaction. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8563) Double count of read requests for Gets
[ https://issues.apache.org/jira/browse/HBASE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13660218#comment-13660218 ] Jieshan Bean commented on HBASE-8563: - Makes sense to me. +1. Double count of read requests for Gets --- Key: HBASE-8563 URL: https://issues.apache.org/jira/browse/HBASE-8563 Project: HBase Issue Type: Bug Reporter: Francis Liu Assignee: Francis Liu Fix For: 0.94.7 Attachments: HBASE-8563_94.patch Whenever a RegionScanner is created via HRegion.getScanner(), the read request count is incremented. Since a Get is implemented internally as a scan, each Get request is counted twice. Scans will have an extra count as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8253) A corrupted log blocked ReplicationSource
[ https://issues.apache.org/jira/browse/HBASE-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652954#comment-13652954 ] Jieshan Bean commented on HBASE-8253: - bq. the edit was then written into the next log and durability was ensured? Yes. bq. So we just need to skip over this one? Yes. Just skip this one:) A corrupted log blocked ReplicationSource - Key: HBASE-8253 URL: https://issues.apache.org/jira/browse/HBASE-8253 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8253-94.patch A log being written got corrupted when we forcibly powered down one node. Only part of the last WALEdit was written into that log, and that log was not the last one in the replication queue. ReplicationSource was blocked under this scenario. A lot of logs like the below were printed:
{noformat}
2013-03-30 06:53:48,628 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:334)
java.io.EOFException: hdfs://hacluster/hbase/.logs/master11,26003,1364530862620/master11%2C26003%2C1364530862620.1364553936510, entryStart=40434738, pos=40450048, end=40450048, edit=0
	at sun.reflect.GeneratedConstructorAccessor42.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:295)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:240)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:84)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:412)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:330)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2282)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2181)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2227)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:238)
	... 3 more
...
2013-03-30 06:54:38,899 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: (the same EOFException and stack trace, repeated)
{noformat}
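A hedged sketch of the skip logic discussed above: when the reader hits EOF mid-entry on a log that is not the last one in the queue, the tail was never fully written and the edit was re-written to the next log, so it is safe to move on. The names here are illustrative, not the actual patch:
{code:java}
import java.io.EOFException;
import java.io.IOException;

public class SkipSketch {
  interface LogReader {
    Object next() throws IOException;
  }

  // Returns false when the caller should advance to the next log in the queue.
  boolean readNext(LogReader reader, boolean isLastLogInQueue) throws IOException {
    try {
      return reader.next() != null;
    } catch (EOFException eof) {
      if (!isLastLogInQueue) {
        // Partial trailing entry from a power failure: durability was
        // ensured by the next log, so skip instead of retrying forever.
        return false;
      }
      throw eof;
    }
  }
}
{code}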
[jira] [Commented] (HBASE-8253) A corrupted log blocked ReplicationSource
[ https://issues.apache.org/jira/browse/HBASE-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648093#comment-13648093 ] Jieshan Bean commented on HBASE-8253: - bq. how come you got this error in a normal source? Yes, it happened in a normal source, not a recovered one. The primary data node was forcibly powered down, so a log roll was requested, and during that time only part of the last edit was written. Then we saw this problem. A corrupted log blocked ReplicationSource - Key: HBASE-8253 URL: https://issues.apache.org/jira/browse/HBASE-8253 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8253-94.patch A log being written got corrupted when we forcibly powered down one node. Only part of the last WALEdit was written into that log, and that log was not the last one in the replication queue. ReplicationSource was blocked under this scenario. A lot of logs like the below were printed:
{noformat}
2013-03-30 06:53:48,628 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:334)
java.io.EOFException: hdfs://hacluster/hbase/.logs/master11,26003,1364530862620/master11%2C26003%2C1364530862620.1364553936510, entryStart=40434738, pos=40450048, end=40450048, edit=0
	at sun.reflect.GeneratedConstructorAccessor42.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:295)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:240)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:84)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:412)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:330)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2282)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2181)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2227)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:238)
	... 3 more
...
2013-03-30 06:54:38,899 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: (the same EOFException and stack trace, repeated)
{noformat}
[jira] [Commented] (HBASE-6428) Pluggable Compaction policies
[ https://issues.apache.org/jira/browse/HBASE-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641303#comment-13641303 ] Jieshan Bean commented on HBASE-6428: - Thanks for your reply, this is really a good feature:) Pluggable Compaction policies - Key: HBASE-6428 URL: https://issues.apache.org/jira/browse/HBASE-6428 Project: HBase Issue Type: New Feature Reporter: Lars Hofhansl For some use cases it is useful to allow more control over how KVs get compacted. For example, one could envision storing old versions of a KV in separate HFiles, which then rarely have to be touched/cached by queries querying for new data. In addition, these date-ranged HFiles can easily be used for backups while maintaining historical data. This would be a major change, allowing compactions to provide multiple targets (not just a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6428) Pluggable Compaction policies
[ https://issues.apache.org/jira/browse/HBASE-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638881#comment-13638881 ] Jieshan Bean commented on HBASE-6428: - [~lhofhansl] Any updates on this? Pluggable Compaction policies - Key: HBASE-6428 URL: https://issues.apache.org/jira/browse/HBASE-6428 Project: HBase Issue Type: New Feature Reporter: Lars Hofhansl For some use cases it is useful to allow more control over how KVs get compacted. For example, one could envision storing old versions of a KV in separate HFiles, which then rarely have to be touched/cached by queries querying for new data. In addition, these date-ranged HFiles can easily be used for backups while maintaining historical data. This would be a major change, allowing compactions to provide multiple targets (not just a filter). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8361) Bulk load and other utilities should not create tables for user
[ https://issues.apache.org/jira/browse/HBASE-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634851#comment-13634851 ] Jieshan Bean commented on HBASE-8361: - bq. The tools should error when the destination table does not exist. +1. Such tools should not silently create a table that may not be the one the user intended. Bulk load and other utilities should not create tables for user --- Key: HBASE-8361 URL: https://issues.apache.org/jira/browse/HBASE-8361 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Nick Dimiduk {{LoadIncrementalHFiles}} and {{ImportTsv}} will create a table with the default settings when the target table does not exist. I think this is an anti-feature. Neither tool provides a mechanism for the user to configure the creation parameters of that table, resulting in a new table with the default settings. I think it is unlikely that the default settings are what the user actually wants. In the event of a table-name typo, that means data is silently loaded into the wrong place. The tools should error when the destination table does not exist. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
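A hedged sketch of the proposed fail-fast behavior, using the era-appropriate HBaseAdmin client (the wrapper method itself is hypothetical):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableNotFoundException;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TableCheck {
  public static void ensureTableExists(Configuration conf, String tableName)
      throws IOException {
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      if (!admin.tableExists(tableName)) {
        // Error out instead of silently creating a default-settings table.
        throw new TableNotFoundException("Table " + tableName
            + " does not exist; create it with the desired settings first");
      }
    } finally {
      admin.close();
    }
  }
}
{code}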
[jira] [Commented] (HBASE-8336) PooledHTable may be returned multiple times to the same pool
[ https://issues.apache.org/jira/browse/HBASE-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631451#comment-13631451 ] Jieshan Bean commented on HBASE-8336: - [~ngrigor...@gmail.com] Nice find. We encountered the same problem before. +1 on the idea of adding a flag to represent its state. PooledHTable may be returned multiple times to the same pool Key: HBASE-8336 URL: https://issues.apache.org/jira/browse/HBASE-8336 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.95.0 Reporter: Nikolai Grigoriev Priority: Minor I have recently observed a very strange issue with an application using HBase and HTablePool. After an investigation I found that the root cause was a piece of code that was calling close() twice on the same HTableInterface instance retrieved from HTablePool (created with the default policy). A closer look at the code revealed that PooledHTable.close() calls returnTable(), which, in turn, places the table back into the queue of pooled tables. No checking of any kind is done, so it is possible to call it multiple times and place multiple references to the same HTable into the same pool. This creates a number of negative effects:
- The pool grows on each close() call and eventually gets filled up with references to the same HTable. From this moment the pool stops working as a pool.
- Multiple callers will get the same instance of HTable while expecting to have unique instances.
- Once the pool is full, the next call to close() will result in a call to the real close() method of HTable. This will make the HTable unusable, as close() may shutdown() the internal thread pool. From this moment other attempts to use this HTable will fail with RejectedExecutionException. And since the HTablePool will have additional references to that HTable, other users of the pool will just start failing on any call that leads to flushCommits().
The problem was, obviously, triggered by bad code on our side. But I think the pool has to be protected. Probably the best way to fix it would be to implement a flag in PooledHTable that represents its state (leased/returned); once close() is called, it would be marked returned. From that moment any operation on this PooledHTable would result in something like IllegalStateException. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
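A minimal sketch of the leased/returned flag proposed above, as a hypothetical standalone wrapper; the real fix would live inside HTablePool's PooledHTable:
{code:java}
import java.io.IOException;

public class PooledHandle {
  private boolean returned = false;

  public synchronized void close() throws IOException {
    if (returned) {
      // A second close() must not put the handle into the pool again.
      throw new IOException("Table already returned to the pool");
    }
    returned = true;
    returnToPool();
  }

  // Called at the start of every table operation.
  protected synchronized void checkLeased() {
    if (returned) {
      throw new IllegalStateException("Table was returned to the pool");
    }
  }

  private void returnToPool() {
    // put this handle back on the pool's queue, exactly once
  }
}
{code}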
[jira] [Commented] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631453#comment-13631453 ] Jieshan Bean commented on HBASE-8251: - Yes, we can. enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8251: Resolution: Fixed Status: Resolved (was: Patch Available) enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631455#comment-13631455 ] Jieshan Bean commented on HBASE-8251: - Sorry, it should be resolved as a duplicate:(. I linked this issue to HBASE-7824. enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8325) ReplicationSource reading an empty HLog throws EOFException
[ https://issues.apache.org/jira/browse/HBASE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629652#comment-13629652 ] Jieshan Bean commented on HBASE-8325: - Agreed. The latest patch in HBASE-7122 covers this issue; I suggest resolving this one as a duplicate. ReplicationSource reading an empty HLog throws EOFException --- Key: HBASE-8325 URL: https://issues.apache.org/jira/browse/HBASE-8325 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.5 Environment: replication enabled Reporter: zavakid Priority: Critical I'm using the replication of HBase in my test environment. When a ReplicationSource opens an empty HLog, an EOFException is thrown. This is because the Reader can't read the SequenceFile's metadata: there's no data at all, so it throws the EOFException. Should we detect the empty file and process it, like we process the FileNotFoundException? Here's the code:
{code:java}
/**
 * Open a reader on the current path
 *
 * @param sleepMultiplier by how many times the default sleeping time is augmented
 * @return true if we should continue with that file, false if we are over with it
 */
protected boolean openReader(int sleepMultiplier) {
  try {
    LOG.debug("Opening log for replication " + this.currentPath.getName() +
        " at " + this.repLogReader.getPosition());
    try {
      this.reader = repLogReader.openReader(this.currentPath);
    } catch (FileNotFoundException fnfe) {
      if (this.queueRecovered) {
        // We didn't find the log in the archive directory, look if it still
        // exists in the dead RS folder (there could be a chain of failures
        // to look at)
        LOG.info("NB dead servers : " + deadRegionServers.length);
        for (int i = this.deadRegionServers.length - 1; i >= 0; i--) {
          Path deadRsDirectory = new Path(manager.getLogDir().getParent(),
              this.deadRegionServers[i]);
          Path[] locs = new Path[] {
              new Path(deadRsDirectory, currentPath.getName()),
              new Path(deadRsDirectory.suffix(HLog.SPLITTING_EXT), currentPath.getName()),
          };
          for (Path possibleLogLocation : locs) {
            LOG.info("Possible location " + possibleLogLocation.toUri().toString());
            if (this.manager.getFs().exists(possibleLogLocation)) {
              // We found the right new location
              LOG.info("Log " + this.currentPath + " still exists at " +
                  possibleLogLocation);
              // Breaking here will make us sleep since reader is null
              return true;
            }
          }
        }
        // TODO What happens if the log was missing from every single location?
        // Although we need to check a couple of times as the log could have
        // been moved by the master between the checks
        // It can also happen if a recovered queue wasn't properly cleaned,
        // such that the znode pointing to a log exists but the log was
        // deleted a long time ago.
        // For the moment, we'll throw the IO and processEndOfFile
        throw new IOException("File from recovered queue is " +
            "nowhere to be found", fnfe);
      } else {
        // If the log was archived, continue reading from there
        Path archivedLogLocation =
            new Path(manager.getOldLogDir(), currentPath.getName());
        if (this.manager.getFs().exists(archivedLogLocation)) {
          currentPath = archivedLogLocation;
          LOG.info("Log " + this.currentPath + " was moved to " +
              archivedLogLocation);
          // Open the log at the new location
          this.openReader(sleepMultiplier);
        }
        // TODO What happens the log is missing in both places?
      }
    }
  } catch (IOException ioe) {
    LOG.warn(peerClusterZnode + " Got: ", ioe);
    this.reader = null;
    // TODO Need a better way to determinate if a file is really gone but
    // TODO without scanning all logs dir
    if (sleepMultiplier == this.maxRetriesMultiplier) {
      LOG.warn("Waited too long for this file, considering dumping");
      return !processEndOfFile();
    }
  }
  return true;
}
{code}
There's a method called {code:java}processEndOfFile(){code}; should we add this case to it? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
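A hedged sketch of the empty-file check the reporter suggests; whether to fold it into processEndOfFile() is the open question above (the helper class is hypothetical, but FileSystem#getFileStatus is the standard HDFS call):
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmptyLogCheck {
  // A zero-length HLog has no SequenceFile header yet, so opening a reader
  // on it throws EOFException; detect that case up front instead.
  public static boolean isEmptyLog(FileSystem fs, Path log) throws IOException {
    return fs.getFileStatus(log).getLen() == 0;
  }
}
{code}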
[jira] [Commented] (HBASE-7122) Proper warning message when opening a log file with no entries (idle cluster)
[ https://issues.apache.org/jira/browse/HBASE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626323#comment-13626323 ] Jieshan Bean commented on HBASE-7122: - Yes, the HLog is not empty if it closed successfully, so we may not get an EOF. But if we get an IOE during close, what will happen? Proper warning message when opening a log file with no entries (idle cluster) - Key: HBASE-7122 URL: https://issues.apache.org/jira/browse/HBASE-7122 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.2 Reporter: Himanshu Vashishtha Assignee: Himanshu Vashishtha Fix For: 0.95.1 Attachments: HBase-7122-94.patch, HBase-7122-95.patch, HBase-7122.patch, HBASE-7122.v2.patch In case the cluster is idle and the log has rolled (offset to 0), the ReplicationSource tries to open the log and gets an EOF exception. This is printed every 10 seconds until an entry is inserted into the log.
{code}
2012-11-07 15:47:40,924 DEBUG regionserver.ReplicationSource (ReplicationSource.java:openReader(487)) - Opening log for replication c0315.hal.cloudera.com%2C40020%2C1352324202860.1352327804874 at 0
2012-11-07 15:47:40,926 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(543)) - 1 Got: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1508)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:491)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:290)
2012-11-07 15:47:40,927 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(547)) - Waited too long for this file, considering dumping
2012-11-07 15:47:40,927 DEBUG regionserver.ReplicationSource (ReplicationSource.java:sleepForRetries(562)) - Unable to open a reader, sleeping 1000 times 10
{code}
We should reduce the log spewing in this case (or print some informative message, based on the offset). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7122) Proper warning message when opening a log file with no entries (idle cluster)
[ https://issues.apache.org/jira/browse/HBASE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627326#comment-13627326 ] Jieshan Bean commented on HBASE-7122: - +1 on patch v2. Proper warning message when opening a log file with no entries (idle cluster) - Key: HBASE-7122 URL: https://issues.apache.org/jira/browse/HBASE-7122 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.2 Reporter: Himanshu Vashishtha Assignee: Himanshu Vashishtha Fix For: 0.95.1 Attachments: HBase-7122-94.patch, HBase-7122-95.patch, HBase-7122-95-v2.patch, HBase-7122.patch, HBASE-7122.v2.patch In case the cluster is idle and the log has rolled (offset to 0), the ReplicationSource tries to open the log and gets an EOF exception. This is printed every 10 seconds until an entry is inserted into the log.
{code}
2012-11-07 15:47:40,924 DEBUG regionserver.ReplicationSource (ReplicationSource.java:openReader(487)) - Opening log for replication c0315.hal.cloudera.com%2C40020%2C1352324202860.1352327804874 at 0
2012-11-07 15:47:40,926 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(543)) - 1 Got: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1508)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:491)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:290)
2012-11-07 15:47:40,927 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(547)) - Waited too long for this file, considering dumping
2012-11-07 15:47:40,927 DEBUG regionserver.ReplicationSource (ReplicationSource.java:sleepForRetries(562)) - Unable to open a reader, sleeping 1000 times 10
{code}
We should reduce the log spewing in this case (or print some informative message, based on the offset). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7122) Proper warning message when opening a log file with no entries (idle cluster)
[ https://issues.apache.org/jira/browse/HBASE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627359#comment-13627359 ] Jieshan Bean commented on HBASE-7122: - BTW, one minor comment:). Please add curly brackets to the code below:
{code}
+if (this.repLogReader.getPosition() == 0 && !queueRecovered && queue.size() == 0)
+  return true;
{code}
Proper warning message when opening a log file with no entries (idle cluster) - Key: HBASE-7122 URL: https://issues.apache.org/jira/browse/HBASE-7122 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.2 Reporter: Himanshu Vashishtha Assignee: Himanshu Vashishtha Fix For: 0.95.1 Attachments: HBase-7122-94.patch, HBase-7122-95.patch, HBase-7122-95-v2.patch, HBase-7122.patch, HBASE-7122.v2.patch In case the cluster is idle and the log has rolled (offset to 0), the ReplicationSource tries to open the log and gets an EOF exception. This is printed every 10 seconds until an entry is inserted into the log.
{code}
2012-11-07 15:47:40,924 DEBUG regionserver.ReplicationSource (ReplicationSource.java:openReader(487)) - Opening log for replication c0315.hal.cloudera.com%2C40020%2C1352324202860.1352327804874 at 0
2012-11-07 15:47:40,926 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(543)) - 1 Got: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1508)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:491)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:290)
2012-11-07 15:47:40,927 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(547)) - Waited too long for this file, considering dumping
2012-11-07 15:47:40,927 DEBUG regionserver.ReplicationSource (ReplicationSource.java:sleepForRetries(562)) - Unable to open a reader, sleeping 1000 times 10
{code}
We should reduce the log spewing in this case (or print some informative message, based on the offset). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627367#comment-13627367 ] Jieshan Bean commented on HBASE-8251: - [~jeffreyz] [~zjushch] Do you have further comments? Thank you. enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8251: Status: Patch Available (was: Open) enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627380#comment-13627380 ] Jieshan Bean commented on HBASE-8251: - It seems this approach conflicts with the patch in HBASE-7824. enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627448#comment-13627448 ] Jieshan Bean commented on HBASE-8251: - Yes. I'm reviewing that patch. It seems I missed a wonderful discussion:) enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7750) We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null
[ https://issues.apache.org/jira/browse/HBASE-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-7750: Attachment: HBASE-7750-trunk.patch HBASE-7750-94.patch We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null - Key: HBASE-7750 URL: https://issues.apache.org/jira/browse/HBASE-7750 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4, 0.95.2 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-7750-94.patch, HBASE-7750-trunk.patch It may be expected behavior, but I think it's better to do something. We configured hbase.replication as true in the master cluster and added a peer, but forgot to configure hbase.replication on the slave cluster side. ReplicationSource read HLogs, shipped log edits, and logged positions. Everything seemed alright, but the data was not present in the slave cluster. So I think the slave cluster should throw an exception back to the master cluster instead of returning directly:
{code}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) return;
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}
I would like to hear your comments on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
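For clarity, the quoted method with the proposed change applied; the exception message is illustrative, not the committed wording:
{code:java}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) {
    // Surface the misconfiguration to the master cluster instead of
    // silently dropping the shipped edits.
    throw new IOException("Replication is not enabled on this cluster: "
        + "check hbase.replication on the slave side");
  }
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}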
[jira] [Updated] (HBASE-7750) We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null
[ https://issues.apache.org/jira/browse/HBASE-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-7750: Status: Patch Available (was: Open) We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null - Key: HBASE-7750 URL: https://issues.apache.org/jira/browse/HBASE-7750 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4, 0.95.2 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-7750-94.patch, HBASE-7750-trunk.patch It may be expected behavior, but I think it's better to do something. We configured hbase.replication as true in the master cluster and added a peer, but forgot to configure hbase.replication on the slave cluster side. ReplicationSource read HLogs, shipped log edits, and logged positions. Everything seemed alright, but the data was not present in the slave cluster. So I think the slave cluster should throw an exception back to the master cluster instead of returning directly:
{code}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) return;
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}
I would like to hear your comments on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assigning META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625234#comment-13625234 ] Jieshan Bean commented on HBASE-8251: - bq. I'm not sure if it's a good modification for the scenario where a META location is pointing to an offline server or a wrong server. The key point is whether this offline or wrong server is a registered server on this master. If it is already registered, SSH will be triggered, and this patch can avoid the race. Otherwise, SSH would not happen. enable SSH before assigning META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS hosting ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7122) Proper warning message when opening a log file with no entries (idle cluster)
[ https://issues.apache.org/jira/browse/HBASE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626043#comment-13626043 ] Jieshan Bean commented on HBASE-7122: - I think adding this check is not enough. There is still a chance of an empty log (not the one being written) in the normal log list. So we need to check whether this log is the one in use. Proper warning message when opening a log file with no entries (idle cluster) - Key: HBASE-7122 URL: https://issues.apache.org/jira/browse/HBASE-7122 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.2 Reporter: Himanshu Vashishtha Assignee: Himanshu Vashishtha Fix For: 0.95.1 Attachments: HBase-7122-94.patch, HBase-7122-95.patch, HBase-7122.patch, HBASE-7122.v2.patch In case the cluster is idle and the log has rolled (offset to 0), the ReplicationSource tries to open the log and gets an EOF exception. This is printed every 10 seconds until an entry is inserted into the log.
{code}
2012-11-07 15:47:40,924 DEBUG regionserver.ReplicationSource (ReplicationSource.java:openReader(487)) - Opening log for replication c0315.hal.cloudera.com%2C40020%2C1352324202860.1352327804874 at 0
2012-11-07 15:47:40,926 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(543)) - 1 Got: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1508)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:491)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:290)
2012-11-07 15:47:40,927 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(547)) - Waited too long for this file, considering dumping
2012-11-07 15:47:40,927 DEBUG regionserver.ReplicationSource (ReplicationSource.java:sleepForRetries(562)) - Unable to open a reader, sleeping 1000 times 10
{code}
We should reduce the log spewing in this case (or print some informative message, based on the offset). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8266) Master cannot start if TableNotFoundException is thrown while partial table recovery
[ https://issues.apache.org/jira/browse/HBASE-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626149#comment-13626149 ] Jieshan Bean commented on HBASE-8266: - [~ram_krish] can we just skip calling handleEnableTable under this scenario? Patch looks good otherwise:) Master cannot start if TableNotFoundException is thrown while partial table recovery Key: HBASE-8266 URL: https://issues.apache.org/jira/browse/HBASE-8266 Project: HBase Issue Type: Bug Affects Versions: 0.94.6, 0.95.0 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.98.0, 0.94.7, 0.95.1 Attachments: HBASE-8266_0.94.patch, HBASE-8266_1.patch, HBASE-8266.patch I was trying to create a table. The table creation failed {code} java.io.IOException: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Could not instantiate a region instance. at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:133) at org.apache.hadoop.hbase.master.handler.CreateTableHandler.handleCreateHdfsRegions(CreateTableHandler.java:256) at org.apache.hadoop.hbase.master.handler.CreateTableHandler.handleCreateTable(CreateTableHandler.java:204) at org.apache.hadoop.hbase.master.handler.CreateTableHandler.process(CreateTableHandler.java:153) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:130) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Could not instantiate a region instance. at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:126) ... 7 more Caused by: java.lang.IllegalStateException: Could not instantiate a region instance. at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3765) at org.apache.hadoop.hbase.regionserver.HRegion.createHRegion(HRegion.java:3870) at org.apache.hadoop.hbase.util.ModifyRegionUtils$1.call(ModifyRegionUtils.java:106) at org.apache.hadoop.hbase.util.ModifyRegionUtils$1.call(ModifyRegionUtils.java:103) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) ... 3 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:3762) ... 11 more Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/CompoundConfiguration$1 at org.apache.hadoop.hbase.CompoundConfiguration.add(CompoundConfiguration.java:82) at org.apache.hadoop.hbase.regionserver.HRegion.init(HRegion.java:438) at org.apache.hadoop.hbase.regionserver.HRegion.init(HRegion.java:401) ... 
16 more {code} I am not sure of the above failure. The same setup is able to create new tables. Now the table is already in ENABLING state. The master was restarted. As the table was found in ENABLING state but not added to META, the EnableTableHandler failed: {code} 2013-04-03 18:33:03,850 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. org.apache.hadoop.hbase.exceptions.TableNotFoundException: TestTable at org.apache.hadoop.hbase.master.handler.EnableTableHandler.prepare(EnableTableHandler.java:89) at org.apache.hadoop.hbase.master.AssignmentManager.recoverTableInEnablingState(AssignmentManager.java:2586) at org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:390) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:777) at
[jira] [Commented] (HBASE-8302) HBase dynamic configuration
[ https://issues.apache.org/jira/browse/HBASE-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626157#comment-13626157 ] Jieshan Bean commented on HBASE-8302: - [~brianhbase] is this the same as HBASE-3909 and HBASE-8292? HBase dynamic configuration --- Key: HBASE-8302 URL: https://issues.apache.org/jira/browse/HBASE-8302 Project: HBase Issue Type: Improvement Affects Versions: 0.94.3 Reporter: Brian Fu If we change the HBase configuration, we need to restart the cluster for it to take effect; we want to be able to configure such parameters dynamically. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13624848#comment-13624848 ] Jieshan Bean commented on HBASE-8251: - Yes, one corner case is that the META RS is killed just before the new assignment happens (the kill happens either during the call to processRegionInTransitionAndBlockUntilAssigned or during verifyMetaRegionLocation), so rit is false and metaRegionLocation is false (only under this scenario may a new assignment be triggered). It may cause data loss and double assignment. Thanks, Chunhui, Rama and Jeffrey. I'm thinking about adding a check (whether currentMetaServer is an online server or a processing dead server; this needs some changes in ServerManager) before calling assignMeta. enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8251: Attachment: HBASE-8251-94-v2.patch New version of the patch for 94, addressed all the comments. enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13625064#comment-13625064 ] Jieshan Bean commented on HBASE-8251: - bq.*, currentMetaServer is marked dead by master. Then we still have two possible places to assign meta. This RS should belong to the set of ServerManager#onlineServers just before it is marked as dead (see the related code of ServerManager#onlineServers and DeadServer#processingDeadServers). So the below check returns true: {code} this.serverManager.isOnlineOrProcessingDeadServer(currentMetaServer) {code} So needToAssign is false. No new assignment would happen under this scenario. Correct me if I misunderstood you:). Thank you. enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch, HBASE-8251-94-v2.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
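A minimal sketch in Java of the needToAssign decision discussed above, assuming the isOnlineOrProcessingDeadServer helper that the patch introduces; the method shape is illustrative and not taken from the committed code.
{code}
// Sketch: assign META only when no other path (RIT, a live server, or SSH
// for a registered dead one) is already responsible for it, so the master
// cannot double-assign META.
boolean needToAssignMeta(ServerName currentMetaServer,
                         ServerManager serverManager,
                         boolean rit,
                         boolean metaRegionLocation) {
  if (rit || metaRegionLocation) {
    return false;  // assignment in progress, or META is already being served
  }
  // The check proposed in the patch: a registered server (online or being
  // processed by SSH) will get META reassigned through SSH instead.
  return !serverManager.isOnlineOrProcessingDeadServer(currentMetaServer);
}
{code}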
[jira] [Assigned] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-8251: --- Assignee: Jieshan Bean enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8251) enable SSH before assign META on Master startup
Jieshan Bean created HBASE-8251: --- Summary: enable SSH before assign META on Master startup Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620667#comment-13620667 ] Jieshan Bean commented on HBASE-8229: - +1 on the first version of the patch. Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Components: Replication Reporter: Lars Hofhansl Fix For: 0.95.1, 0.98.0, 0.94.7 Attachments: 8229-0.94.txt, 8229-0.94-V2.txt One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8253) A corrupted log blocked ReplicationSource
Jieshan Bean created HBASE-8253: --- Summary: A corrupted log blocked ReplicationSource Key: HBASE-8253 URL: https://issues.apache.org/jira/browse/HBASE-8253 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean A log being written got corrupted when we forcibly powered down one node. Only part of the last WALEdit was written into that log. And that log was not the last one in the replication queue. ReplicationSource was blocked under this scenario. A lot of logs like the below were printed: {noformat} 2013-03-30 06:53:48,628 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:334) java.io.EOFException: hdfs://hacluster/hbase/.logs/master11,26003,1364530862620/master11%2C26003%2C1364530862620.1364553936510, entryStart=40434738, pos=40450048, end=40450048, edit=0 at sun.reflect.GeneratedConstructorAccessor42.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:295) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:240) at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:84) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:412) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:330) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2282) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2181) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2227) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:238) ... 3 more .. 
2013-03-30 06:54:38,899 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:334) java.io.EOFException: hdfs://hacluster/hbase/.logs/master11,26003,1364530862620/master11%2C26003%2C1364530862620.1364553936510, entryStart=40434738, pos=40450048, end=40450048, edit=0 at sun.reflect.GeneratedConstructorAccessor42.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:295) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:240) at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:84) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:412) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:330) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2282) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2181) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2227) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:238) ... 3 more ... {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8253) A corrupted log blocked ReplicationSource
[ https://issues.apache.org/jira/browse/HBASE-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8253: Attachment: HBASE-8253-94.patch Patch for discussion. In ReplicationSource#readAllEntriesToReplicateOrNextFile, only the read of the first edit may throw EOF. So when we get EOF, currentNbEntries should be 0. There is no other case. Please correct me if I am wrong. A corrupted log blocked ReplicationSource - Key: HBASE-8253 URL: https://issues.apache.org/jira/browse/HBASE-8253 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8253-94.patch A log being written got corrupted when we forcibly powered down one node. Only part of the last WALEdit was written into that log. And that log was not the last one in the replication queue. ReplicationSource was blocked under this scenario. A lot of logs like the below were printed: {noformat} 2013-03-30 06:53:48,628 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:334) java.io.EOFException: hdfs://hacluster/hbase/.logs/master11,26003,1364530862620/master11%2C26003%2C1364530862620.1364553936510, entryStart=40434738, pos=40450048, end=40450048, edit=0 at sun.reflect.GeneratedConstructorAccessor42.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:295) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:240) at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:84) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:412) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:330) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2282) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2181) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2227) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:238) ... 3 more .. 
2013-03-30 06:54:38,899 WARN [regionserver26003-EventThread.replicationSource,1] 1 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:334) java.io.EOFException: hdfs://hacluster/hbase/.logs/master11,26003,1364530862620/master11%2C26003%2C1364530862620.1364553936510, entryStart=40434738, pos=40450048, end=40450048, edit=0 at sun.reflect.GeneratedConstructorAccessor42.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:295) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:240) at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:84) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:412) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:330) Caused by: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2282) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2181) at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2227) at
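A minimal sketch in Java of the EOF handling argued for in the HBASE-8253 patch comment above: an EOF with zero entries read from a log that is not the last one in the queue marks a corrupted tail, so the source should advance rather than spin. Helper and parameter names are illustrative.
{code}
// Sketch only: decide whether a corrupted, already-rolled log can be skipped.
boolean skipCorruptedLog(int currentNbEntries, boolean isLastLogInQueue) {
  // Per the comment above, only the read of the first edit can throw EOF,
  // so currentNbEntries is expected to be 0 when we land here.
  if (currentNbEntries == 0 && !isLastLogInQueue) {
    return true;   // corrupted tail of a finished log: move to the next log
  }
  return false;    // the last log may still be growing: keep retrying
}
{code}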
[jira] [Commented] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620703#comment-13620703 ] Jieshan Bean commented on HBASE-8251: - [~rajesh23] I don't think so, the Master was blocked at AM#processRegionInTransitionAndBlockUntilAssigned: {code} // Work on meta region status.setStatus("Assigning META region"); rit = this.assignmentManager.processRegionInTransitionAndBlockUntilAssigned(HRegionInfo.FIRST_META_REGIONINFO); boolean metaRegionLocation = this.catalogTracker.verifyMetaRegionLocation(timeout); {code} enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8251: Attachment: HBASE-8251-94.patch Patch for 94. enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8251) enable SSH before assign META on Master startup
[ https://issues.apache.org/jira/browse/HBASE-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621535#comment-13621535 ] Jieshan Bean commented on HBASE-8251: - [~ted_yu] Sorry, we will be on vacation before April 7th. So I can only submit the new patch on the 7th:). enable SSH before assign META on Master startup --- Key: HBASE-8251 URL: https://issues.apache.org/jira/browse/HBASE-8251 Project: HBase Issue Type: Bug Components: master, Region Assignment Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8251-94.patch I think HBASE-5918 could not fix this issue. In HMaster#assignRootAndMeta: 1. Assign ROOT. 2. Block until ROOT is opened. 3. Assign META. 4. Block until META is opened. SSH is enabled after step 4. So if the RS that hosts ROOT dies before step 4, the master will be blocked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13621541#comment-13621541 ] Jieshan Bean commented on HBASE-8230: - [~ted_yu] Do you have any other comments on this issue? Thank you. Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch, HBASE-8230-trunk.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619722#comment-13619722 ] Jieshan Bean commented on HBASE-8229: - bq.The idea is to return back into the run() loop of ReplicationSource, so that the edits are rechecked (and not shipped to the peer if the local table's status has changed). I didn't see this re-check done anywhere; hope I misread the code:). Even if the local table's replication status has been changed, ReplicationSource still has the responsibility to replicate all the edits from before the table got changed, right? So I prefer not to return back directly. Just let it retry and sleep until that table is created. Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Components: Replication Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8229-0.94.txt, 8229-0.94-V2.txt One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13620499#comment-13620499 ] Jieshan Bean commented on HBASE-8229: - bq.Wouldn't it recreate the set of edits to ship in readAllEntriesToReplicateOrNextFile(...) called from run(). Yes, it will read and recreate the set again. But it's the same set as the previous one. The current logic in removeNonReplicableEdits only checks the scope property owned by the edit itself, not the table scope. Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Components: Replication Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: 8229-0.94.txt, 8229-0.94-V2.txt One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
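A minimal sketch in Java of the per-edit filtering described above: the decision is driven entirely by the replication scope recorded with the edit, so nothing in this path notices whether the sink table exists. The types follow the 0.94-era API loosely; treat the exact signature as an assumption.
{code}
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.KeyValue;

// Sketch: keep an edit only if its column family carries a scope > 0.
// The existence of the table on the peer side is never consulted here.
public class ScopeFilterSketch {
  void removeNonReplicableEdits(List<KeyValue> kvs,
                                NavigableMap<byte[], Integer> scopes) {
    Iterator<KeyValue> it = kvs.iterator();
    while (it.hasNext()) {
      Integer scope = scopes.get(it.next().getFamily());
      if (scope == null || scope.intValue() == 0) {
        it.remove();   // family not replicated: drop the edit
      }
    }
  }
}
{code}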
[jira] [Commented] (HBASE-8213) global authorization may lose efficacy
[ https://issues.apache.org/jira/browse/HBASE-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618693#comment-13618693 ] Jieshan Bean commented on HBASE-8213: - [~apurtell] Thank you for the trunk patch. I planned to do that after review:) global authorization may lose efficacy --- Key: HBASE-8213 URL: https://issues.apache.org/jira/browse/HBASE-8213 Project: HBase Issue Type: Bug Components: security Affects Versions: 0.95.0, 0.96.0, 0.94.7 Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical Attachments: HBASE-8213-94.patch, HBASE-8213-trunk.patch It depends on the order in which regions are opened. Suppose we have 1 regionserver and only 1 user region REGION-A on this server; the _acl_ region was on another regionserver. _acl_ was opened a few seconds before REGION-A. The global authorization data read from ZooKeeper was overwritten by the data read from the configuration. {code} private TableAuthManager(ZooKeeperWatcher watcher, Configuration conf) throws IOException { this.conf = conf; this.zkperms = new ZKPermissionWatcher(watcher, this, conf); try { // Read global authorization data from zookeeper. this.zkperms.start(); } catch (KeeperException ke) { LOG.error("ZooKeeper initialization failed", ke); } // It will overwrite globalCache. // initialize global permissions based on configuration globalCache = initGlobal(conf); } {code} This issue can be easily reproduced by the steps below: 1. Start a cluster with 3 regionservers. 2. Create a new table T1. 3. Grant a new user USER-A global authorization. 4. Kill 1 regionserver RS3 and switch the balancer off. 5. Start regionserver RS3. 6. Assign region T1 to RS3. 7. Put data with user USER-A. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
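One possible fix direction, sketched in Java: initialize the global cache from configuration before starting the ZooKeeper watcher, so ZooKeeper data lands on top of the defaults instead of being overwritten. This illustrates the shape of the bug under that assumption and is not necessarily the committed HBASE-8213 patch.
{code}
// Sketch only: reorder initialization so ZK permissions are never clobbered.
private TableAuthManagerSketch(ZooKeeperWatcher watcher, Configuration conf)
    throws IOException {
  this.conf = conf;
  // Initialize global permissions from configuration *first* ...
  globalCache = initGlobal(conf);
  this.zkperms = new ZKPermissionWatcher(watcher, this, conf);
  try {
    // ... so that authorization data read from ZooKeeper refreshes the
    // cache rather than being discarded by a later initGlobal(conf).
    this.zkperms.start();
  } catch (KeeperException ke) {
    LOG.error("ZooKeeper initialization failed", ke);
  }
}
{code}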
[jira] [Commented] (HBASE-8213) global authorization may lose efficacy
[ https://issues.apache.org/jira/browse/HBASE-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619386#comment-13619386 ] Jieshan Bean commented on HBASE-8213: - Thank you for the review, [~apurtell][~yuzhih...@gmail.com]:) global authorization may lose efficacy --- Key: HBASE-8213 URL: https://issues.apache.org/jira/browse/HBASE-8213 Project: HBase Issue Type: Bug Components: security Affects Versions: 0.95.0, 0.96.0, 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: HBASE-8213-94.patch, HBASE-8213-trunk.patch It depends on the order in which regions are opened. Suppose we have 1 regionserver and only 1 user region REGION-A on this server; the _acl_ region was on another regionserver. _acl_ was opened a few seconds before REGION-A. The global authorization data read from ZooKeeper was overwritten by the data read from the configuration. {code} private TableAuthManager(ZooKeeperWatcher watcher, Configuration conf) throws IOException { this.conf = conf; this.zkperms = new ZKPermissionWatcher(watcher, this, conf); try { // Read global authorization data from zookeeper. this.zkperms.start(); } catch (KeeperException ke) { LOG.error("ZooKeeper initialization failed", ke); } // It will overwrite globalCache. // initialize global permissions based on configuration globalCache = initGlobal(conf); } {code} This issue can be easily reproduced by the steps below: 1. Start a cluster with 3 regionservers. 2. Create a new table T1. 3. Grant a new user USER-A global authorization. 4. Kill 1 regionserver RS3 and switch the balancer off. 5. Start regionserver RS3. 6. Assign region T1 to RS3. 7. Put data with user USER-A. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8230: Attachment: HBASE-8230-trunk.patch Patch for trunk. Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch, HBASE-8230-trunk.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8230: Status: Patch Available (was: Open) Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch, HBASE-8230-trunk.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13619487#comment-13619487 ] Jieshan Bean commented on HBASE-8229: - Yes, it's really a good idea. Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Components: Replication Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618549#comment-13618549 ] Jieshan Bean commented on HBASE-8230: - bq.Did the failure happen when region server restarted ? Yes. bq.If this was repeatable, I would suggest finding the root cause. The root cause in our env was that the NameNode was in safe mode: {noformat} 2013-03-29 10:32:42,260 FATAL [regionserver26003] ABORTING region server om-host2,26003,1364524173470: Unhandled exception: cannot get log writer org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1737) java.io.IOException: cannot get log writer at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:757) at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:701) at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:637) at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:582) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:436) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:362) at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateHLog(HRegionServer.java:1327) at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1316) at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1030) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:706) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/hbase/.logs/om-host2,26003,1364524173470/om-host2%2C26003%2C1364524173470.1364524361366. Name node is in safe mode. The reported blocks 14 has reached the threshold 0.9990 of total blocks 14. Safe mode will be turned off automatically in 21 seconds. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1601) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1547) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:412) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:204) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:43664) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1710) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1706) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1704) at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:209) at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:754) ... 
10 more {noformat} Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8229: Component/s: Replication Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Components: Replication Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618559#comment-13618559 ] Jieshan Bean commented on HBASE-8229: - bq.For this issue, I'll just add the same waiting we do when the peer is down (which is the same logical behavior we currently have, but without the insane busy retrying). +1 Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Components: Replication Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618003#comment-13618003 ] Jieshan Bean commented on HBASE-8229: - I suggest letting ReplicationSource wait if a replicated table is not present, like the scenario where the peer cluster is unavailable. Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, over a few days, that filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
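A minimal sketch in Java of the waiting behavior suggested above, reusing the sleep-with-backoff pattern ReplicationSource already applies when a peer is unavailable. The surrounding loop, isActive, and shipEdits are stand-ins; sleepForRetries mirrors the contract of the existing method of that name.
{code}
// Sketch only: back off on a missing sink table instead of hot-retrying
// every few milliseconds and flooding the log.
int sleepMultiplier = 1;
while (isActive()) {                 // hypothetical "source still running" check
  try {
    shipEdits();                     // hypothetical: replicate the current batch
    break;
  } catch (TableNotFoundException e) {
    LOG.warn("Sink table missing; waiting for it to be created", e);
    // sleepForRetries returns true while the multiplier may still grow,
    // giving a bounded backoff rather than an unbounded busy loop.
    if (sleepForRetries("Table not found on peer", sleepMultiplier)) {
      sleepMultiplier++;
    }
  }
}
{code}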
[jira] [Updated] (HBASE-8213) global authorization may lose efficacy
[ https://issues.apache.org/jira/browse/HBASE-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8213: Attachment: HBASE-8213-94.patch Patch for 94. global authorization may lose efficacy --- Key: HBASE-8213 URL: https://issues.apache.org/jira/browse/HBASE-8213 Project: HBase Issue Type: Bug Components: security Affects Versions: 0.95.0, 0.96.0, 0.94.7 Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical Attachments: HBASE-8213-94.patch It depends on the order in which regions are opened. Suppose we have 1 regionserver and only 1 user region REGION-A on this server; the _acl_ region was on another regionserver. _acl_ was opened a few seconds before REGION-A. The global authorization data read from ZooKeeper was overwritten by the data read from the configuration. {code} private TableAuthManager(ZooKeeperWatcher watcher, Configuration conf) throws IOException { this.conf = conf; this.zkperms = new ZKPermissionWatcher(watcher, this, conf); try { // Read global authorization data from zookeeper. this.zkperms.start(); } catch (KeeperException ke) { LOG.error("ZooKeeper initialization failed", ke); } // It will overwrite globalCache. // initialize global permissions based on configuration globalCache = initGlobal(conf); } {code} This issue can be easily reproduced by the steps below: 1. Start a cluster with 3 regionservers. 2. Create a new table T1. 3. Grant a new user USER-A global authorization. 4. Kill 1 regionserver RS3 and switch the balancer off. 5. Start regionserver RS3. 6. Assign region T1 to RS3. 7. Put data with user USER-A. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
Jieshan Bean created HBASE-8230: --- Summary: Possible NPE on regionserver abort if replication service has not been started Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8230: Attachment: HBASE-8230-94.patch Patch for 94. Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Attachments: HBASE-8230-94.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-8230: --- Assignee: Jieshan Bean Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1936) ClassLoader that loads from hdfs; useful adding filters to classpath without having to restart services
[ https://issues.apache.org/jira/browse/HBASE-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618019#comment-13618019 ] Jieshan Bean commented on HBASE-1936: - Sorry, I don’t have enough time to finish it currently. Please feel free to take it over if you are interested in it:) ClassLoader that loads from hdfs; useful adding filters to classpath without having to restart services --- Key: HBASE-1936 URL: https://issues.apache.org/jira/browse/HBASE-1936 Project: HBase Issue Type: New Feature Reporter: stack Assignee: Jieshan Bean Labels: noob Attachments: cp_from_hdfs.patch, HBASE-1936-trunk(forReview).patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-1936) ClassLoader that loads from hdfs; useful adding filters to classpath without having to restart services
[ https://issues.apache.org/jira/browse/HBASE-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-1936: --- Assignee: (was: Jieshan Bean) ClassLoader that loads from hdfs; useful adding filters to classpath without having to restart services --- Key: HBASE-1936 URL: https://issues.apache.org/jira/browse/HBASE-1936 Project: HBase Issue Type: New Feature Reporter: stack Labels: noob Attachments: cp_from_hdfs.patch, HBASE-1936-trunk(forReview).patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618054#comment-13618054 ] Jieshan Bean commented on HBASE-8230: - Here's the exception: {noformat} 2013-03-29 10:32:42,251 INFO [regionserver26003] STOPPED: Failed initialization org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1665) 2013-03-29 10:32:42,253 ERROR [regionserver26003] Failed init org.apache.hadoop.hbase.regionserver.HRegionServer.cleanup(HRegionServer.java:1161) java.io.IOException: cannot get log writer at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:757) at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:701) at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:637) at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:582) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:436) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:362) at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateHLog(HRegionServer.java:1327) at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1316) at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1030) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:706) at java.lang.Thread.run(Thread.java:662) {noformat} Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch RegionServer got an exception on calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the below exception: {noformat} Exception in thread regionserver26003 java.lang.NullPointerException at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129) at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120) at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834) at java.lang.Thread.run(Thread.java:662) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8230) Possible NPE on regionserver abort if replication service has not been started
[ https://issues.apache.org/jira/browse/HBASE-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618204#comment-13618204 ] Jieshan Bean commented on HBASE-8230: - Any exception that occurs before startServiceThreads may cause this NPE, right? So what caused the log writer creation failure is not the key point, I think. Possible NPE on regionserver abort if replication service has not been started -- Key: HBASE-8230 URL: https://issues.apache.org/jira/browse/HBASE-8230 Project: HBase Issue Type: Bug Components: regionserver, Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Attachments: HBASE-8230-94.patch The RegionServer got an exception while calling setupWALAndReplication, so it entered the abort flow. Since replicationSink had not been initialized yet, we got the exception below:
{noformat}
Exception in thread regionserver26003 java.lang.NullPointerException
 at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:129)
 at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:120)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.join(HRegionServer.java:1803)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:834)
 at java.lang.Thread.run(Thread.java:662)
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
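One way to avoid this NPE (a minimal sketch based on the stack trace above; the field name is an assumption, and the attached patch may differ) is to make the shutdown path tolerate an abort that happens before the replication service was ever started:
{code}
// Replication.join(): nothing was started, so there is nothing to stop.
public void join() {
  if (this.replicationSink == null) {
    return;  // regionserver aborted before the replication service came up
  }
  // ... stop replication sources and the sink as before ...
}
{code}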
[jira] [Commented] (HBASE-8229) Replication code logs like crazy if a target table cannot be found.
[ https://issues.apache.org/jira/browse/HBASE-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13618207#comment-13618207 ] Jieshan Bean commented on HBASE-8229: - bq. But can we simply wait again and again until the table is created on the other side? I'm afraid we should do that, unless we add a mechanism to check whether a table has already been deleted. But I think ReplicationSource still has the responsibility to ship all the remaining edits; any skip may cause data loss. I think the most probable scenario for this problem is forgetting to create the table on the sink side. bq. At some point, if there is any failure, we will still miss the edits. [~jmspaggi] Can you show me one scenario? :) Replication code logs like crazy if a target table cannot be found. --- Key: HBASE-8229 URL: https://issues.apache.org/jira/browse/HBASE-8229 Project: HBase Issue Type: Bug Reporter: Lars Hofhansl Fix For: 0.95.0, 0.98.0, 0.94.7 One of our RS/DN machines ran out of disk space on the partition to which we write the log files. It turns out we still had a table in our source cluster with REPLICATION_SCOPE=1 that did not have a matching table in the remote cluster. It then logged a long stack trace every 50ms or so, which over a few days filled up our log partition. Since ReplicationSource cannot make any progress in this case anyway, it should probably sleep a bit before retrying (or at least limit the rate at which it spews out these exceptions to the log). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
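The rate limiting suggested in the description could look roughly like this (a hedged sketch with hypothetical names; ReplicationSource already has a sleepForRetries-style helper, and the committed fix may differ):
{code}
// Sketch: back off with a capped, growing sleep between retries instead
// of logging the same failure every ~50ms.
private void sleepBeforeRetry(int attempt) {
  long baseMs = 1000L;
  long capMs = 60 * 1000L;
  long sleepMs = Math.min(baseMs * (attempt + 1), capMs);
  try {
    Thread.sleep(sleepMs);
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
  }
}
{code}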
[jira] [Commented] (HBASE-7122) Proper warning message when opening a log file with no entries (idle cluster)
[ https://issues.apache.org/jira/browse/HBASE-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616115#comment-13616115 ] Jieshan Bean commented on HBASE-7122: - [~himan...@cloudera.com] There's one possible problem in this patch. Suppose we have several logs in the recovered queue and one of them is empty; this change will hang the ReplicationSource thread, which will keep re-opening the empty log. Proper warning message when opening a log file with no entries (idle cluster) - Key: HBASE-7122 URL: https://issues.apache.org/jira/browse/HBASE-7122 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.2 Reporter: Himanshu Vashishtha Assignee: Himanshu Vashishtha Fix For: 0.95.0 Attachments: HBase-7122-94.patch, HBase-7122.patch, HBASE-7122.v2.patch In case the cluster is idle and the log has rolled (offset at 0), replicationSource tries to open the log and gets an EOF exception. This gets printed every 10 sec until an entry is inserted in it.
{code}
2012-11-07 15:47:40,924 DEBUG regionserver.ReplicationSource (ReplicationSource.java:openReader(487)) - Opening log for replication c0315.hal.cloudera.com%2C40020%2C1352324202860.1352327804874 at 0
2012-11-07 15:47:40,926 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(543)) - 1 Got: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at java.io.DataInputStream.readFully(DataInputStream.java:152)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
 at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55)
 at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
 at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:491)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:290)
2012-11-07 15:47:40,927 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(547)) - Waited too long for this file, considering dumping
2012-11-07 15:47:40,927 DEBUG regionserver.ReplicationSource (ReplicationSource.java:sleepForRetries(562)) - Unable to open a reader, sleeping 1000 times 10
{code}
We should reduce the log spewing in this case (or print a more informative message, based on the offset). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
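One way to avoid that hang (a hedged sketch, not the committed change; the method and parameter names are assumptions) is to treat an empty log specially only when the queue is recovered and more logs are waiting behind it:
{code}
// Sketch: if this is a recovered queue and the current log is empty but
// other logs are queued behind it, move on instead of re-opening forever.
private boolean shouldSkipEmptyLog(boolean queueRecovered, long position,
    int logQueueSize) {
  return queueRecovered && position == 0 && logQueueSize > 0;
}
{code}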
[jira] [Created] (HBASE-8212) Introduce a new separator instead of hyphen('-') for renaming recovered queues' znodes
Jieshan Bean created HBASE-8212: --- Summary: Introduce a new separator instead of hyphen('-') for renaming recovered queues' znodes Key: HBASE-8212 URL: https://issues.apache.org/jira/browse/HBASE-8212 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Fix For: 0.96.0, 0.94.7 A hyphen is frequently used in hostnames. Say we have one regionserver named 160-172-0-1; under this scenario, 160-172-0-1 will be split into 4 strings and considered as 4 possible dead servers. Replication won't find all the logs for 160-172-0-1 any more, which causes data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
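To illustrate the failure mode (a simplified sketch; the real parsing in ReplicationSource.checkIfQueueRecovered is more involved than a plain split):
{code}
// A recovered-queue znode name is "<peer id>-<dead server name>", but the
// dead server name itself may contain hyphens, so splitting on '-'
// shatters it into bogus dead-server candidates.
String znode = "1-160-172-0-1,60020,1364199883591";
String[] parts = znode.split("-");
// parts = ["1", "160", "172", "0", "1,60020,1364199883591"]
// -> several "possible dead servers" are derived instead of one, and the
//    log dir of the real server 160-172-0-1 is never found (data loss).
{code}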
[jira] [Commented] (HBASE-8212) Introduce a new separator instead of hyphen('-') for renaming recovered queues' znodes
[ https://issues.apache.org/jira/browse/HBASE-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616131#comment-13616131 ] Jieshan Bean commented on HBASE-8212: - Ya... sorry, I didn't see that :(. It's the same issue. Introduce a new separator instead of hyphen('-') for renaming recovered queues' znodes -- Key: HBASE-8212 URL: https://issues.apache.org/jira/browse/HBASE-8212 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Fix For: 0.96.0, 0.94.7 A hyphen is frequently used in hostnames. Say we have one regionserver named 160-172-0-1; under this scenario, 160-172-0-1 will be split into 4 strings and considered as 4 possible dead servers. Replication won't find all the logs for 160-172-0-1 any more, which causes data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-8207: Attachment: HBASE-8212-94.patch I have finished one patch which uses '#' instead of '-'. Sorry, I raised the same issue but didn't notice this one was already there :( Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, HBASE-8212-94.patch In the recent TestReplication* test case failures, I'm finally able to find the cause (or one of the causes) of the intermittent failures. When a machine name contains '-', it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: the deadRegionServers list is way off, so replication doesn't wait for log splitting to finish for a WAL file and moves on to the next one (data loss). You can see that replication uses those weird paths constructed from deadRegionServers to check for a file's existence:
{code}
2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
{code}
This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for
{code}
File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
{code}
After 10 retries, the replication source gave up and moved on to the next file. Data loss happens. Since lots of EC2 machine names contain '-', including our Jenkins servers, this is a high-impact issue. -- This message is automatically generated by JIRA. If you
[jira] [Resolved] (HBASE-8212) Introduce a new separator instead of hyphen('-') for renaming recovered queues' znodes
[ https://issues.apache.org/jira/browse/HBASE-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean resolved HBASE-8212. - Resolution: Duplicate Introduce a new separator instead of hyphen('-') for renaming recovered queues' znodes -- Key: HBASE-8212 URL: https://issues.apache.org/jira/browse/HBASE-8212 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.6 Reporter: Jieshan Bean Assignee: Jieshan Bean Fix For: 0.96.0, 0.94.7 A hyphen is frequently used in hostnames. Say we have one regionserver named 160-172-0-1; under this scenario, 160-172-0-1 will be split into 4 strings and considered as 4 possible dead servers. Replication won't find all the logs for 160-172-0-1 any more, which causes data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616141#comment-13616141 ] Jieshan Bean commented on HBASE-8207: - We found the same problem in our test environment, attaching the logs for your reference:
{noformat}
2013-03-25 04:51:20,929 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] NB dead servers : 4 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:517)
2013-03-25 04:51:20,929 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/130,60020,1364199883591/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,930 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/130,60020,1364199883591-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,932 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/0/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,934 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/0-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,935 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/172/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,937 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/172-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,938 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/160/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,939 INFO [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/160-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
2013-03-25 04:51:20,941 WARN [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 1-160-172-0-130,60020,1364199883591 Got: org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
java.io.IOException: File from recovered queue is nowhere to be found
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:545)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:311)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hacluster/hbase/.oldlogs/160-172-0-130%2C60020%2C1364199883591.1364200564291
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1692)
 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1716)
 at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55)
 at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
 at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:511)
 ... 1 more
{noformat}
Replication could have data loss when machine name contains hyphen -
[jira] [Created] (HBASE-8213) global authorization may lose efficacy
Jieshan Bean created HBASE-8213: --- Summary: global authorization may lose efficacy Key: HBASE-8213 URL: https://issues.apache.org/jira/browse/HBASE-8213 Project: HBase Issue Type: Bug Reporter: Jieshan Bean Priority: Critical It depends on the order in which the regions are opened. Suppose we have only 1 regionserver with only 1 user region REGION-A on this server, and the _acl_ region is on another regionserver. _acl_ was opened a few seconds before REGION-A. The global authorization data read from ZooKeeper was overwritten by the data read from the configuration.
{code}
private TableAuthManager(ZooKeeperWatcher watcher, Configuration conf)
    throws IOException {
  this.conf = conf;
  this.zkperms = new ZKPermissionWatcher(watcher, this, conf);
  try {
    // Read global authorization data from zookeeper.
    this.zkperms.start();
  } catch (KeeperException ke) {
    LOG.error("ZooKeeper initialization failed", ke);
  }
  // It will overwrite globalCache.
  // initialize global permissions based on configuration
  globalCache = initGlobal(conf);
}
{code}
This issue can easily be reproduced by the steps below: 1. Start a cluster with 3 regionservers. 2. Create a new table T1. 3. Grant a new user USER-A global authorization. 4. Kill regionserver RS3 and switch the balancer off. 5. Start regionserver RS3. 6. Assign region T1 to RS3. 7. Put data with user USER-A. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
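One possible direction (a minimal sketch that simply reorders the quoted constructor; the actual patch may differ): seed globalCache from the configuration first, so the data the ZK watcher delivers is applied on top instead of being wiped:
{code}
private TableAuthManager(ZooKeeperWatcher watcher, Configuration conf)
    throws IOException {
  this.conf = conf;
  // Initialize global permissions from configuration *before* starting
  // the watcher, so ZooKeeper-sourced grants are not overwritten.
  globalCache = initGlobal(conf);
  this.zkperms = new ZKPermissionWatcher(watcher, this, conf);
  try {
    this.zkperms.start();
  } catch (KeeperException ke) {
    LOG.error("ZooKeeper initialization failed", ke);
  }
}
{code}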
[jira] [Assigned] (HBASE-8213) global authorization may lose efficacy
[ https://issues.apache.org/jira/browse/HBASE-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-8213: --- Assignee: Jieshan Bean global authorization may lose efficacy --- Key: HBASE-8213 URL: https://issues.apache.org/jira/browse/HBASE-8213 Project: HBase Issue Type: Bug Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical It depends on the order in which the regions are opened. Suppose we have only 1 regionserver with only 1 user region REGION-A on this server, and the _acl_ region is on another regionserver. _acl_ was opened a few seconds before REGION-A. The global authorization data read from ZooKeeper was overwritten by the data read from the configuration.
{code}
private TableAuthManager(ZooKeeperWatcher watcher, Configuration conf)
    throws IOException {
  this.conf = conf;
  this.zkperms = new ZKPermissionWatcher(watcher, this, conf);
  try {
    // Read global authorization data from zookeeper.
    this.zkperms.start();
  } catch (KeeperException ke) {
    LOG.error("ZooKeeper initialization failed", ke);
  }
  // It will overwrite globalCache.
  // initialize global permissions based on configuration
  globalCache = initGlobal(conf);
}
{code}
This issue can easily be reproduced by the steps below: 1. Start a cluster with 3 regionservers. 2. Create a new table T1. 3. Grant a new user USER-A global authorization. 4. Kill regionserver RS3 and switch the balancer off. 5. Start regionserver RS3. 6. Assign region T1 to RS3. 7. Put data with user USER-A. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8207) Replication could have data loss when machine name contains hyphen -
[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13617056#comment-13617056 ] Jieshan Bean commented on HBASE-8207: - The new patch also looks good to me. Is it necessary to add restrictions on the peer-id when calling add_peer? Replication could have data loss when machine name contains hyphen - -- Key: HBASE-8207 URL: https://issues.apache.org/jira/browse/HBASE-8207 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.95.0, 0.94.6 Reporter: Jeffrey Zhong Assignee: Jeffrey Zhong Priority: Critical Fix For: 0.95.0, 0.98.0, 0.94.7 Attachments: failed.txt, hbase-8207-0.94-v1.patch, hbase-8207.patch, hbase-8207_v1.patch, hbase-8207_v2.patch, hbase-8207_v2.patch, HBASE-8212-94.patch In the recent TestReplication* test case failures, I'm finally able to find the cause (or one of the causes) of the intermittent failures. When a machine name contains '-', it breaks the function ReplicationSource.checkIfQueueRecovered. It causes the following issue: the deadRegionServers list is way off, so replication doesn't wait for log splitting to finish for a WAL file and moves on to the next one (data loss). You can see that replication uses those weird paths constructed from deadRegionServers to check for a file's existence:
{code}
2013-03-26 21:26:51,385 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,386 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,387 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,389 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,391 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,394 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,396 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
2013-03-26 21:26:51,398 INFO [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
{code}
This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false Search for
{code}
File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
{code}
After 10 retries, the replication source gave up and moved on to the next file. Data loss happens. Since lots of EC2 machine names contain '-', including our Jenkin
[jira] [Commented] (HBASE-8104) HBase consistency and availability after replication
[ https://issues.apache.org/jira/browse/HBASE-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608509#comment-13608509 ] Jieshan Bean commented on HBASE-8104: - I think so, this is the only way we can do that. But I don't think we really need that. HBase consistency and availability after replication Key: HBASE-8104 URL: https://issues.apache.org/jira/browse/HBASE-8104 Project: HBase Issue Type: New Feature Affects Versions: 0.94.3 Reporter: Brian Fu Priority: Critical Original Estimate: 336h Remaining Estimate: 336h HBase consistency and availability after replication. The scenario is as follows: 1. There are two HBase clusters, a master cluster and a slave cluster, and replication between the two clusters is enabled. 2. If the master cluster has problems, all write and read requests switch to the slave cluster. 3. After a period of time, we need to switch back to the master cluster; part of the data will then be inconsistent, which makes that part of the data unavailable. This feature is particularly important for HBase clusters providing online services. So I want to keep the data consistent through a write-back program and thereby improve HBase availability. We will provide a patch for this function. -- This message was sent by Atlassian JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7750) We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null
[ https://issues.apache.org/jira/browse/HBASE-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13572239#comment-13572239 ] Jieshan Bean commented on HBASE-7750: - bq. Also on the sink side we could just print a message saying that someone tried to replicate to us and weren't able to accept the edits. I agree. The sink side should print this warning. The source side needs to handle this exception; otherwise it keeps calling shipEdits without sleeping. I will submit a patch after verification. We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null - Key: HBASE-7750 URL: https://issues.apache.org/jira/browse/HBASE-7750 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean It may be expected behavior, but I think it's better to do something. We configured hbase.replication as true in the master cluster and added a peer, but forgot to configure hbase.replication on the slave cluster side. ReplicationSource read the HLog, shipped log edits, and logged the position. Everything seemed alright, but the data was not present in the slave cluster. So I think the slave cluster should throw an exception back to the master cluster instead of returning directly:
{code}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) return;
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}
I would like to hear your comments on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
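The behavior proposed in the description would look roughly like this (a hedged sketch, not the committed patch; the exception message is illustrative):
{code}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) {
    // Fail fast so the source sees an error instead of silently losing edits.
    throw new IOException("Replication sink is not running on this "
        + "regionserver; is hbase.replication set to true on the slave cluster?");
  }
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}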
[jira] [Assigned] (HBASE-7750) We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null
[ https://issues.apache.org/jira/browse/HBASE-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-7750: --- Assignee: Jieshan Bean We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null - Key: HBASE-7750 URL: https://issues.apache.org/jira/browse/HBASE-7750 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean Assignee: Jieshan Bean It may be expected behavior, but I think it's better to do something. We configured hbase.replication as true in the master cluster and added a peer, but forgot to configure hbase.replication on the slave cluster side. ReplicationSource read the HLog, shipped log edits, and logged the position. Everything seemed alright, but the data was not present in the slave cluster. So I think the slave cluster should throw an exception back to the master cluster instead of returning directly:
{code}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) return;
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}
I would like to hear your comments on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-7750) We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null
Jieshan Bean created HBASE-7750: --- Summary: We should throw IOE when calling HRegionServer#replicateLogEntries if ReplicationSink is null Key: HBASE-7750 URL: https://issues.apache.org/jira/browse/HBASE-7750 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4, 0.96.0 Reporter: Jieshan Bean It may be expected behavior, but I think it's better to do something. We configured hbase.replication as true in the master cluster and added a peer, but forgot to configure hbase.replication on the slave cluster side. ReplicationSource read the HLog, shipped log edits, and logged the position. Everything seemed alright, but the data was not present in the slave cluster. So I think the slave cluster should throw an exception back to the master cluster instead of returning directly:
{code}
public void replicateLogEntries(final HLog.Entry[] entries) throws IOException {
  checkOpen();
  if (this.replicationSinkHandler == null) return;
  this.replicationSinkHandler.replicateLogEntries(entries);
}
{code}
I would like to hear your comments on this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7705) Can we make the method getCurrentPoolSize of HTablePool public?
[ https://issues.apache.org/jira/browse/HBASE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565320#comment-13565320 ] Jieshan Bean commented on HBASE-7705: - I think we can make it public. Can we make the method getCurrentPoolSize of HTablePool public? --- Key: HBASE-7705 URL: https://issues.apache.org/jira/browse/HBASE-7705 Project: HBase Issue Type: Wish Components: Client Affects Versions: 0.94.3 Reporter: cuijianwei Priority: Minor We use HTablePool to manage opened HTables in our applications. We want to track the usage of HTablePool for different table names, and we discovered that HTablePool#getCurrentPoolSize could help us:
{code}
int getCurrentPoolSize(String tableName) {
  return tables.size(tableName);
}
{code}
However, this method can only be called from within the hbase client package. Can we make this method public? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
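The requested change amounts to widening the visibility of the quoted method (a sketch of the request, not a committed patch):
{code}
// Public so code outside org.apache.hadoop.hbase.client can poll
// pool usage per table name.
public int getCurrentPoolSize(String tableName) {
  return tables.size(tableName);
}
{code}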
[jira] [Created] (HBASE-7694) HBASE-6165 has been out of action after enabling security
Jieshan Bean created HBASE-7694: --- Summary: HBASE-6165 has been out of action after enabling security Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.4, 0.96.0 Environment: Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. Reporter: Jieshan Bean -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-7694) HBASE-6165 has been out of action after enabling security
[ https://issues.apache.org/jira/browse/HBASE-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean reassigned HBASE-7694: --- Assignee: Jieshan Bean HBASE-6165 has been out of action after enabling security - Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Environment: Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. Reporter: Jieshan Bean Assignee: Jieshan Bean -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7694) HBASE-6165 has been out of action after enabling security
[ https://issues.apache.org/jira/browse/HBASE-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-7694: Description: Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. Environment: (was: Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security.) HBASE-6165 has been out of action after enabling security - Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean Assignee: Jieshan Bean Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
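For the design point in the description, a generic illustration (hypothetical names, not HBase's actual RPC code): give replication calls their own small handler pool so bulk shipEdits traffic cannot starve the handlers serving normal clients.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CallDispatcher {
  private final ExecutorService clientHandlers = Executors.newFixedThreadPool(30);
  private final ExecutorService replicationHandlers = Executors.newFixedThreadPool(3);

  // Route by call type instead of sharing one queue for everything.
  void dispatch(Runnable call, boolean isReplicationCall) {
    (isReplicationCall ? replicationHandlers : clientHandlers).execute(call);
  }
}
{code}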
[jira] [Commented] (HBASE-7694) HBASE-6165 has been out of action after enabling security
[ https://issues.apache.org/jira/browse/HBASE-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564216#comment-13564216 ] Jieshan Bean commented on HBASE-7694: - I will submit a patch after testing. HBASE-6165 has been out of action after enabling security - Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean Assignee: Jieshan Bean Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7694) HBASE-6165 has been out of action after enabling security
[ https://issues.apache.org/jira/browse/HBASE-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564917#comment-13564917 ] Jieshan Bean commented on HBASE-7694: - bq. Do you see your master cluster regionservers successfully connecting to the zk quorum and then region servers of the slave cluster? Yes. Master cluster regionservers can connect to the slave cluster successfully. We use the same KDC configurations and the same principals. HBASE-6165 has been out of action after enabling security - Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical Fix For: 0.96.0, 0.94.5 Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-7694) HBASE-6165 has been out of action after enabling security
[ https://issues.apache.org/jira/browse/HBASE-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-7694: Attachment: HBASE-7694-94.patch Patch for 94. HBASE-6165 has been out of action after enabling security - Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7694-94.patch Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7694) HBASE-6165 has been out of action after enabling security
[ https://issues.apache.org/jira/browse/HBASE-7694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564977#comment-13564977 ] Jieshan Bean commented on HBASE-7694: - Yes, I think so. HBASE-6165 has been out of action after enabling security - Key: HBASE-7694 URL: https://issues.apache.org/jira/browse/HBASE-7694 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.96.0, 0.94.4 Reporter: Jieshan Bean Assignee: Jieshan Bean Priority: Critical Fix For: 0.96.0, 0.94.5 Attachments: HBASE-7694-94.patch Replication should use an independent message queue; otherwise replication messages may occupy all the handlers. This feature was added in HBASE-6165, but became invalid after enabling security. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7324) Archive the logs instead of deletion after distributed splitting
[ https://issues.apache.org/jira/browse/HBASE-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540697#comment-13540697 ] Jieshan Bean commented on HBASE-7324: - Sorry, I misread the code. It's not a problem. Archive the logs instead of deletion after distributed splitting Key: HBASE-7324 URL: https://issues.apache.org/jira/browse/HBASE-7324 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.94.3, 0.96.0 Reporter: Jieshan Bean We should always move the logs to .oldlogs instead of deleting them directly. This bug may cause data loss if replication is enabled. The code below is extracted from SplitLogManager#splitLogDistributed:
{code}
for (Path logDir : logDirs) {
  status.setStatus("Cleaning up log directory...");
  try {
    if (fs.exists(logDir) && !fs.delete(logDir, false)) {
      LOG.warn("Unable to delete log src dir. Ignoring. " + logDir);
    }
  } catch (IOException ioe) {
    FileStatus[] files = fs.listStatus(logDir);
    if (files != null && files.length > 0) {
      LOG.warn("returning success without actually splitting and "
          + "deleting all the log files in path " + logDir);
    } else {
      LOG.warn("Unable to delete log src dir. Ignoring. " + logDir, ioe);
    }
  }
  tot_mgr_log_split_batch_success.incrementAndGet();
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HBASE-7324) Archive the logs instead of deletion after distributed splitting
[ https://issues.apache.org/jira/browse/HBASE-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean resolved HBASE-7324. - Resolution: Invalid Archive the logs instead of deletion after distributed splitting Key: HBASE-7324 URL: https://issues.apache.org/jira/browse/HBASE-7324 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.94.3, 0.96.0 Reporter: Jieshan Bean We should always move the logs to .oldlogs instead of deleting them directly. This bug may cause data loss if replication is enabled. The code below is extracted from SplitLogManager#splitLogDistributed:
{code}
for (Path logDir : logDirs) {
  status.setStatus("Cleaning up log directory...");
  try {
    if (fs.exists(logDir) && !fs.delete(logDir, false)) {
      LOG.warn("Unable to delete log src dir. Ignoring. " + logDir);
    }
  } catch (IOException ioe) {
    FileStatus[] files = fs.listStatus(logDir);
    if (files != null && files.length > 0) {
      LOG.warn("returning success without actually splitting and "
          + "deleting all the log files in path " + logDir);
    } else {
      LOG.warn("Unable to delete log src dir. Ignoring. " + logDir, ioe);
    }
  }
  tot_mgr_log_split_batch_success.incrementAndGet();
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-7324) Archive the logs instead of deletion after distributed splitting
Jieshan Bean created HBASE-7324: --- Summary: Archive the logs instead of deletion after distributed splitting Key: HBASE-7324 URL: https://issues.apache.org/jira/browse/HBASE-7324 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.94.3, 0.96.0 Reporter: Jieshan Bean We should always move the logs to .oldlogs instead of deleting them directly. This bug may cause data loss if replication is enabled. The code below is extracted from SplitLogManager#splitLogDistributed:
{code}
for (Path logDir : logDirs) {
  status.setStatus("Cleaning up log directory...");
  try {
    if (fs.exists(logDir) && !fs.delete(logDir, false)) {
      LOG.warn("Unable to delete log src dir. Ignoring. " + logDir);
    }
  } catch (IOException ioe) {
    FileStatus[] files = fs.listStatus(logDir);
    if (files != null && files.length > 0) {
      LOG.warn("returning success without actually splitting and "
          + "deleting all the log files in path " + logDir);
    } else {
      LOG.warn("Unable to delete log src dir. Ignoring. " + logDir, ioe);
    }
  }
  tot_mgr_log_split_batch_success.incrementAndGet();
}
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7313) ColumnPaginationFilter should reset count when moving to NEXT_ROW
[ https://issues.apache.org/jira/browse/HBASE-7313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13528581#comment-13528581 ] Jieshan Bean commented on HBASE-7313: - This reset has been there in ColumnPaginationFilter#reset(), right? ColumnPaginationFilter should reset count when moving to NEXT_ROW - Key: HBASE-7313 URL: https://issues.apache.org/jira/browse/HBASE-7313 Project: HBase Issue Type: Bug Components: Filters Affects Versions: 0.94.3, 0.96.0 Reporter: Varun Sharma Assignee: Varun Sharma Fix For: 0.96.0, 0.94.4 Attachments: 7313-0.94.txt, 7313-trunk.txt ColumnPaginationFilter does not reset the count to zero on moving to the next row. Hence, if we have already returned the limit number of columns, subsequent rows will always return 0 columns. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
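For context, the fix under discussion amounts to clearing the per-row counter when the filter moves to a new row; a minimal sketch (assuming the 0.94-era Filter API and a count field, as the description implies):
{code}
@Override
public void reset() {
  this.count = 0;  // start counting columns afresh for the next row
}
{code}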
[jira] [Commented] (HBASE-7008) Set scanner caching to a better default
[ https://issues.apache.org/jira/browse/HBASE-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479588#comment-13479588 ] Jieshan Bean commented on HBASE-7008: - Thanks for the patch, xieliang. I suggest introducing a new constant in HConstants to define this default value. What do you think? Set scanner caching to a better default --- Key: HBASE-7008 URL: https://issues.apache.org/jira/browse/HBASE-7008 Project: HBase Issue Type: Bug Components: Client Reporter: liang xie Assignee: liang xie Attachments: HBASE-7008.patch per http://search-hadoop.com/m/qaRu9iM2f02/Set+scanner+caching+to+a+better+default%253Fsubj=Set+scanner+caching+to+a+better+default+ let's set to 100 by default -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
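As a sketch of the suggestion (the names below are assumptions mirroring HConstants' existing conventions, not a quote of the final patch):
{code}
// In HConstants: name the default instead of hard-coding 100 at call sites.
public static final String HBASE_CLIENT_SCANNER_CACHING =
    "hbase.client.scanner.caching";
public static final int DEFAULT_HBASE_CLIENT_SCANNER_CACHING = 100;
{code}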
[jira] [Commented] (HBASE-7008) Set scanner caching to a better default
[ https://issues.apache.org/jira/browse/HBASE-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479603#comment-13479603 ] Jieshan Bean commented on HBASE-7008: - OK, fine. A magic number is never good, but anyway it's not a problem. Set scanner caching to a better default --- Key: HBASE-7008 URL: https://issues.apache.org/jira/browse/HBASE-7008 Project: HBase Issue Type: Bug Components: Client Reporter: liang xie Assignee: liang xie Attachments: HBASE-7008.patch per http://search-hadoop.com/m/qaRu9iM2f02/Set+scanner+caching+to+a+better+default%253Fsubj=Set+scanner+caching+to+a+better+default+ let's set to 100 by default -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6854) Deletion of SPLITTING node on split rollback should clear the region from RIT
[ https://issues.apache.org/jira/browse/HBASE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461613#comment-13461613 ] Jieshan Bean commented on HBASE-6854: - I think it's ok :) Deletion of SPLITTING node on split rollback should clear the region from RIT - Key: HBASE-6854 URL: https://issues.apache.org/jira/browse/HBASE-6854 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Fix For: 0.94.3 Attachments: HBASE-6854.patch If a failure happens in the split before OFFLINING_PARENT, we tend to roll back the split, including deleting the znodes created. On deletion of the RS_ZK_SPLITTING node we get a callback but do not remove the region from RIT. We need to remove it from RIT; the SSH logic is well guarded anyway in case the delete event comes from an RS-down scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5950) Add a decimal comparator for Filter
[ https://issues.apache.org/jira/browse/HBASE-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461734#comment-13461734 ] Jieshan Bean commented on HBASE-5950: - This comparator is not needed if we store the Integer/Double/Float as bytes directly. Right? Add a decimal comparator for Filter --- Key: HBASE-5950 URL: https://issues.apache.org/jira/browse/HBASE-5950 Project: HBase Issue Type: New Feature Components: Filters Affects Versions: 0.94.0, 0.96.0 Reporter: Jieshan Bean Assignee: Jieshan Bean Suppose we have a requirement like the one below: we want to get the rows where one specified column value is larger than A and less than B (they are all decimals or integers), namely: A < Integer.valueOf(column) < B. Using BinaryComparator will not help us achieve that goal: e.g. suppose A = 100, B = 200, and one column value is 11. Compared as bytes, it satisfies the condition, but it's not a row we wanted. So I suggest adding a comparator to help implement this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
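A quick illustration of why BinaryComparator fails here (a sketch; it assumes the values are stored as ASCII strings, as in the example above):
{code}
byte[] a = Bytes.toBytes("100");
byte[] b = Bytes.toBytes("200");
byte[] v = Bytes.toBytes("11");
// Lexicographically "11" sorts between "100" and "200", so both checks
// pass even though 11 is outside the numeric range (100, 200):
System.out.println(Bytes.compareTo(v, a) > 0);  // true: "11" > "100"
System.out.println(Bytes.compareTo(v, b) < 0);  // true: "11" < "200"
{code}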
[jira] [Commented] (HBASE-6854) Deletion of SPLITTING node on split rollback should clear the region from RIT
[ https://issues.apache.org/jira/browse/HBASE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13461565#comment-13461565 ] Jieshan Bean commented on HBASE-6854: - We also found the same problem. Only 2 minor comments: 1.
{code}
LOG.debug("Ephemeral node deleted. Found in SPLIITING state. " + "Removing from RIT");
{code}
SPLIITING should be SPLITTING. 2. The HBaseAdmin in testShouldClearRITWhenNodeFoundInSplittingState should be closed in a finally block. Otherwise, I'm +1 on this patch. Deletion of SPLITTING node on split rollback should clear the region from RIT - Key: HBASE-6854 URL: https://issues.apache.org/jira/browse/HBASE-6854 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Fix For: 0.94.3 Attachments: HBASE-6854.patch If a failure happens in the split before OFFLINING_PARENT, we tend to roll back the split, including deleting the znodes created. On deletion of the RS_ZK_SPLITTING node we get a callback but do not remove the region from RIT. We need to remove it from RIT; the SSH logic is well guarded anyway in case the delete event comes from an RS-down scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6491) add limit function at ClientScanner
[ https://issues.apache.org/jira/browse/HBASE-6491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13460126#comment-13460126 ] Jieshan Bean commented on HBASE-6491: - @ronghai: Why not use PageFilter instead of adding this new method? add limit function at ClientScanner --- Key: HBASE-6491 URL: https://issues.apache.org/jira/browse/HBASE-6491 Project: HBase Issue Type: New Feature Components: Client Affects Versions: 0.96.0 Reporter: ronghai.ma Assignee: ronghai.ma Labels: patch Fix For: 0.96.0 Attachments: ClientScanner.java, HBASE-6491.patch Add a new method in ClientScanner to implement a function like LIMIT in MySQL. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
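For reference, the PageFilter alternative looks like this (a sketch; note that PageFilter limits rows per region/regionserver, so the client still trims to the exact limit):
{code}
Scan scan = new Scan();
scan.setFilter(new PageFilter(10));  // at most ~10 rows per region
ResultScanner scanner = table.getScanner(scan);
try {
  int count = 0;
  for (Result r : scanner) {
    if (++count > 10) break;  // client-side trim to the exact limit
    // process r ...
  }
} finally {
  scanner.close();
}
{code}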
[jira] [Commented] (HBASE-6748) Endless recursive of deleteNode happened in SplitLogManager#DeleteAsyncCallback
[ https://issues.apache.org/jira/browse/HBASE-6748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453587#comment-13453587 ] Jieshan Bean commented on HBASE-6748: - Yes. Long.MAX_VALUE is the problem. Endless recursion of deleteNode in SplitLogManager#DeleteAsyncCallback --- Key: HBASE-6748 URL: https://issues.apache.org/jira/browse/HBASE-6748 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.96.0, 0.94.1 Reporter: Jieshan Bean Priority: Critical Fix For: 0.96.0, 0.94.3 You can easily understand the problem from the logs below:
{code}
[2012-09-01 11:41:02,062] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$CreateAsyncCallback 978] create rc =SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=3
[2012-09-01 11:41:02,062] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$CreateAsyncCallback 978] create rc =SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=2
[2012-09-01 11:41:02,063] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$CreateAsyncCallback 978] create rc =SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=1
[2012-09-01 11:41:02,063] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$CreateAsyncCallback 978] create rc =SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=0
[2012-09-01 11:41:02,063] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager 393] failed to create task node/hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846
[2012-09-01 11:41:02,063] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager 353] Error splitting /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846
[2012-09-01 11:41:02,063] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback 1052] delete rc=SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=9223372036854775807
[2012-09-01 11:41:02,064] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback 1052] delete rc=SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=9223372036854775806
[2012-09-01 11:41:02,064] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback 1052] delete rc=SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=9223372036854775805
[2012-09-01 11:41:02,064] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback 1052] delete rc=SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=9223372036854775804
[2012-09-01 11:41:02,065] [WARN ] [MASTER_SERVER_OPERATIONS-xh03,2,1339549619270-1] [org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback 1052] delete rc=SESSIONEXPIRED for /hbase/splitlog/hdfs%3A%2F%2Fxh01%3A9000%2Fhbase%2F.logs%2Fxh01%2C20020%2C1339552105088-splitting%2Fxh01%252C20020%252C1339552105088.1339557014846 remaining retries=9223372036854775803
...
[2012-09-01 11:41:03,307] [ERROR]
{code}
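The root cause Jieshan confirms above ("Long.MAX_VALUE is the problem") is easy to see in miniature: the async delete callback is resubmitted with a retry counter that starts at Long.MAX_VALUE, so a non-recoverable error like SESSIONEXPIRED just decrements the counter and recurses, which is exactly what produces the remaining retries=9223372036854775807, ...806, ...805 countdown in the log. Below is a minimal, self-contained Java sketch of that retry shape and one possible fix; RetryableDelete, Rc, and deleteNode are illustrative stand-ins, not the actual HBase or ZooKeeper APIs.
{code}
public class RetryableDelete {

    enum Rc { OK, SESSIONEXPIRED }

    // Stand-in for the ZooKeeper delete call: once the session has expired,
    // every retry gets SESSIONEXPIRED again, exactly as the log above shows.
    static Rc deleteNode(String path) {
        return Rc.SESSIONEXPIRED;
    }

    // Buggy shape: started with retries = Long.MAX_VALUE, the callback
    // recurses ~9.2e18 times on an error that can never succeed.
    static void deleteBuggy(String path, long retries) {
        Rc rc = deleteNode(path);
        if (rc != Rc.OK && retries > 0) {
            System.out.println("delete rc=" + rc + " remaining retries=" + (retries - 1));
            deleteBuggy(path, retries - 1); // endless in practice, and deepens the stack
        }
    }

    // One possible fix shape: classify SESSIONEXPIRED as fatal instead of retryable.
    static void deleteFixed(String path, long retries) {
        Rc rc = deleteNode(path);
        if (rc == Rc.SESSIONEXPIRED) {
            throw new IllegalStateException("ZK session expired while deleting " + path);
        }
        if (rc != Rc.OK && retries > 0) {
            deleteFixed(path, retries - 1);
        }
    }

    public static void main(String[] args) {
        try {
            deleteFixed("/hbase/splitlog/some-task", 3); // fails fast
        } catch (IllegalStateException e) {
            System.out.println("fatal: " + e.getMessage());
        }
        // deleteBuggy("/hbase/splitlog/some-task", Long.MAX_VALUE); // would spin ~forever
    }
}
{code}
Note that merely bounding the counter would not be enough here: a session-expired client can never succeed, so the real fix is to treat SESSIONEXPIRED as non-retryable, which is also where the discussion in the next comment goes.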
[jira] [Commented] (HBASE-6748) Endless recursion of deleteNode in SplitLogManager#DeleteAsyncCallback
[ https://issues.apache.org/jira/browse/HBASE-6748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453666#comment-13453666 ] Jieshan Bean commented on HBASE-6748: - Either one: both master startup and region server failure handling may trigger HLog splitting. Yes, I think HMaster should abort when sessionTimeout happens. Endless recursion of deleteNode in SplitLogManager#DeleteAsyncCallback --- Key: HBASE-6748 URL: https://issues.apache.org/jira/browse/HBASE-6748 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.96.0, 0.94.1 Reporter: Jieshan Bean Priority: Critical Fix For: 0.96.0, 0.94.3 You can easily understand the problem from the logs below:
{code}
[... same SESSIONEXPIRED retry log as quoted in the previous comment ...]
{code}
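Following the suggestion above that HMaster should abort on session timeout, here is a hedged sketch of what that classification could look like: transient errors such as CONNECTIONLOSS get a bounded retry, while SESSIONEXPIRED aborts the master so that failover or restart can redo the log splitting with a fresh session. ZkResult, MasterStub, and handleDeleteResult are illustrative names for this sketch, not actual HBase APIs.
{code}
public class SessionExpiryHandling {

    enum ZkResult { OK, CONNECTIONLOSS, SESSIONEXPIRED }

    // Stand-in for the master's abort hook.
    static class MasterStub {
        void abort(String why) {
            System.out.println("Aborting master: " + why);
        }
    }

    // Distinguish transient errors (worth a bounded retry) from fatal ones.
    static void handleDeleteResult(ZkResult rc, String path, int retriesLeft, MasterStub master) {
        switch (rc) {
            case OK:
                return; // task node gone, done
            case CONNECTIONLOSS: // transient: bounded retry
                if (retriesLeft > 0) {
                    System.out.println("retrying delete of " + path + ", retries left=" + retriesLeft);
                    // resubmit the async delete here with retriesLeft - 1
                }
                return;
            case SESSIONEXPIRED: // fatal: no retry can ever succeed
                master.abort("ZK session expired while deleting " + path);
                return;
        }
    }

    public static void main(String[] args) {
        handleDeleteResult(ZkResult.SESSIONEXPIRED, "/hbase/splitlog/some-task", 3, new MasterStub());
    }
}
{code}
The design point is the same one the retry countdown in the logs makes empirically: a retry budget only helps for errors that can eventually succeed, so the error classification, not the counter, is what prevents the endless deleteNode loop.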