[jira] [Commented] (HBASE-13461) RegionServer Hlog flush BLOCKED on hbase-0.96.2-hadoop2

2015-04-14 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493812#comment-14493812
 ] 

Liang Xie commented on HBASE-13461:
---

Usually this is not an HBase issue.  Most of the time, waitForAckedSeqno means the 
datanode write pipeline is slow for some reason, e.g. slow write I/O or a 
datanode internal lock.
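A minimal, HBase-free sketch of the lock pattern in the thread dump quoted below (all class and field names here are illustrative stand-ins, not HDFS code): one thread does a timed wait for acks while holding the output-stream monitor, so a second thread contending for that monitor shows up as BLOCKED:

```java
// "syncer" stands in for the handler inside DFSOutputStream.flushOrSync,
// "appender" for the handler stuck in FSOutputSummer.write.
public class PipelineStall {
    static final Object streamLock = new Object(); // stands in for the DFSOutputStream monitor
    static final Object ackQueue   = new Object(); // stands in for the LinkedList of pending acks

    // Returns {syncer state, appender state} while the stall is in progress.
    public static Thread.State[] demo() {
        Thread syncer = new Thread(() -> {
            synchronized (streamLock) {              // held across the whole flush
                synchronized (ackQueue) {
                    try {
                        ackQueue.wait(5000);         // waiting for datanode acks -> TIMED_WAITING
                    } catch (InterruptedException ignored) { }
                }
            }
        });
        Thread appender = new Thread(() -> {
            synchronized (streamLock) { }            // contends for the monitor -> BLOCKED
        });
        syncer.setDaemon(true);
        appender.setDaemon(true);
        try {
            syncer.start();
            Thread.sleep(300);                       // let the syncer take both monitors
            appender.start();
            Thread.sleep(300);                       // let the appender reach the contended monitor
        } catch (InterruptedException ignored) { }
        Thread.State[] states = { syncer.getState(), appender.getState() };
        synchronized (ackQueue) { ackQueue.notifyAll(); } // end the stall
        return states;
    }
}
```

A jstack taken during such a stall shows the ack-waiting thread as TIMED_WAITING and everyone queued behind it as BLOCKED, which matches the dump in this issue; the root cause is whatever keeps the acks from arriving, not HBase itself.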

 RegionServer Hlog flush BLOCKED on hbase-0.96.2-hadoop2
 

 Key: HBASE-13461
 URL: https://issues.apache.org/jira/browse/HBASE-13461
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.96.2
 Environment: hbase-0.96.2-hadoop2   hadoop2.2.0
Reporter: zhangjg

 I dumped the thread stack below:
 RpcServer.handler=63,port=60020 daemon prio=10 tid=0x7fdcddc5d000 nid=0x5f9 waiting for monitor entry [0x7fd289194000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
 - waiting to lock 0x7fd36c023728 (a org.apache.hadoop.hdfs.DFSOutputStream)
 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:59)
 at java.io.DataOutputStream.write(DataOutputStream.java:90)
 - locked 0x7fd510cfdc28 (a org.apache.hadoop.hdfs.client.HdfsDataOutputStream)
 at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
 at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
 at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:87)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.hlogFlush(FSHLog.java:1026)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1075)
 - locked 0x7fd2d9bbfad0 (a java.lang.Object)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1240)
 at org.apache.hadoop.hbase.regionserver.HRegion.syncOrDefer(HRegion.java:5593)
 at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2315)
 at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2028)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.doBatchOp(HRegionServer.java:4094)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.doNonAtomicRegionMutation(HRegionServer.java:3380)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3284)
 at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26935)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2185)
 at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1889)
 RpcServer.handler=12,port=60020 daemon prio=10 tid=0x7fdcddf2c800 nid=0x5c6 in Object.wait() [0x7fd28c4c7000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:1803)
 - locked 0x7fd45857c540 (a java.util.LinkedList)
 at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1697)
 at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1590)
 at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1575)
 at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:121)
 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:135)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1098)
 at org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1240)
 at org.apache.hadoop.hbase.regionserver.HRegion.syncOrDefer(HRegion.java:5593)
 at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2315)
 at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2028)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.doBatchOp(HRegionServer.java:4094)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.doNonAtomicRegionMutation(HRegionServer.java:3380)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3284)
 at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26935)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2185)
 at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1889)
 

[jira] [Updated] (HBASE-13454) SecureClient#setupIOStreams should handle all Exception

2015-04-13 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-13454:
--
Assignee: cuijianwei

 SecureClient#setupIOStreams should handle all Exception
 ---

 Key: HBASE-13454
 URL: https://issues.apache.org/jira/browse/HBASE-13454
 Project: HBase
  Issue Type: Bug
  Components: Client, security
Affects Versions: 0.94.27
Reporter: cuijianwei
Assignee: cuijianwei
 Attachments: HBASE-13454-0.94.patch


 Similar to [HBASE-9256 | https://issues.apache.org/jira/browse/HBASE-9256], 
 SecureClient#setupIOStreams should also catch all Exceptions. In our 
 production environment, an OutOfMemoryError was thrown in 
 SecureClient#setupIOStreams, leaving an unclosed SecureConnection with its 
 inputStream/outputStream still null. Therefore, subsequent calls will be 
 attached to this connection and will encounter an NPE in HBaseClient#sendParam:
 {code}
 java.lang.NullPointerException
 at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:619)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:977)
 at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
 at $Proxy21.replicateLogEntries(Unknown Source)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:772)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:451)
 {code}
 We'd like to fix this in 0.94.
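The failure mode described above can be sketched without any HBase dependencies. This is an illustrative model, not the actual SecureClient/HBaseClient code: stream setup that can die from an Error (e.g. OutOfMemoryError), with the fix being a catch of Throwable that marks the connection closed so later callers fail fast instead of NPE-ing on a half-initialized connection:

```java
import java.io.IOException;

// Hypothetical stand-in for SecureConnection; field and method names are
// illustrative only.
public class SafeSetup {
    static class Connection {
        Object outputStream;               // stays null if setup fails
        boolean closed;

        void setupIOStreams(Runnable doSetup) throws IOException {
            try {
                doSetup.run();             // may throw an Error, not just an Exception
                outputStream = new Object();
            } catch (Throwable t) {        // the point of the patch: Throwable, not just IOException
                closed = true;             // never hand out a half-initialized connection
                throw new IOException("setup failed", t);
            }
        }

        void sendParam() throws IOException {
            if (closed || outputStream == null) {
                throw new IOException("connection not set up"); // fail fast, no NPE
            }
            // ... write the call on outputStream ...
        }
    }
}
```

With a narrower catch, the OutOfMemoryError would propagate past the cleanup, the connection would stay cached with null streams, and every reuse would hit the NPE shown in the trace.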



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-13454) SecureClient#setupIOStreams should handle all Exception

2015-04-13 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-13454:
--
Status: Patch Available  (was: Open)



[jira] [Commented] (HBASE-13454) SecureClient#setupIOStreams should handle all Exception

2015-04-13 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492047#comment-14492047
 ] 

Liang Xie commented on HBASE-13454:
---

+1 from me.  And jianwei, you need to ping [~lhofhansl] about the 0.94 branch, since 
he is the 0.94 branch RM.



[jira] [Assigned] (HBASE-13142) [PERF] Reuse the IPCUtil#buildCellBlock buffer

2015-03-09 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie reassigned HBASE-13142:
-

Assignee: Liang Xie  (was: stack)

 [PERF] Reuse the IPCUtil#buildCellBlock buffer
 --

 Key: HBASE-13142
 URL: https://issues.apache.org/jira/browse/HBASE-13142
 Project: HBase
  Issue Type: Improvement
  Components: Performance
Reporter: stack
Assignee: Liang Xie
  Labels: beginner
 Fix For: 2.0.0, 1.1.0

 Attachments: 13142.txt, 13142v2.txt, 13142v3.txt, 13142v5.0.98.txt, 
 13142v5.txt, 13142v5.txt, buffers.svg, clean.svg, gc.png, gc_time_spent.png, 
 hits.png, net.png, traces.2.svg, traces.svg


 Running some scan profiling, flight recorder was mildly fingering the resize of 
 the buffer allocated in IPCUtil#buildCellBlock as a point of contention.  It 
 half-heartedly blamed it for a few hundreds of ms over a five-minute 
 sampling, with a few tens of instances showing.
 I then tried the flamegraph/lightweight profiler, and it reported the buffer 
 allocations as taking 22% of our total CPU. See the attached traces.svg.
 I enabled TRACE-level logging on org.apache.hadoop.hbase.ipc.IPCUtil and 
 indeed every allocation was doing a resize from the initial allocation of 16k -- 
 the default -- up to 220k (this test returns ten rows, zipfian-sized 
 between 0 and 8k).
 Upping the initial allocation to 220k meant we now avoided the resize, but the 
 initial allocation was then blamed for 10% of allocations (see the attached traces.2.svg).
 Let's do buffer reuse.  It will save a bunch of allocation and CPU.
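One common shape for the proposed reuse (a sketch under the assumption that cell blocks are built per handler thread; not the actual IPCUtil patch, and all names are illustrative) is a per-thread buffer that is reset rather than reallocated:

```java
import java.io.ByteArrayOutputStream;

public class CellBlockBuffers {
    // One buffer per handler thread, started at the 16k default.
    private static final ThreadLocal<ByteArrayOutputStream> BUF =
        ThreadLocal.withInitial(() -> new ByteArrayOutputStream(16 * 1024));

    public static byte[] buildBlock(byte[] payload) {
        ByteArrayOutputStream buf = BUF.get();
        buf.reset();                        // keeps the grown backing array, no reallocation
        buf.write(payload, 0, payload.length);
        return buf.toByteArray();           // one copy out; the internal array stays for reuse
    }
}
```

The backing array grows once to the steady-state block size (the 220k seen in the TRACE logs) and is then reused, so neither the repeated resizes nor the per-call initial allocation should show up in the profile again.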





[jira] [Updated] (HBASE-12411) Optionally enable p-reads and private readers for compactions

2015-01-11 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12411:
--
Attachment: HBASE-12411-0.94.txt

 Optionally enable p-reads and private readers for compactions
 -

 Key: HBASE-12411
 URL: https://issues.apache.org/jira/browse/HBASE-12411
 Project: HBase
  Issue Type: Improvement
  Components: Performance
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 2.0.0, 0.98.9, 0.99.2

 Attachments: 12411-v2.txt, 12411-v3.txt, 12411-v4.txt, 12411.txt, 
 HBASE-12411-0.94.txt


 In light of HDFS-6735, we might want to consider refraining from seek + 
 read completely and only performing preads.
 For example, currently a compaction can lock out every other scanner over the 
 file which the compaction is currently reading.
 At the very least we can introduce an option to avoid seek + read, so we can 
 allow testing this in various scenarios.
 This will definitely be of great importance for projects like Phoenix, which 
 parallelize queries intra-region (and hence readers will be used concurrently by 
 multiple scanners with high likelihood).
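The seek + read vs. pread distinction can be seen with plain java.io/java.nio (illustrative code, not HBase's reader): a seek moves a cursor shared by every user of the stream, so readers must be serialized, while a positional read takes the offset as an argument and leaves no shared state:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ScanReads {
    // seek + read: mutates the cursor shared by every user of this stream, so
    // concurrent readers must take turns -- this serialization is what lets a
    // long compaction read lock out the other scanners on the same file.
    public static synchronized int seekRead(RandomAccessFile f, long off) {
        try {
            f.seek(off);
            return f.read();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // pread: the offset is an argument and no shared state is touched, so any
    // number of scanners can call this on the same open channel concurrently.
    public static int pread(FileChannel ch, long off) {
        try {
            ByteBuffer b = ByteBuffer.allocate(1);
            ch.read(b, off);                // positional read, no shared cursor
            return b.get(0) & 0xff;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```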





[jira] [Commented] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-29 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260725#comment-14260725
 ] 

Liang Xie commented on HBASE-12767:
---

Thanks all for review, will commit shortly

 Fix a StoreFileScanner NPE in reverse scan flow
 ---

 Key: HBASE-12767
 URL: https://issues.apache.org/jira/browse/HBASE-12767
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Scanners
Affects Versions: 1.0.0, 2.0.0, 0.98.9
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: HBASE-12767.txt


 we observed an NPE in our internal 0.94 branch production environment:
 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed openScanner
 java.lang.NullPointerException
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRow(StoreFileScanner.java:409)
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.backwardSeek(StoreFileScanner.java:467)
 at org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.seekScanners(ReversedStoreScanner.java:78)
 at org.apache.hadoop.hbase.regionserver.StoreScanner.init(StoreScanner.java:154)
 at org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.init(ReversedStoreScanner.java:49)
 at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:2216)
 at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.init(HRegion.java:3984)
 at org.apache.hadoop.hbase.regionserver.ReversedRegionScannerImpl.init(ReversedRegionScannerImpl.java:43)
 at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1851)
 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1837)
 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1812)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.internalOpenScanner(HRegionServer.java:2752)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2715)
 After tracing, it turned out to be a boundary bug in HalfStoreFileReader. If we 
 have a reverse scan request on the bottom part, once the start row = the last 
 row of the bottom part, the NPE will be encountered, since we will invoke 
 next() and atEnd will be set to true in HalfStoreFileReader; then, when 
 we want to do a seekBefore or something else in StoreFileScanner, 
 StoreFileScanner.cur will call:
 {code}
   public Cell getKeyValue() {
 if (atEnd) return null;
 {code}
 The fix is simple: we need to reset the atEnd flag once the seekBefore op is 
 successful. You can undo my HalfStoreFileReader change and run the newly added 
 case to see the NPE.
 P.S.: I also found some cases in TestHRegion missing the @Test annotation, so I 
 fixed them in the same patch as well.





[jira] [Updated] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-29 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12767:
--
Fix Version/s: 1.1.0
   0.98.10
   2.0.0
   1.0.0



[jira] [Updated] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-29 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12767:
--
Hadoop Flags: Reviewed



[jira] [Updated] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-29 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12767:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)



[jira] [Created] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-28 Thread Liang Xie (JIRA)
Liang Xie created HBASE-12767:
-

 Summary: Fix a StoreFileScanner NPE in reverse scan flow
 Key: HBASE-12767
 URL: https://issues.apache.org/jira/browse/HBASE-12767
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Scanners
Affects Versions: 0.98.9, 1.0.0, 2.0.0
Reporter: Liang Xie
Assignee: Liang Xie




[jira] [Updated] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-28 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12767:
--
Description: 

[jira] [Updated] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-28 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12767:
--
Attachment: HBASE-12767.txt

 Fix a StoreFileScanner NPE in reverse scan flow
 ---

 Key: HBASE-12767
 URL: https://issues.apache.org/jira/browse/HBASE-12767
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Scanners
Affects Versions: 1.0.0, 2.0.0, 0.98.9
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: HBASE-12767.txt


 we observed an NPE in our internal 0.94 branch prod env:
 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed openScanner
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRow(StoreFileScanner.java:409)
 at 
 org.apache.hadoop.hbase.regionserver.StoreFileScanner.backwardSeek(StoreFileScanner.java:467)
 at 
 org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.seekScanners(ReversedStoreScanner.java:78)
 at 
 org.apache.hadoop.hbase.regionserver.StoreScanner.init(StoreScanner.java:154)
 at 
 org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.init(ReversedStoreScanner.java:49)
 at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:2216)
 at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.init(HRegion.java:3984)
 at 
 org.apache.hadoop.hbase.regionserver.ReversedRegionScannerImpl.init(ReversedRegionScannerImpl.java:43)
 at 
 org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1851)
 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1837)
 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1812)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.internalOpenScanner(HRegionServer.java:2752)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2715)
 After tracing, it turned out to be a boundary bug in HalfStoreFileReader. If we 
 have a reverse scan request against the bottom part, then once the start row 
 equals the last row of the bottom part, an NPE will be encountered: we invoke 
 next(), atEnd = true is set in HalfStoreFileReader, and then when we want to do 
 seekBefore or similar in StoreFileScanner, 
 StoreFileScanner.cur will call:
 {code}
   public Cell getKeyValue() {
 if (atEnd) return null;
 {code}
 The fix is simple: we need to reset the atEnd flag once the seekBefore op 
 succeeds. You can undo my HalfStoreFileReader change and run the newly added 
 case to see the NPE.
 ps: I also found some cases in TestHRegion missing the @Test annotation, so I 
 fixed them in the same patch as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12767) Fix a StoreFileScanner NPE in reverse scan flow

2014-12-28 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12767:
--
Status: Patch Available  (was: Open)

 Fix a StoreFileScanner NPE in reverse scan flow
 ---

 Key: HBASE-12767
 URL: https://issues.apache.org/jira/browse/HBASE-12767
 Project: HBase
  Issue Type: Bug
  Components: regionserver, Scanners
Affects Versions: 0.98.9, 1.0.0, 2.0.0
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: HBASE-12767.txt


 we observed an NPE in our internal 0.94 branch prod env:
 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed openScanner
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRow(StoreFileScanner.java:409)
 at 
 org.apache.hadoop.hbase.regionserver.StoreFileScanner.backwardSeek(StoreFileScanner.java:467)
 at 
 org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.seekScanners(ReversedStoreScanner.java:78)
 at 
 org.apache.hadoop.hbase.regionserver.StoreScanner.init(StoreScanner.java:154)
 at 
 org.apache.hadoop.hbase.regionserver.ReversedStoreScanner.init(ReversedStoreScanner.java:49)
 at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:2216)
 at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.init(HRegion.java:3984)
 at 
 org.apache.hadoop.hbase.regionserver.ReversedRegionScannerImpl.init(ReversedRegionScannerImpl.java:43)
 at 
 org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1851)
 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1837)
 at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1812)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.internalOpenScanner(HRegionServer.java:2752)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2715)
 After tracing, it turned out to be a boundary bug in HalfStoreFileReader. If we 
 have a reverse scan request against the bottom part, then once the start row 
 equals the last row of the bottom part, an NPE will be encountered: we invoke 
 next(), atEnd = true is set in HalfStoreFileReader, and then when we want to do 
 seekBefore or similar in StoreFileScanner, 
 StoreFileScanner.cur will call:
 {code}
   public Cell getKeyValue() {
 if (atEnd) return null;
 {code}
 The fix is simple: we need to reset the atEnd flag once the seekBefore op 
 succeeds. You can undo my HalfStoreFileReader change and run the newly added 
 case to see the NPE.
 ps: I also found some cases in TestHRegion missing the @Test annotation, so I 
 fixed them in the same patch as well.





[jira] [Commented] (HBASE-4755) HBase based block placement in DFS

2014-12-22 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255597#comment-14255597
 ] 

Liang Xie commented on HBASE-4755:
--

[~devaraj], [~stack], if my understanding is correct, this JIRA could be set to 
resolved status? I am just reading the related HDFS/HBase locality 
optimization jiras and thinking of porting a raw/simple version to our internal 
0.94 branch now :)

 HBase based block placement in DFS
 --

 Key: HBASE-4755
 URL: https://issues.apache.org/jira/browse/HBASE-4755
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.94.0
Reporter: Karthik Ranganathan
Assignee: Christopher Gist
Priority: Critical
 Attachments: 4755-wip-1.patch, hbase-4755-notes.txt


  The feature as is only useful for HBase clusters that care about data 
 locality on regionservers, but this feature can also enable a lot of nice 
 features down the road.
 The basic idea is as follows: instead of letting HDFS determine where to 
 replicate data (r=3) by placing blocks on various nodes, it is better to let 
 HBase do so by providing hints to HDFS through the DFS client. That way, 
 instead of replicating data at the block level, we can replicate data at a 
 per-region level (each region owned by a primary, a secondary and a tertiary 
 regionserver). This is better for two reasons:
 - Can make region failover faster on clusters which benefit from data affinity
 - On large clusters with random block placement policy, this helps reduce the 
 probability of data loss
 The algo is as follows:
 - Each region in META will have 3 columns which are the preferred 
 regionservers for that region (primary, secondary and tertiary)
 - Preferred assignment can be controlled by a config knob
 - Upon cluster start, HMaster will enter a mapping from each region to 3 
 regionservers (random hash, could use current locality, etc)
 - The load balancer would assign out regions preferring region assignments to 
 primary over secondary over tertiary over any other node
 - Periodically (say weekly, configurable) the HMaster would run a locality 
 check and make sure the map it has from regions to regionservers is optimal.
 Down the road, this can be enhanced to control region placement in the 
 following cases:
 - Mixed hardware SKU where some regionservers can hold fewer regions
 - Load balancing across tables where we don't want multiple regions of a table 
 to get assigned to the same regionservers
 - Multi-tenancy, where we can restrict the assignment of the regions of some 
 table to a subset of regionservers, so an abusive app cannot take down the 
 whole HBase cluster.
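The region-to-three-servers mapping described above can be sketched in a few lines of Java. This is a hedged illustration, not HBase's actual favored-nodes balancer: the class name FavoredNodesSketch and the plain-hash policy are assumptions, and a real policy would also weigh racks and current locality.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: derive a (primary, secondary, tertiary) preference list for a region
// from a simple hash over the server list. Illustrative only.
public class FavoredNodesSketch {
    public static List<String> preferredServers(String regionName,
                                                List<String> servers) {
        int n = servers.size();
        // Deterministic primary pick; floorMod keeps the index non-negative.
        int primary = Math.floorMod(regionName.hashCode(), n);
        // Secondary and tertiary are just the next distinct servers here; a
        // production policy would also consider racks and existing locality.
        return Arrays.asList(servers.get(primary),
                             servers.get((primary + 1) % n),
                             servers.get((primary + 2) % n));
    }
}
```

Because the mapping is deterministic, the load balancer can recompute it at any time and still prefer primary over secondary over tertiary over any other node.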





[jira] [Updated] (HBASE-12739) Avoid too large identifier of ZooKeeperWatcher

2014-12-22 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12739:
--
Fix Version/s: (was: 1.0.0)

 Avoid too large identifier of ZooKeeperWatcher
 --

 Key: HBASE-12739
 URL: https://issues.apache.org/jira/browse/HBASE-12739
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-12739-trunk-v1.diff


 For each SyncConnected event, the ZooKeeperWatcher will append the session 
 id to its identity.
 During a zk failover, the zookeeper client can connect to the zk server, 
 but the zk server cannot serve the request, so the client will retry 
 continually, which produces many SyncConnected events and a very large 
 ZooKeeperWatcher identifier in the hbase log.
 {code}
 2014-12-22,12:38:56,296 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:16500-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-...
 {code}
 A simple patch to fix this problem.
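The growth pattern in that log line can be modeled in a few lines of Java. This is a sketch of the problem's shape, not the actual ZooKeeperWatcher code; the class and method names are assumptions, and one plausible guard (only append when the session id actually changed) stands in for whatever the attached patch does.

```java
// Sketch of the identifier-growth problem: every SyncConnected event appends
// the session id, so repeated reconnects to an unavailable quorum make the
// identifier grow without bound. Names are illustrative, not HBase's.
public class WatcherIdentifierSketch {
    private String identifier;

    public WatcherIdentifierSketch(String base) { this.identifier = base; }

    // Pre-fix behavior: unconditionally append on every SyncConnected event.
    public void onSyncConnectedBuggy(String sessionId) {
        identifier = identifier + "-" + sessionId;
    }

    // One possible guard: only append when the session id actually changed.
    public void onSyncConnectedFixed(String sessionId) {
        if (!identifier.endsWith("-" + sessionId)) {
            identifier = identifier + "-" + sessionId;
        }
    }

    public String getIdentifier() { return identifier; }
}
```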





[jira] [Updated] (HBASE-12739) Avoid too large identifier of ZooKeeperWatcher

2014-12-22 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12739:
--
Affects Version/s: 0.94.11

 Avoid too large identifier of ZooKeeperWatcher
 --

 Key: HBASE-12739
 URL: https://issues.apache.org/jira/browse/HBASE-12739
 Project: HBase
  Issue Type: Bug
  Components: Zookeeper
Affects Versions: 0.94.11
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-12739-trunk-v1.diff


 For each SyncConnected event, the ZooKeeperWatcher will append the session 
 id to its identity.
 During a zk failover, the zookeeper client can connect to the zk server, 
 but the zk server cannot serve the request, so the client will retry 
 continually, which produces many SyncConnected events and a very large 
 ZooKeeperWatcher identifier in the hbase log.
 {code}
 2014-12-22,12:38:56,296 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:16500-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-...
 {code}
 A simple patch to fix this problem.





[jira] [Updated] (HBASE-12739) Avoid too large identifier of ZooKeeperWatcher

2014-12-22 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-12739:
--
Component/s: Zookeeper

 Avoid too large identifier of ZooKeeperWatcher
 --

 Key: HBASE-12739
 URL: https://issues.apache.org/jira/browse/HBASE-12739
 Project: HBase
  Issue Type: Bug
  Components: Zookeeper
Affects Versions: 0.94.11
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-12739-trunk-v1.diff


 For each SyncConnected event, the ZooKeeperWatcher will append the session 
 id to its identity.
 During a zk failover, the zookeeper client can connect to the zk server, 
 but the zk server cannot serve the request, so the client will retry 
 continually, which produces many SyncConnected events and a very large 
 ZooKeeperWatcher identifier in the hbase log.
 {code}
 2014-12-22,12:38:56,296 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:16500-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-...
 {code}
 A simple patch to fix this problem.





[jira] [Commented] (HBASE-12739) Avoid too large identifier of ZooKeeperWatcher

2014-12-22 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256566#comment-14256566
 ] 

Liang Xie commented on HBASE-12739:
---

+1.
I think the javadoc & release audit warnings are not related?

 Avoid too large identifier of ZooKeeperWatcher
 --

 Key: HBASE-12739
 URL: https://issues.apache.org/jira/browse/HBASE-12739
 Project: HBase
  Issue Type: Bug
  Components: Zookeeper
Affects Versions: 0.94.11
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-12739-trunk-v1.diff


 For each SyncConnected event, the ZooKeeperWatcher will append the session 
 id to its identity.
 During a zk failover, the zookeeper client can connect to the zk server, 
 but the zk server cannot serve the request, so the client will retry 
 continually, which produces many SyncConnected events and a very large 
 ZooKeeperWatcher identifier in the hbase log.
 {code}
 2014-12-22,12:38:56,296 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
 master:16500-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-...
 {code}
 A simple patch to fix this problem.





[jira] [Commented] (HBASE-2821) Keep young storefiles at lower replication

2014-08-12 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095103#comment-14095103
 ] 

Liang Xie commented on HBASE-2821:
--

Just a nit: how about setting it to 2 as well if replicas > 3, not just 
replicas--? It's not a big deal since I guess most folks set the default 
replication factor to 3 :)

 Keep young storefiles at lower replication
 --

 Key: HBASE-2821
 URL: https://issues.apache.org/jira/browse/HBASE-2821
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Reporter: Todd Lipcon
Assignee: Andrew Purtell
 Fix For: 2.0.0

 Attachments: HBASE-2821.patch, lifetime-distribution.png, 
 storefile_age.pl


 jgray and I were brainstorming some ideas about this:
 In a typical heavy-write scenario, many store files do not last very long. 
 They're flushed and then within a small number of seconds a compaction runs 
 and they get deleted. For these short lifetime store files, it's less 
 likely that a failure will occur during the window in which they're valid. 
 So, I think we can consider some optimizations like the following:
 - Flush files at replication count 2. Scan once a minute for any store files 
 in the region that are older than 2 minutes. If they're found, increase their 
 replication to 3. (alternatively, queue them to avoid scanning)
 - More dangerous: flush files at replication count 1, but don't count them 
 when figuring log expiration. So, if they get lost, we force log splitting to 
 recover.
 The performance gain here is that we avoid the network and disk transfer of 
 writing the third replica for a file that we're just about to delete anyway.
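The first proposal above (flush at replication 2, then raise old files to 3) is essentially a periodic scan over store-file ages. The sketch below models just that scan logic in plain Java; YoungFileReplicationSketch and its callback are illustrative assumptions, with the callback standing in for the real HDFS call, FileSystem.setReplication(path, (short) 3).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Sketch: once a minute, find store files older than the threshold and raise
// their replication. The FileSystem interaction is abstracted into a callback.
public class YoungFileReplicationSketch {
    static final long AGE_THRESHOLD_MS = 2 * 60 * 1000; // "older than 2 minutes"

    // Returns the paths whose replication was raised.
    public static List<String> raiseOldFiles(Map<String, Long> fileCreateTimes,
                                             long nowMs,
                                             Consumer<String> setReplicationTo3) {
        List<String> raised = new ArrayList<>();
        for (Map.Entry<String, Long> e : fileCreateTimes.entrySet()) {
            if (nowMs - e.getValue() > AGE_THRESHOLD_MS) {
                setReplicationTo3.accept(e.getKey()); // e.g. fs.setReplication(...)
                raised.add(e.getKey());
            }
        }
        return raised;
    }
}
```

Short-lived files that get compacted away before the threshold never pay for the third replica, which is exactly the network/disk saving the description argues for.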



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-7336) HFileBlock.readAtOffset does not work well with multiple threads

2014-07-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063209#comment-14063209
 ] 

Liang Xie commented on HBASE-7336:
--

Yes, I observed a similar problem. A long time ago I had a raw idea to 
implement a multi-stream/multi-reader prototype; maybe I can share the patch 
once it's ready :)

 HFileBlock.readAtOffset does not work well with multiple threads
 

 Key: HBASE-7336
 URL: https://issues.apache.org/jira/browse/HBASE-7336
 Project: HBase
  Issue Type: Sub-task
  Components: Performance
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Critical
 Fix For: 0.94.4, 0.95.0

 Attachments: 7336-0.94.txt, 7336-0.96.txt


 HBase grinds to a halt when many threads scan along the same set of blocks 
 and neither short-circuit read nor block caching is enabled for the dfs 
 client ... disabling the block cache makes sense on very large scans.
 It turns out that synchronizing on istream in HFileBlock.readAtOffset is the 
 culprit.





[jira] [Commented] (HBASE-11492) The servers do not honor the tcpNoDelay option

2014-07-10 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058307#comment-14058307
 ] 

Liang Xie commented on HBASE-11492:
---

I don't have deep knowledge in this area, but I can't convince myself that the 
original socket().setTcpNoDelay style didn't take effect, since HDFS + 
HBase have always used this style, right? So weird, my god...

 The servers do not honor the tcpNoDelay option
 --

 Key: HBASE-11492
 URL: https://issues.apache.org/jira/browse/HBASE-11492
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.92.2, 0.98.0, 0.96.0, 0.99.0, 0.94.20
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
Priority: Critical
 Fix For: 0.99.0, 0.98.5

 Attachments: 11492.v1.patch


 There is an option to set tcpNoDelay, defaulted to true, but the socket 
 channel is actually not changed. As a consequence, the server works with 
 nagle enabled. This leads to very degraded behavior when a single connection 
 is shared between threads. We enter into conflicts with nagle and tcp delayed 
 ack. 
 Here is an example of performance with the PE tool plus HBASE-11491:
 {noformat}
 oneCon  #client  sleep(s)  exeTime (seconds)              avg latency, sleep excluded (microseconds)
 true    1        0         31                             310
 false   1        0         31                             310
 true    2        0         50                             500
 false   2        0         31                             310
 true    2        5         488 (including 200s sleeping)  2880
 false   2        5         246 (including 200s sleeping)  460
 {noformat}
 The latency is multiplied by roughly 6 (2880 vs 460) when the connection is 
 shared. This is the delayed ack kicking in. It can be fixed by really using 
 tcp no delay.
 Any application sharing the tcp connection between threads has the issue.
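For reference, disabling Nagle on an accepted NIO channel is a one-line call; the missing step in the bug class described above is that the configured flag is never applied to the channel. The sketch below is self-contained and hedged: it only demonstrates the standard java.nio API, not HBase's actual server code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Sketch: honor a tcpNoDelay flag on the accepted channel. This is the call
// that actually has to happen for Nagle to be disabled on the server side.
public class TcpNoDelaySketch {
    public static boolean demo() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                SocketChannel accepted = server.accept();
                // The step the bug omitted: applying the configured option.
                accepted.setOption(StandardSocketOptions.TCP_NODELAY, true);
                boolean noDelay = accepted.getOption(StandardSocketOptions.TCP_NODELAY);
                accepted.close();
                return noDelay;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

On the pre-NIO API the equivalent call is socket().setTcpNoDelay(true); either way the option must be set on the socket that is actually carrying traffic.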





[jira] [Commented] (HBASE-11355) a couple of callQueue related improvements

2014-06-24 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042989#comment-14042989
 ] 

Liang Xie commented on HBASE-11355:
---

weird...
1. A full in-memory random read test, right? 
2. How many ycsb processes and concurrent threads?
3. Please ensure it's not a network-saturated scenario (e.g. tune ycsb settings 
like readallfields, or a small fieldcount/fieldlength in the loading phase).
4. Simply taking tens of thread dumps during testing could give more clues on 
the current hotspot.

 a couple of callQueue related improvements
 --

 Key: HBASE-11355
 URL: https://issues.apache.org/jira/browse/HBASE-11355
 Project: HBase
  Issue Type: Improvement
  Components: IPC/RPC, Performance
Affects Versions: 0.99.0, 0.94.20
Reporter: Liang Xie
Assignee: Matteo Bertozzi
 Attachments: HBASE-11355-v0.patch


 In one of my in-memory read-only tests (100% get requests), one of the top 
 scalability bottlenecks came from the single callQueue. Tentatively sharding 
 this callQueue according to the rpc handler number showed a big throughput 
 improvement (the original get() qps was around 60k; after this and other 
 hotspot tuning, I got 220k get() qps on the same single region server) in a 
 YCSB read-only scenario.
 Another thing we can do is separate the queue into a read call queue and a 
 write call queue; we have done this in our internal branch, and it helps in 
 some outages by preventing all-read or all-write requests from exhausting all 
 handler threads.
 One more thing is changing the current blocking behavior once the callQueue 
 is full: since a full callQueue almost always means the backend processing is 
 slow somehow, a fail-fast here is more reasonable if we use HBase as a 
 low-latency processing system. See callQueue.put(call).
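The fail-fast idea in the last paragraph maps directly onto the two standard BlockingQueue insertion methods: put(call) blocks the caller when the queue is full, while offer(call) returns false immediately. The sketch below is a hedged illustration with "Call" simplified to a String; it is not HBase's RPC code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: replace the blocking callQueue.put(call) with a non-blocking
// offer(call) so a full queue rejects the request immediately instead of
// stalling the reader thread.
public class FailFastQueueSketch {
    public static boolean enqueueFailFast(BlockingQueue<String> callQueue,
                                          String call) {
        // offer() returns false right away on a full queue; put() would block.
        return callQueue.offer(call);
    }

    public static int demo() {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(2);
        int rejected = 0;
        for (int i = 0; i < 5; i++) {
            if (!enqueueFailFast(q, "call-" + i)) {
                rejected++; // in a server this would become a fast error reply
            }
        }
        return rejected; // capacity 2: two calls accepted, three rejected
    }
}
```

The rejected calls would translate into an immediate "server busy" style response to the client, which is the more reasonable behavior for a low-latency system.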





[jira] [Updated] (HBASE-11274) More general single-row Condition Mutation

2014-06-22 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11274:
--

Assignee: Liu Shaohui

 More general single-row Condition Mutation
 --

 Key: HBASE-11274
 URL: https://issues.apache.org/jira/browse/HBASE-11274
 Project: HBase
  Issue Type: Improvement
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-11274-trunk-v1.diff


 Currently, the checkAndDelete and checkAndPut interfaces only support atomic 
 mutation with a single condition. But in real apps we need more general 
 condition-mutations that support multiple conditions and logical expressions 
 over those conditions.
 For example, to support the following sql
 {quote}
   insert row  where (column A == 'X' and column B == 'Y') or (column C == 'z')
 {quote}
 Suggestions are welcomed.





[jira] [Updated] (HBASE-11274) More general single-row Condition Mutation

2014-06-22 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11274:
--

Priority: Major  (was: Minor)

 More general single-row Condition Mutation
 --

 Key: HBASE-11274
 URL: https://issues.apache.org/jira/browse/HBASE-11274
 Project: HBase
  Issue Type: Improvement
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-11274-trunk-v1.diff


 Currently, the checkAndDelete and checkAndPut interfaces only support atomic 
 mutation with a single condition. But in real apps we need more general 
 condition-mutations that support multiple conditions and logical expressions 
 over those conditions.
 For example, to support the following sql
 {quote}
   insert row  where (column A == 'X' and column B == 'Y') or (column C == 'z')
 {quote}
 Suggestions are welcomed.





[jira] [Updated] (HBASE-11274) More general single-row Condition Mutation

2014-06-22 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11274:
--

Affects Version/s: 1.0.0
   Status: Patch Available  (was: Open)

seems I forgot to click Submit Patch :)

 More general single-row Condition Mutation
 --

 Key: HBASE-11274
 URL: https://issues.apache.org/jira/browse/HBASE-11274
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-11274-trunk-v1.diff


 Currently, the checkAndDelete and checkAndPut interfaces only support atomic 
 mutation with a single condition. But in real apps we need more general 
 condition-mutations that support multiple conditions and logical expressions 
 over those conditions.
 For example, to support the following sql
 {quote}
   insert row  where (column A == 'X' and column B == 'Y') or (column C == 'z')
 {quote}
 Suggestions are welcomed.





[jira] [Updated] (HBASE-11355) a couple of callQueue related improvements

2014-06-16 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11355:
--

Assignee: (was: Liang Xie)

 a couple of callQueue related improvements
 --

 Key: HBASE-11355
 URL: https://issues.apache.org/jira/browse/HBASE-11355
 Project: HBase
  Issue Type: Improvement
  Components: IPC/RPC
Affects Versions: 0.99.0, 0.94.20
Reporter: Liang Xie

 In one of my in-memory read-only tests (100% get requests), one of the top 
 scalability bottlenecks came from the single callQueue. Tentatively sharding 
 this callQueue according to the rpc handler number showed a big throughput 
 improvement (the original get() qps was around 60k; after this and other 
 hotspot tuning, I got 220k get() qps on the same single region server) in a 
 YCSB read-only scenario.
 Another thing we can do is separate the queue into a read call queue and a 
 write call queue; we have done this in our internal branch, and it helps in 
 some outages by preventing all-read or all-write requests from exhausting all 
 handler threads.
 One more thing is changing the current blocking behavior once the callQueue 
 is full: since a full callQueue almost always means the backend processing is 
 slow somehow, a fail-fast here is more reasonable if we use HBase as a 
 low-latency processing system. See callQueue.put(call).





[jira] [Commented] (HBASE-11346) Enable cacheonwrite for index and bloom blocks by default

2014-06-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032179#comment-14032179
 ] 

Liang Xie commented on HBASE-11346:
---

Something to think about: most new HFiles have a high probability of being 
compacted soon, so maybe this proposal will cause more block cache eviction?

 Enable cacheonwrite for index and bloom blocks by default
 -

 Key: HBASE-11346
 URL: https://issues.apache.org/jira/browse/HBASE-11346
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.99.0, 0.98.4
Reporter: Nick Dimiduk

 Theory is, this will reduce the cost of compactions on a mixed workload 
 cluster, reducing the number of seeks for the first blocks from the new HFile 
 from ~4 to ~1.





[jira] [Commented] (HBASE-11355) a couple of callQueue related improvements

2014-06-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032181#comment-14032181
 ] 

Liang Xie commented on HBASE-11355:
---

 Unassigned from me. Would be great if someone else picks it up; I am busy with 
other stuff.

 a couple of callQueue related improvements
 --

 Key: HBASE-11355
 URL: https://issues.apache.org/jira/browse/HBASE-11355
 Project: HBase
  Issue Type: Improvement
  Components: IPC/RPC
Affects Versions: 0.99.0, 0.94.20
Reporter: Liang Xie

 In one of my in-memory read-only tests (100% get requests), one of the top 
 scalability bottlenecks came from the single callQueue. Tentatively sharding 
 this callQueue according to the rpc handler number showed a big throughput 
 improvement (the original get() qps was around 60k; after this and other 
 hotspot tuning, I got 220k get() qps on the same single region server) in a 
 YCSB read-only scenario.
 Another thing we can do is separate the queue into a read call queue and a 
 write call queue; we have done this in our internal branch, and it helps in 
 some outages by preventing all-read or all-write requests from exhausting all 
 handler threads.
 One more thing is changing the current blocking behavior once the callQueue 
 is full: since a full callQueue almost always means the backend processing is 
 slow somehow, a fail-fast here is more reasonable if we use HBase as a 
 low-latency processing system. See callQueue.put(call).





[jira] [Commented] (HBASE-11355) a couple of callQueue related improvements

2014-06-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14032201#comment-14032201
 ] 

Liang Xie commented on HBASE-11355:
---

Thank you, Matteo,  good on you:)

 a couple of callQueue related improvements
 --

 Key: HBASE-11355
 URL: https://issues.apache.org/jira/browse/HBASE-11355
 Project: HBase
  Issue Type: Improvement
  Components: IPC/RPC
Affects Versions: 0.99.0, 0.94.20
Reporter: Liang Xie
Assignee: Matteo Bertozzi

 In one of my in-memory read-only tests (100% get requests), one of the top 
 scalability bottlenecks came from the single callQueue. Tentatively sharding 
 this callQueue according to the rpc handler number showed a big throughput 
 improvement (the original get() qps was around 60k; after this and other 
 hotspot tuning, I got 220k get() qps on the same single region server) in a 
 YCSB read-only scenario.
 Another thing we can do is separate the queue into a read call queue and a 
 write call queue; we have done this in our internal branch, and it helps in 
 some outages by preventing all-read or all-write requests from exhausting all 
 handler threads.
 One more thing is changing the current blocking behavior once the callQueue 
 is full: since a full callQueue almost always means the backend processing is 
 slow somehow, a fail-fast here is more reasonable if we use HBase as a 
 low-latency processing system. See callQueue.put(call).





[jira] [Created] (HBASE-11359) A simple request timeout interrupt mechanism in HBase server side

2014-06-16 Thread Liang Xie (JIRA)
Liang Xie created HBASE-11359:
-

 Summary: A simple request timeout interrupt mechanism in HBase 
server side
 Key: HBASE-11359
 URL: https://issues.apache.org/jira/browse/HBASE-11359
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.94.20, 0.99.0
Reporter: Liang Xie


We introduced a fail-fast mechanism in the RPC layer, see HBASE-10506.
But still, a slow-latency problem coming from the lower HDFS layer can make all 
HBase handler threads slow; sometimes those handlers hang for several seconds. 
It's meaningless to continue processing those read/write requests in the 
valuable rpc handler threads, especially the ones that need to do a costly 
physical read operation or networking activity (write pipeline). A better 
solution would be similar to the twitter MySQL branch's statement-timeout 
feature. I haven't taken the time to figure out whether it needs to break 
compatibility in master; we are using a 0.94 branch, and for me it will break 
if I add an operation-timeout field to every client rpc request.
So I added a simpler patch using the existing 
rpcCall.throwExceptionIfCallerDisconnected(), instrumenting it just before 
HLog sync and DFSInputStream read/pread (@readAtOffset). 
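The shape of that instrumentation is a cheap check placed just before the expensive operation. The sketch below is a hedged model in plain Java: DeadlineCheckSketch and its deadline logic are illustrative assumptions mirroring the role of rpcCall.throwExceptionIfCallerDisconnected(), not the actual patch.

```java
// Sketch: before a costly physical read or HLog sync, check whether the
// request is still worth serving (caller gone, or past its time budget)
// and bail out early to free the handler thread. Names are illustrative.
public class DeadlineCheckSketch {
    static class CallTimeoutException extends RuntimeException {
        CallTimeoutException(String m) { super(m); }
    }

    // Invoked just before the expensive operation, mirroring where
    // throwExceptionIfCallerDisconnected() is instrumented.
    static void throwIfExpired(long startMs, long timeoutMs, long nowMs) {
        if (nowMs - startMs > timeoutMs) {
            throw new CallTimeoutException("request exceeded " + timeoutMs + "ms");
        }
    }

    public static boolean demo(long elapsedMs) {
        try {
            throwIfExpired(0, 1000, elapsedMs);
            return true;   // still within budget: proceed with the read/sync
        } catch (CallTimeoutException e) {
            return false;  // fail fast; the client has likely given up anyway
        }
    }
}
```

The key property is that the check costs nothing when the system is healthy and only fires when the handler has already fallen behind.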





[jira] [Commented] (HBASE-11355) a couple of callQueue related improvements

2014-06-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14033389#comment-14033389
 ] 

Liang Xie commented on HBASE-11355:
---

I don't have a normal 0.94 patch; it's a preliminary hack. Other hotspots 
include: responseQueuesSizeThrottler, rpcMetrics, scannerReadPoints, etc.
The minor change to the callQueue is like below (we had separated the original 
callQueue into readCallQueue and writeCallQueue):
{code}
-  protected BlockingQueue<Call> readCallQueue; // read queued calls
+  protected List<BlockingQueue<Call>> readCallQueues; // read queued calls
...
-  boolean success = readCallQueue.offer(call);
+  boolean success =
+      readCallQueues.get(rand.nextInt(readHandlerCount)).offer(call);
...
-    this.readCallQueue = new LinkedBlockingQueue<Call>(readQueueLength);
+    this.readHandlerCount = Math.round(readQueueRatio * handlerCount);
+    this.readCallQueues = new LinkedList<BlockingQueue<Call>>();
+    for (int i = 0; i < readHandlerCount; i++) {
+      readCallQueues.add(new LinkedBlockingQueue<Call>(readQueueLength));
+    }
{code}

Every handler thread will consume its own queue, eliminating the severe 
contention.
If correctness or extra resource consumption is a concern, another call-queue 
sharding option is to introduce a queue-count setting (I just used the handler 
count, for simplicity, to get a raw perf number) and to always route all 
requests from the same client to the same queue.
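That client-affinity variant could be sketched roughly like this (hypothetical names, not the actual patch): hash the client identity onto a fixed shard so per-client ordering is preserved while cross-client contention drops.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of call-queue sharding by client identity (hypothetical names).
// Requests from the same client always hash to the same shard, so their
// relative order is kept; different clients spread across shards, which
// removes the single-queue lock contention.
class ShardedCallQueue<T> {
    private final List<BlockingQueue<T>> shards;

    ShardedCallQueue(int shardCount, int queueLength) {
        shards = new ArrayList<>(shardCount);
        for (int i = 0; i < shardCount; i++) {
            shards.add(new LinkedBlockingQueue<>(queueLength));
        }
    }

    /** Picks a shard deterministically from the client's identity. */
    boolean offer(String clientId, T call) {
        // floorMod handles negative hashCode values correctly
        int idx = Math.floorMod(clientId.hashCode(), shards.size());
        return shards.get(idx).offer(call);
    }

    /** The queue a given handler thread consumes. */
    BlockingQueue<T> shardFor(int handlerIndex) {
        return shards.get(handlerIndex % shards.size());
    }
}
```

The random-shard version in the diff above trades that per-client ordering for simplicity; this deterministic variant is the "same client, same queue" option mentioned in the comment.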

 a couple of callQueue related improvements
 --

 Key: HBASE-11355
 URL: https://issues.apache.org/jira/browse/HBASE-11355
 Project: HBase
  Issue Type: Improvement
  Components: IPC/RPC
Affects Versions: 0.99.0, 0.94.20
Reporter: Liang Xie
Assignee: Matteo Bertozzi

 In one of my in-memory read-only tests (100% get requests), one of the top 
 scalability bottlenecks was the single callQueue. Tentatively sharding this 
 callQueue by the RPC handler count showed a big throughput improvement (the 
 original get() qps was around 60k; after this and other hotspot tuning, I got 
 220k get() qps on the same single region server) in a YCSB read-only 
 scenario.
 Another thing we can do is separate the queue into a read call queue and a 
 write call queue; we have done this in our internal branch, and it would help 
 in some outages, to avoid all-read or all-write requests exhausting all 
 handler threads.
 One more thing is changing the current blocking behavior once the callQueue 
 is full: a full callQueue almost always means the backend processing is slow 
 somehow, so failing fast here would be more reasonable if we use HBase as a 
 low-latency processing system. See callQueue.put(call)





[jira] [Created] (HBASE-11355) a couple of callQueue related improvements

2014-06-15 Thread Liang Xie (JIRA)
Liang Xie created HBASE-11355:
-

 Summary: a couple of callQueue related improvements
 Key: HBASE-11355
 URL: https://issues.apache.org/jira/browse/HBASE-11355
 Project: HBase
  Issue Type: Improvement
  Components: IPC/RPC
Affects Versions: 0.94.20, 0.99.0
Reporter: Liang Xie
Assignee: Liang Xie


In one of my in-memory read-only tests (100% get requests), one of the top 
scalability bottlenecks was the single callQueue. Tentatively sharding this 
callQueue by the RPC handler count showed a big throughput improvement (the 
original get() qps was around 60k; after this and other hotspot tuning, I got 
220k get() qps on the same single region server) in a YCSB read-only scenario.
Another thing we can do is separate the queue into a read call queue and a 
write call queue; we have done this in our internal branch, and it would help 
in some outages, to avoid all-read or all-write requests exhausting all 
handler threads.
One more thing is changing the current blocking behavior once the callQueue is 
full: a full callQueue almost always means the backend processing is slow 
somehow, so failing fast here would be more reasonable if we use HBase as a 
low-latency processing system. See callQueue.put(call)





[jira] [Commented] (HBASE-11297) Remove some synchros in the rpcServer responder

2014-06-10 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026199#comment-14026199
 ] 

Liang Xie commented on HBASE-11297:
---

Yeah, confirmed: no obvious benefit in another type of test.
In my previous test, which observed this doRespond hotspot, I used a single 
YCSB process with 64 threads and saw lots of threads blocked at doRespond.
I then reran with 64 YCSB processes of 1 thread each; the blocking went away, 
but still no obvious throughput improvement was found.

 Remove some synchros in the rpcServer responder
 ---

 Key: HBASE-11297
 URL: https://issues.apache.org/jira/browse/HBASE-11297
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.99.0
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
 Fix For: 0.99.0

 Attachments: 11297.v1.patch, 11297.v2.patch, 11297.v2.v98.patch


 This is on top of another patch that I'm going to put into another jira.





[jira] [Commented] (HBASE-11297) Remove some synchros in the rpcServer responder

2014-06-10 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026217#comment-14026217
 ] 

Liang Xie commented on HBASE-11297:
---

No sharing. I just wanted to see whether, without that doRespond hotspot (the 
thread dumps also confirmed that no thread was waiting on doRespond()), there 
is any throughput improvement or not :)

 Remove some synchros in the rpcServer responder
 ---

 Key: HBASE-11297
 URL: https://issues.apache.org/jira/browse/HBASE-11297
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.99.0
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
 Fix For: 0.99.0

 Attachments: 11297.v1.patch, 11297.v2.patch, 11297.v2.v98.patch


 This is on top of another patch that I'm going to put into another jira.





[jira] [Commented] (HBASE-11297) Remove some synchros in the rpcServer responder

2014-06-05 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018666#comment-14018666
 ] 

Liang Xie commented on HBASE-11297:
---

Will look at the patch, since I observed this
{code}
synchronized (call.connection.responseQueue)
{code}
hotspot during one of my read-only tests.

 Remove some synchros in the rpcServer responder
 ---

 Key: HBASE-11297
 URL: https://issues.apache.org/jira/browse/HBASE-11297
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.99.0
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
 Fix For: 0.99.0

 Attachments: 11297.v1.patch


 This is on top of another patch that I'm going to put into another jira.





[jira] [Updated] (HBASE-11232) Region fail to release the updatelock for illegal CF in multi row mutations

2014-05-21 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11232:
--

Fix Version/s: (was: 0.99.0)

 Region fail to release the updatelock for illegal CF in multi row mutations
 ---

 Key: HBASE-11232
 URL: https://issues.apache.org/jira/browse/HBASE-11232
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-11232-0.94.diff


 The rollback code in processRowsWithLocks does not check the column family. 
 If there is an illegal CF in the mutation, it will throw a 
 NullPointerException and the update lock will not be released, so the region 
 can no longer be flushed or compacted.
 HRegion #4946
 {code}
 if (!mutations.isEmpty() && !walSyncSuccessful) {
   LOG.warn("Wal sync failed. Roll back " + mutations.size() +
       " memstore keyvalues for row(s):" +
       processor.getRowsToLock().iterator().next() + "...");
   for (KeyValue kv : mutations) {
     stores.get(kv.getFamily()).rollback(kv);
   }
 }
 // 11. Roll mvcc forward
 if (writeEntry != null) {
   mvcc.completeMemstoreInsert(writeEntry);
   writeEntry = null;
 }
 if (locked) {
   this.updatesLock.readLock().unlock();
   locked = false;
 }
 {code}
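The failure mode is the null dereference on `stores.get(kv.getFamily())` while the updates lock is still held. A simplified, self-contained sketch of the defensive idea (not the attached HBASE-11232-0.94.diff; store lookup is stood in for by a plain map):

```java
import java.util.Map;

// Sketch of a null-safe rollback loop (simplified stand-in, hypothetical
// names). Instead of dereferencing a possibly-null store for an illegal
// column family -- which would throw an NPE before the update lock is
// released -- unknown families are skipped.
class SafeRollback {
    /** Returns how many families were actually rolled back. */
    static int rollbackAll(Map<String, StringBuilder> stores,
                           Iterable<String> families) {
        int rolledBack = 0;
        for (String family : families) {
            StringBuilder store = stores.get(family);
            if (store == null) {
                // Illegal CF: skip (and in real code, log a warning)
                // rather than NPE while still holding the updates lock.
                continue;
            }
            store.setLength(0); // stand-in for store.rollback(kv)
            rolledBack++;
        }
        return rolledBack;
    }
}
```

The equivalent belt-and-braces fix is to release `updatesLock` in a `finally` block so no exception on this path can leave the region unflushable.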





[jira] [Updated] (HBASE-11232) Region fail to release the updatelock for illegal CF in multi row mutations

2014-05-21 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11232:
--

Status: Patch Available  (was: Open)

How about the 0.96 and 0.98 branches?

 Region fail to release the updatelock for illegal CF in multi row mutations
 ---

 Key: HBASE-11232
 URL: https://issues.apache.org/jira/browse/HBASE-11232
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.19
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-11232-0.94.diff


 The rollback code in processRowsWithLocks does not check the column family. 
 If there is an illegal CF in the mutation, it will throw a 
 NullPointerException and the update lock will not be released, so the region 
 can no longer be flushed or compacted.
 HRegion #4946
 {code}
 if (!mutations.isEmpty() && !walSyncSuccessful) {
   LOG.warn("Wal sync failed. Roll back " + mutations.size() +
       " memstore keyvalues for row(s):" +
       processor.getRowsToLock().iterator().next() + "...");
   for (KeyValue kv : mutations) {
     stores.get(kv.getFamily()).rollback(kv);
   }
 }
 // 11. Roll mvcc forward
 if (writeEntry != null) {
   mvcc.completeMemstoreInsert(writeEntry);
   writeEntry = null;
 }
 if (locked) {
   this.updatesLock.readLock().unlock();
   locked = false;
 }
 {code}





[jira] [Updated] (HBASE-11232) Region fail to release the updatelock for illegal CF in multi row mutations

2014-05-21 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-11232:
--

Affects Version/s: 0.94.19

 Region fail to release the updatelock for illegal CF in multi row mutations
 ---

 Key: HBASE-11232
 URL: https://issues.apache.org/jira/browse/HBASE-11232
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.19
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-11232-0.94.diff


 The rollback code in processRowsWithLocks does not check the column family. 
 If there is an illegal CF in the mutation, it will throw a 
 NullPointerException and the update lock will not be released, so the region 
 can no longer be flushed or compacted.
 HRegion #4946
 {code}
 if (!mutations.isEmpty() && !walSyncSuccessful) {
   LOG.warn("Wal sync failed. Roll back " + mutations.size() +
       " memstore keyvalues for row(s):" +
       processor.getRowsToLock().iterator().next() + "...");
   for (KeyValue kv : mutations) {
     stores.get(kv.getFamily()).rollback(kv);
   }
 }
 // 11. Roll mvcc forward
 if (writeEntry != null) {
   mvcc.completeMemstoreInsert(writeEntry);
   writeEntry = null;
 }
 if (locked) {
   this.updatesLock.readLock().unlock();
   locked = false;
 }
 {code}





[jira] [Updated] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-15 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10561:
--

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Integrated into trunk and the 0.98 and 0.96 branches.
There is a little difference in each branch, so the final committed change is 
not exactly the same as the original patch; just FYI.

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
Assignee: Liang Xie
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Updated] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-14 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10561:
--

Status: Patch Available  (was: Open)

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
Assignee: Liang Xie
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Commented] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-14 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996092#comment-13996092
 ] 

Liang Xie commented on HBASE-10561:
---

Thank you [~mbertozzi], [~stepinto] for your review.

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
Assignee: Liang Xie
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Commented] (HBASE-10052) use HDFS advisory caching to avoid caching HFiles that are not going to be read again (because they are being compacted)

2014-05-13 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993417#comment-13993417
 ] 

Liang Xie commented on HBASE-10052:
---

bq. One thing to be wary of: during the compaction, readers are still 
accessing the old files, so if you're compacting large files, this could 
really hurt read latency during compactions (assuming that people are relying 
on linux LRU in addition to hbase-internal LRU for performance).
Since by default we have 3 replicas at the HDFS layer, the current InputStream 
drops caching only for the single replica it picked, which seems not ideal 
considering the possible redundant caching on multiple nodes after a failover 
or something similar. How about providing an async function at the InputStream 
layer, say dropFileCaches, that gets all LocatedBlocks, and exposing a similar 
interface at the DN layer as well, to clear the caching for those blocks on 
all DNs?
We could request this async dropFileCaches just before closing the original 
store files being compacted.  Just a raw idea, crazy? :)


 use HDFS advisory caching to avoid caching HFiles that are not going to be 
 read again (because they are being compacted)
 

 Key: HBASE-10052
 URL: https://issues.apache.org/jira/browse/HBASE-10052
 Project: HBase
  Issue Type: Improvement
Reporter: Colin Patrick McCabe
Priority: Minor
 Fix For: 0.99.0, 0.98.3


 HBase can benefit from doing dropbehind during compaction since compacted 
 files are not read again.  HDFS advisory caching, introduced in HDFS-4817, 
 can help here.  The right API here is {{DataInputStream#setDropBehind}}.





[jira] [Commented] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-11 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993471#comment-13993471
 ] 

Liang Xie commented on HBASE-10561:
---

How about this one?
[~stepinto], it would be better if you had a chance to take a look at the 
attached patch as well, since you're the author of the original rpc scheduler 
and HBASE-10212 :)

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
Assignee: Liang Xie
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Assigned] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-11 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie reassigned HBASE-10561:
-

Assignee: Liang Xie

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
Assignee: Liang Xie
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Commented] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-11 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13994750#comment-13994750
 ] 

Liang Xie commented on HBASE-10561:
---

I'll commit it tomorrow unless there are any objections.

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
Assignee: Liang Xie
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Updated] (HBASE-10561) Forward port: HBASE-10212 New rpc metric: number of active handler

2014-05-10 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10561:
--

Attachment: HBASE-10561.txt

 Forward port: HBASE-10212 New rpc metric: number of active handler
 --

 Key: HBASE-10561
 URL: https://issues.apache.org/jira/browse/HBASE-10561
 Project: HBase
  Issue Type: Sub-task
  Components: IPC/RPC
Reporter: Lars Hofhansl
 Fix For: 0.99.0, 0.96.3, 0.98.3

 Attachments: HBASE-10561.txt


 The metrics implementation has changed a lot in 0.96.
 Forward port HBASE-10212 to 0.96 and later.





[jira] [Updated] (HBASE-10999) Cross-row Transaction : Implement Percolator Algorithm on HBase

2014-04-16 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10999:
--

Assignee: cuijianwei

 Cross-row Transaction : Implement Percolator Algorithm on HBase
 ---

 Key: HBASE-10999
 URL: https://issues.apache.org/jira/browse/HBASE-10999
 Project: HBase
  Issue Type: New Feature
  Components: Transactions/MVCC
Affects Versions: 0.94.18
Reporter: cuijianwei
Assignee: cuijianwei

 Cross-row transaction is a desired feature for a database. It is not easy to 
 keep the ACID characteristics of cross-row transactions in distributed 
 databases such as HBase, because the data of a cross-row transaction might be 
 located on different machines. In the paper 
 http://research.google.com/pubs/pub36726.html, Google presents an algorithm 
 (named Percolator) to implement cross-row transactions on BigTable. After 
 analyzing the algorithm, we found Percolator might also be a choice to 
 provide cross-row transactions on HBase. The reasons include:
 1. Percolator can keep the ACID of a cross-row transaction as described in 
 Google's paper. Percolator depends on a Global Incremental Timestamp Service 
 to define the order of transactions; this is important to keep the ACID of a 
 transaction.
 2. The Percolator algorithm can be implemented entirely client-side. This 
 means we do not need to change the logic of the server side. Users can easily 
 include Percolator in their client and adopt the Percolator APIs only when 
 they want cross-row transactions.
 3. Percolator is a general algorithm that can be implemented on top of any 
 database providing single-row transactions. Therefore, it is feasible to 
 implement Percolator on HBase.
 In the last few months, we have implemented Percolator on HBase, done 
 correctness validation and performance testing, and finally successfully 
 applied this algorithm in our production environment. Our work includes:
 1. A Percolator algorithm implementation on HBase. The current implementation 
 includes:
 a). a Transaction module that provides put/delete/get/scan interfaces to 
 do cross-row/cross-table transactions.
 b). a Global Incremental Timestamp Server to provide globally 
 monotonically increasing timestamps for transactions.
 c). a LockCleaner module to resolve conflicts when concurrent transactions 
 mutate the same column.
 d). an internal module to implement the prewrite/commit/get/scan logic of 
 Percolator.
 Although the Percolator logic could be implemented entirely client-side, we 
 use the coprocessor framework of HBase in our implementation. This is because 
 a coprocessor can provide Percolator-specific RPC interfaces such as 
 prewrite/commit to reduce RPC round trips and improve efficiency. Another 
 reason to use a coprocessor is that we want to decouple Percolator's code 
 from HBase, so that users get clean HBase code if they don't need cross-row 
 transactions. In the future, we will also explore the concurrent execution 
 characteristics of coprocessors to do cross-row mutations more efficiently.
 2. An AccountTransfer simulation program to validate the correctness of the 
 implementation. This program distributes initial values across different 
 tables, rows, and columns in HBase. Each column represents an account. Then, 
 a configured number of client threads are started concurrently to read out a 
 number of account values from different tables and rows via Percolator's get; 
 after this, the clients randomly transfer values among these accounts while 
 keeping the sum unchanged, which simulates concurrent cross-table/cross-row 
 transactions. To check the correctness of the transactions, a checker thread 
 periodically scans the account values from all columns and makes sure the 
 current total is the same as the initial total. We ran this validation 
 program while developing; it helped us correct errors in the implementation.
 3. Performance evaluation under various test situations. We compared 
 Percolator's APIs with HBase's at different data sizes and client thread 
 counts for single-column transactions, which represent the worst performance 
 case for Percolator. The performance comparison results are:
 a) For reads, the performance of Percolator is 90% of HBase's;
 b) For writes, the performance of Percolator is 23% of HBase's.
 The drop derives from the overhead of the Percolator logic; the performance 
 test results are similar to the results reported in Google's paper.
 4. Performance improvement. The write performance of Percolator decreases 
 more compared with HBase. This is because Percolator's write needs to read 
 data out to check for write conflicts and needs two RPCs, which do the 
 prewriting and committing respectively. We are investigating ways to improve 
 the write performance.
 We are glad to share current percolator implementation 

[jira] [Updated] (HBASE-10999) Cross-row Transaction : Implement Percolator Algorithm on HBase

2014-04-16 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10999:
--

Affects Version/s: (was: 0.94.18)
   0.99.0

 Cross-row Transaction : Implement Percolator Algorithm on HBase
 ---

 Key: HBASE-10999
 URL: https://issues.apache.org/jira/browse/HBASE-10999
 Project: HBase
  Issue Type: New Feature
  Components: Transactions/MVCC
Affects Versions: 0.99.0
Reporter: cuijianwei
Assignee: cuijianwei

 Cross-row transaction is a desired feature for a database. It is not easy to 
 keep the ACID characteristics of cross-row transactions in distributed 
 databases such as HBase, because the data of a cross-row transaction might be 
 located on different machines. In the paper 
 http://research.google.com/pubs/pub36726.html, Google presents an algorithm 
 (named Percolator) to implement cross-row transactions on BigTable. After 
 analyzing the algorithm, we found Percolator might also be a choice to 
 provide cross-row transactions on HBase. The reasons include:
 1. Percolator can keep the ACID of a cross-row transaction as described in 
 Google's paper. Percolator depends on a Global Incremental Timestamp Service 
 to define the order of transactions; this is important to keep the ACID of a 
 transaction.
 2. The Percolator algorithm can be implemented entirely client-side. This 
 means we do not need to change the logic of the server side. Users can easily 
 include Percolator in their client and adopt the Percolator APIs only when 
 they want cross-row transactions.
 3. Percolator is a general algorithm that can be implemented on top of any 
 database providing single-row transactions. Therefore, it is feasible to 
 implement Percolator on HBase.
 In the last few months, we have implemented Percolator on HBase, done 
 correctness validation and performance testing, and finally successfully 
 applied this algorithm in our production environment. Our work includes:
 1. A Percolator algorithm implementation on HBase. The current implementation 
 includes:
 a). a Transaction module that provides put/delete/get/scan interfaces to 
 do cross-row/cross-table transactions.
 b). a Global Incremental Timestamp Server to provide globally 
 monotonically increasing timestamps for transactions.
 c). a LockCleaner module to resolve conflicts when concurrent transactions 
 mutate the same column.
 d). an internal module to implement the prewrite/commit/get/scan logic of 
 Percolator.
 Although the Percolator logic could be implemented entirely client-side, we 
 use the coprocessor framework of HBase in our implementation. This is because 
 a coprocessor can provide Percolator-specific RPC interfaces such as 
 prewrite/commit to reduce RPC round trips and improve efficiency. Another 
 reason to use a coprocessor is that we want to decouple Percolator's code 
 from HBase, so that users get clean HBase code if they don't need cross-row 
 transactions. In the future, we will also explore the concurrent execution 
 characteristics of coprocessors to do cross-row mutations more efficiently.
 2. An AccountTransfer simulation program to validate the correctness of the 
 implementation. This program distributes initial values across different 
 tables, rows, and columns in HBase. Each column represents an account. Then, 
 a configured number of client threads are started concurrently to read out a 
 number of account values from different tables and rows via Percolator's get; 
 after this, the clients randomly transfer values among these accounts while 
 keeping the sum unchanged, which simulates concurrent cross-table/cross-row 
 transactions. To check the correctness of the transactions, a checker thread 
 periodically scans the account values from all columns and makes sure the 
 current total is the same as the initial total. We ran this validation 
 program while developing; it helped us correct errors in the implementation.
 3. Performance evaluation under various test situations. We compared 
 Percolator's APIs with HBase's at different data sizes and client thread 
 counts for single-column transactions, which represent the worst performance 
 case for Percolator. The performance comparison results are:
 a) For reads, the performance of Percolator is 90% of HBase's;
 b) For writes, the performance of Percolator is 23% of HBase's.
 The drop derives from the overhead of the Percolator logic; the performance 
 test results are similar to the results reported in Google's paper.
 4. Performance improvement. The write performance of Percolator decreases 
 more compared with HBase. This is because Percolator's write needs to read 
 data out to check for write conflicts and needs two RPCs, which do the 
 prewriting and committing respectively. We are investigating ways to improve 
 the write performance.
 We are 

[jira] [Commented] (HBASE-10948) Fix hbase table file 'x' mode

2014-04-15 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13970413#comment-13970413
 ] 

Liang Xie commented on HBASE-10948:
---

Totally agreed that there is a security issue here. IIRC, in our internal 
clusters, [~liushaohui] set fs.permissions.umask-mode explicitly to avoid 
this flaw long ago :)
That means we have at least two options: 1) a code fix, like Jerry's; 
2) no code modification, just a safer perm setting in the default config 
file. Thoughts?

 Fix hbase table file 'x' mode
 -

 Key: HBASE-10948
 URL: https://issues.apache.org/jira/browse/HBASE-10948
 Project: HBase
  Issue Type: Bug
  Components: Filesystem Integration
Affects Versions: 0.96.2, 0.98.1
Reporter: Jerry He
Assignee: Jerry He
 Fix For: 0.99.0

 Attachments: HBASE-10948-trunk-v2.patch, HBASE-10948-trunk.patch


 The hbase table files currently all have 'x' mode in there:
 {code}
 $hadoop fs -ls -R /hbase/data/default/TestTable/
 drwxr-xr-x   - hbase biadmin  0 2014-04-08 20:53 
 /hbase/data/default/TestTable/.tabledesc
 -rw-r--r--   1 hbase biadmin313 2014-04-08 20:53 
 /hbase/data/default/TestTable/.tabledesc/.tableinfo.01
 drwxr-xr-x   - hbase biadmin  0 2014-04-08 20:53 
 /hbase/data/default/TestTable/724c8c3075da516b450ce4826327ce64
 -rwxr-xr-x   1 hbase biadmin 68 2014-04-08 20:53 
 /hbase/data/default/TestTable/724c8c3075da516b450ce4826327ce64/.regioninfo
 drwxr-xr-x   - hbase biadmin  0 2014-04-08 21:54 
 /hbase/data/default/TestTable/724c8c3075da516b450ce4826327ce64/info
 -rwxr-xr-x   1 hbase biadmin  272958577 2014-04-08 20:53 
 /hbase/data/default/TestTable/724c8c3075da516b450ce4826327ce64/info/7138e61cbcd24538b64726db13dab48e
 -rwxr-xr-x   1 hbase biadmin  108603714 2014-04-08 20:53 
 /hbase/data/default/TestTable/724c8c3075da516b450ce4826327ce64/info/9ce233fcdfde49679797d13f28e26129
 drwxr-xr-x   - hbase biadmin  0 2014-04-08 20:53 
 /hbase/data/default/TestTable/b5350c581363f786e49ff6f32e71f564
 -rwxr-xr-x   1 hbase biadmin 68 2014-04-08 20:53 
 /hbase/data/default/TestTable/b5350c581363f786e49ff6f32e71f564/.regioninfo
 drwxr-xr-x   - hbase biadmin  0 2014-04-08 21:54 
 /hbase/data/default/TestTable/b5350c581363f786e49ff6f32e71f564/info
 -rwxr-xr-x   1 hbase biadmin   33800049 2014-04-08 21:54 
 /hbase/data/default/TestTable/b5350c581363f786e49ff6f32e71f564/info/576190de431341b9a02280654e3deb58
 -rwxr-xr-x   1 hbase biadmin  108650474 2014-04-08 20:53 
 /hbase/data/default/TestTable/b5350c581363f786e49ff6f32e71f564/info/7c54098fb62a4ef29aab0f5658b25260
 {code}
 If the user does not specify 'hbase.data.umask', we use the fs default:
 FsPermission.getDefault()
 Instead we should use FsPermission.getFileDefault().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10993) Deprioritize long-running scanners

2014-04-15 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13970452#comment-13970452
 ] 

Liang Xie commented on HBASE-10993:
---

+100 for this idea!
btw: we still need to introduce a hint in the future so that the end user can 
control the priority, right?

 Deprioritize long-running scanners
 --

 Key: HBASE-10993
 URL: https://issues.apache.org/jira/browse/HBASE-10993
 Project: HBase
  Issue Type: Sub-task
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
Priority: Minor
 Fix For: 1.0.0

 Attachments: HBASE-10993-v0.patch


 Currently we have a single call queue that serves all the normal user  
 requests, and the requests are executed in FIFO.
 When running map-reduce jobs and user-queries on the same machine, we want to 
 prioritize the user-queries.
 Without changing too much code, and not having the user giving hints, we can 
 add a “vtime” field to the scanner, to keep track from how long is running. 
 And we can replace the callQueue with a priorityQueue. In this way we can 
 deprioritize long-running scans, the longer a scan request lives the less 
 priority it gets.
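 The description above can be sketched as a priority queue keyed on the 
 scanner's accumulated running time ("vtime"). Class and field names here are 
 illustrative, not HBase's actual classes:
 {code}
```java
import java.util.PriorityQueue;

// Minimal sketch: replace the FIFO call queue with a priority queue ordered by
// how long the owning scanner has been running, so long-running scans sink
// below fresh user queries. All names are illustrative.
public class DeprioritizingQueue {
    static class ScanCall implements Comparable<ScanCall> {
        final String scannerId;
        final long vtimeMs; // accumulated running time of the owning scanner

        ScanCall(String scannerId, long vtimeMs) {
            this.scannerId = scannerId;
            this.vtimeMs = vtimeMs;
        }

        @Override
        public int compareTo(ScanCall other) {
            // Smaller vtime == higher priority: new queries are served first.
            return Long.compare(this.vtimeMs, other.vtimeMs);
        }
    }

    public static void main(String[] args) {
        PriorityQueue<ScanCall> callQueue = new PriorityQueue<>();
        callQueue.add(new ScanCall("mapreduce-scan", 120_000)); // running 2 min
        callQueue.add(new ScanCall("user-query", 15));          // just started
        // The fresh user query is dequeued before the long-running MR scan.
        System.out.println(callQueue.poll().scannerId); // user-query
    }
}
```
 {code}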





[jira] [Updated] (HBASE-10943) Backport HBASE-7329 to 0.94

2014-04-10 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10943:
--

Assignee: Liu Shaohui

 Backport HBASE-7329 to 0.94
 ---

 Key: HBASE-10943
 URL: https://issues.apache.org/jira/browse/HBASE-10943
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.18
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10943-0.94-v1.diff


 See HBASE-7329





[jira] [Updated] (HBASE-10943) Backport HBASE-7329 to 0.94

2014-04-10 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10943:
--

Status: Patch Available  (was: Open)

 Backport HBASE-7329 to 0.94
 ---

 Key: HBASE-10943
 URL: https://issues.apache.org/jira/browse/HBASE-10943
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.94.18
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10943-0.94-v1.diff


 See HBASE-7329





[jira] [Commented] (HBASE-10881) Support reverse scan in thrift2

2014-04-03 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959578#comment-13959578
 ] 

Liang Xie commented on HBASE-10881:
---

Confirmed it works after talking with Shaohui. Will commit v2 shortly.


 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff, HBASE-10881-trunk-v2.diff


 Support reverse scan in thrift2.





[jira] [Commented] (HBASE-10881) Support reverse scan in thrift2

2014-04-03 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959584#comment-13959584
 ] 

Liang Xie commented on HBASE-10881:
---

Integrated into trunk. Thanks [~lshmouse] for your patch, and thanks all for 
the review!

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff, HBASE-10881-trunk-v2.diff


 Support reverse scan in thrift2.





[jira] [Updated] (HBASE-10881) Support reverse scan in thrift2

2014-04-03 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10881:
--

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff, HBASE-10881-trunk-v2.diff


 Support reverse scan in thrift2.





[jira] [Commented] (HBASE-10881) Support reverse scan in thrift2

2014-04-01 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956248#comment-13956248
 ] 

Liang Xie commented on HBASE-10881:
---

[~tedyu] [~ram_krish] seems to still have a complaint even after HBASE-10824 
was committed. Mind opening a new JIRA to track it?

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10881-trunk-v1.diff


 Support reverse scan in thrift2.





[jira] [Updated] (HBASE-10881) Support reverse scan in thrift2

2014-04-01 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10881:
--

Component/s: Thrift

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff


 Support reverse scan in thrift2.





[jira] [Updated] (HBASE-10881) Support reverse scan in thrift2

2014-04-01 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10881:
--

Fix Version/s: 0.99.0

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff


 Support reverse scan in thrift2.





[jira] [Commented] (HBASE-10881) Support reverse scan in thrift2

2014-04-01 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956249#comment-13956249
 ] 

Liang Xie commented on HBASE-10881:
---

+1 for attached patch

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff


 Support reverse scan in thrift2.





[jira] [Commented] (HBASE-10881) Support reverse scan in thrift2

2014-04-01 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956263#comment-13956263
 ] 

Liang Xie commented on HBASE-10881:
---

[~liushaohui], just realized the old thrift interface (maybe we can call it 
thrift1?) also doesn't support reverse scan in the current codebase. Could you 
enhance it if you get a chance? Thanks!

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff


 Support reverse scan in thrift2.





[jira] [Created] (HBASE-10887) tidy ThriftUtilities format

2014-04-01 Thread Liang Xie (JIRA)
Liang Xie created HBASE-10887:
-

 Summary: tidy ThriftUtilities format
 Key: HBASE-10887
 URL: https://issues.apache.org/jira/browse/HBASE-10887
 Project: HBase
  Issue Type: Improvement
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Trivial
 Fix For: 0.99.0
 Attachments: HBASE-10887.txt

Just found some odd code formatting while reviewing another patch; let's 
remove the unnecessary tab.





[jira] [Updated] (HBASE-10887) tidy ThriftUtilities format

2014-04-01 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10887:
--

Attachment: HBASE-10887.txt

Will commit it shortly, since it is just a trivial change.

 tidy ThriftUtilities format
 ---

 Key: HBASE-10887
 URL: https://issues.apache.org/jira/browse/HBASE-10887
 Project: HBase
  Issue Type: Improvement
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Trivial
 Fix For: 0.99.0

 Attachments: HBASE-10887.txt


 Just found some odd code formatting while reviewing another patch; let's 
 remove the unnecessary tab.





[jira] [Updated] (HBASE-10887) tidy ThriftUtilities format

2014-04-01 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10887:
--

Fix Version/s: 0.98.1
Affects Version/s: 0.98.1
   Status: Patch Available  (was: Open)

 tidy ThriftUtilities format
 ---

 Key: HBASE-10887
 URL: https://issues.apache.org/jira/browse/HBASE-10887
 Project: HBase
  Issue Type: Improvement
  Components: Thrift
Affects Versions: 0.98.1, 0.99.0
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Trivial
 Fix For: 0.99.0, 0.98.1

 Attachments: HBASE-10887.txt


 Just found some odd code formatting while reviewing another patch; let's 
 remove the unnecessary tab.





[jira] [Updated] (HBASE-10887) tidy ThriftUtilities format

2014-04-01 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10887:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 tidy ThriftUtilities format
 ---

 Key: HBASE-10887
 URL: https://issues.apache.org/jira/browse/HBASE-10887
 Project: HBase
  Issue Type: Improvement
  Components: Thrift
Affects Versions: 0.98.1, 0.99.0
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Trivial
 Fix For: 0.99.0, 0.98.1

 Attachments: HBASE-10887.txt


 Just found some odd code formatting while reviewing another patch; let's 
 remove the unnecessary tab.





[jira] [Commented] (HBASE-10882) Bulkload process hangs on regions randomly and finally throws RegionTooBusyException

2014-04-01 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13956398#comment-13956398
 ] 

Liang Xie commented on HBASE-10882:
---

It would be better to ask on the mailing list in the future :)
Why do you have so many table pool instances?
bq. The most ridiculous thing is NO ONE OWNED THE LOCK! I searched the jstack 
output carefully, but could not find any thread that claimed to own the lock.
A thread dump doesn't show the holder of a java.util.concurrent Lock, but you 
can see it when synchronized is used. :)
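A small demonstration of that difference. A java.util.concurrent lock is not a monitor, so a plain thread dump prints waiters but no "locked <...>" line for the holder (though `jstack -l` can list held locks under "Locked ownable synchronizers"); the lock object itself can still be inspected programmatically:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: a background thread grabs the write lock and never releases it,
// simulating the stuck owner in the report above. The lock's own accessors
// reveal what a plain stack dump would not.
public class LockHolderDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
        Thread holder = new Thread(() -> {
            rwLock.writeLock().lock(); // acquired and deliberately never released
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        });
        holder.setDaemon(true);
        holder.start();
        Thread.sleep(200); // give the holder time to acquire the lock
        // Programmatic introspection of the lock state:
        System.out.println("write-locked:   " + rwLock.isWriteLocked());
        System.out.println("queued waiters: " + rwLock.getQueueLength());
    }
}
```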

 Bulkload process hangs on regions randomly and finally throws 
 RegionTooBusyException
 

 Key: HBASE-10882
 URL: https://issues.apache.org/jira/browse/HBASE-10882
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.10
 Environment: rhel 5.6, jdk1.7.0_45, hadoop-2.2.0-cdh5.0.0
Reporter: Victor Xu
 Attachments: jstack_5105.log


 I came across this problem in the early morning several days ago. It happened 
 when I used the hadoop completebulkload command to bulk load some hdfs files into 
 an hbase table. Several regions hung, and after three retries they all 
 threw RegionTooBusyExceptions. Fortunately, I caught the jstack info of one of 
 the affected regions' HRegionServer processes just in time.
 I found that the bulkload process was waiting for a write lock:
 at 
 java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
 The lock id is 0x0004054ecbf0.
 In the meantime, many other Get/Scan operations were also waiting for the 
 same lock id. And, of course, they were waiting for the read lock:
 at 
 java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:873)
 The most ridiculous thing is NO ONE OWNED THE LOCK! I searched the jstack 
 output carefully, but could not find any thread that claimed to own the lock.
 When I restarted the bulk load process, it failed at different regions but with 
 the same RegionTooBusyExceptions. 
 I guess maybe the region was doing some compactions at that time and owned 
 the lock, but I couldn't find compaction info in the hbase logs.
 Finally, after several days' hard work, the only temporary workaround found 
 was TRIGGERING A MAJOR COMPACTION BEFORE THE BULKLOAD. 
 So which process owned the lock? Has anyone come across the same problem 
 before?





[jira] [Commented] (HBASE-10881) Support reverse scan in thrift2

2014-04-01 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13957267#comment-13957267
 ] 

Liang Xie commented on HBASE-10881:
---

patch v2 lgtm
The failed test case should not be related to the current modification.

 Support reverse scan in thrift2
 ---

 Key: HBASE-10881
 URL: https://issues.apache.org/jira/browse/HBASE-10881
 Project: HBase
  Issue Type: New Feature
  Components: Thrift
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10881-trunk-v1.diff, HBASE-10881-trunk-v2.diff


 Support reverse scan in thrift2.





[jira] [Updated] (HBASE-10788) Add 99th percentile of latency in PE

2014-03-27 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10788:
--

   Resolution: Fixed
Fix Version/s: 0.99.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

 Add 99th percentile of latency in PE
 

 Key: HBASE-10788
 URL: https://issues.apache.org/jira/browse/HBASE-10788
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10788-trunk-v1.diff, HBASE-10788-trunk-v2.diff, 
 HBASE-10788-trunk-v3.diff


 In a production environment, the 99th percentile of latency matters more than 
 the average. The 99th percentile helps measure the influence of GC and slow 
 reads/writes in HDFS.
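 A sketch of the kind of p99 computation such a patch would add to PE: sort the 
 recorded latencies and read the value at the 99th-percentile rank. This uses 
 the simple nearest-rank method, not necessarily the exact formula in the patch:
 {code}
```java
import java.util.Arrays;

// Nearest-rank percentile over a sample of recorded latencies.
public class PercentileSketch {
    static long percentile(long[] latenciesMs, double pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // nearest-rank: ceil(pct/100 * N), converted to a 0-based index
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] latencies = new long[100];
        for (int i = 0; i < 100; i++) latencies[i] = i + 1; // 1..100 ms
        // The average would hide the tail; p99 exposes it:
        System.out.println("p99 = " + percentile(latencies, 99.0)); // 99
    }
}
```
 {code}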





[jira] [Commented] (HBASE-10788) Add 99th percentile of latency in PE

2014-03-27 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950298#comment-13950298
 ] 

Liang Xie commented on HBASE-10788:
---

Integrated into trunk, thanks all for review, thank you for the patch 
[~liushaohui] :)

 Add 99th percentile of latency in PE
 

 Key: HBASE-10788
 URL: https://issues.apache.org/jira/browse/HBASE-10788
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.99.0

 Attachments: HBASE-10788-trunk-v1.diff, HBASE-10788-trunk-v2.diff, 
 HBASE-10788-trunk-v3.diff


 In a production environment, the 99th percentile of latency matters more than 
 the average. The 99th percentile helps measure the influence of GC and slow 
 reads/writes in HDFS.





[jira] [Commented] (HBASE-9679) Binary search in HFile block

2014-03-27 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950306#comment-13950306
 ] 

Liang Xie commented on HBASE-9679:
--

When I filed this issue a long time ago, my thought was that we could 
implement an adaptive index algorithm similar to Cassandra's: if we detect a 
hotspot kv inside an HFile block across lots of read requests, we could build 
an index upon that kv. Most of the code to be modified would be in the HFile 
index class. But there is a little risk in doing it only in the 0.94 branch :) 
since we already have PrefixTree in later branches.

 Binary search in HFile block
 

 Key: HBASE-9679
 URL: https://issues.apache.org/jira/browse/HBASE-9679
 Project: HBase
  Issue Type: Improvement
  Components: HFile
Affects Versions: 0.95.2, 0.94.12
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Minor

 It's not a top-priority issue, it seems to me.
 Right now HBase does a linear scan to search for a key within an HFile block 
 of interest. In special cases, e.g. a 100% read scenario or a high read/write 
 ratio scenario, a binary search would be a useful improvement to reduce CPU 
 cost and response time; I think the biggest benefit would be the CPU :)
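 An illustrative sketch of the proposal. Real HFile blocks hold serialized 
 KeyValues without a per-entry offset index, which is exactly why this is 
 non-trivial; the sketch assumes a plain sorted array to show the O(log n) 
 comparison count:
 {code}
```java
// Binary search over a sorted block of keys, replacing a linear scan.
public class BlockSearchSketch {
    // Returns the index of key, or -1 if absent, in O(log n) comparisons.
    static int binarySearch(String[] sortedKeys, String key) {
        int lo = 0, hi = sortedKeys.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // overflow-safe midpoint
            int cmp = sortedKeys[mid].compareTo(key);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] keys = {"row-a", "row-c", "row-f", "row-k", "row-z"};
        System.out.println(binarySearch(keys, "row-f")); // 2
        System.out.println(binarySearch(keys, "row-b")); // -1 (not present)
    }
}
```
 {code}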





[jira] [Commented] (HBASE-10774) Restore TestMultiTableInputFormat

2014-03-25 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946289#comment-13946289
 ] 

Liang Xie commented on HBASE-10774:
---

Will commit HBASE-10774-trunk-v2.patch tomorrow if no objection.
[~liushaohui], let's keep an eye on this even after it is committed, 
considering the concerns raised above: 1) test running time on Jenkins, 
2) digging in if it fails on Apache after being re-enabled.

 Restore TestMultiTableInputFormat
 -

 Key: HBASE-10774
 URL: https://issues.apache.org/jira/browse/HBASE-10774
 Project: HBase
  Issue Type: Test
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10774-trunk-v2.diff, HBASE-10774-trunk-v2.patch, 
 HBASE-10774-v1.diff


 TestMultiTableInputFormat was removed in HBASE-9009 because this test made the 
 CI fail. But in HBASE-10692 we need to add a new test, 
 TestSecureMultiTableInputFormat, which depends on it. So we try to restore 
 it in this issue.
 I reran the test several times and it passed.
 {code}
 Running org.apache.hadoop.hbase.mapreduce.TestMultiTableInputFormat
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 314.163 sec
 {code}
 [~stack]





[jira] [Updated] (HBASE-10788) Add 99th percentile of latency in PE

2014-03-25 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10788:
--

Affects Version/s: 0.99.0
   Status: Patch Available  (was: Open)

 Add 99th percentile of latency in PE
 

 Key: HBASE-10788
 URL: https://issues.apache.org/jira/browse/HBASE-10788
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10788-trunk-v1.diff, HBASE-10788-trunk-v2.diff


 In a production environment, the 99th percentile of latency matters more than 
 the average. The 99th percentile helps measure the influence of GC and slow 
 reads/writes in HDFS.





[jira] [Commented] (HBASE-10788) Add 99th percentile of latency in PE

2014-03-25 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13946290#comment-13946290
 ] 

Liang Xie commented on HBASE-10788:
---

emmm... [~liushaohui] you forgot to click Submit Patch, so it's not on the 
radar:)

 Add 99th percentile of latency in PE
 

 Key: HBASE-10788
 URL: https://issues.apache.org/jira/browse/HBASE-10788
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10788-trunk-v1.diff, HBASE-10788-trunk-v2.diff


 In a production environment, the 99th percentile of latency matters more than 
 the average. The 99th percentile helps measure the influence of GC and slow 
 reads/writes in HDFS.





[jira] [Resolved] (HBASE-9679) Binary search in HFile block

2014-03-25 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie resolved HBASE-9679.
--

Resolution: Won't Fix

It seems not a good practice to do this in the 0.94 branch (0.96+ has 
PrefixTree), because it would need an HFile index code refactoring.
Closing it now to clean up JIRA :)

 Binary search in HFile block
 

 Key: HBASE-9679
 URL: https://issues.apache.org/jira/browse/HBASE-9679
 Project: HBase
  Issue Type: Improvement
  Components: HFile
Affects Versions: 0.95.2, 0.94.12
Reporter: Liang Xie
Assignee: Liang Xie
Priority: Minor

 It's not a top-priority issue, it seems to me.
 Right now HBase does a linear scan to search for a key within an HFile block 
 of interest. In special cases, e.g. a 100% read scenario or a high read/write 
 ratio scenario, a binary search would be a useful improvement to reduce CPU 
 cost and response time; I think the biggest benefit would be the CPU :)





[jira] [Commented] (HBASE-10788) Add 99th percentile of latency in PE

2014-03-25 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13947564#comment-13947564
 ] 

Liang Xie commented on HBASE-10788:
---

+1. will commit tomorrow if no objection

 Add 99th percentile of latency in PE
 

 Key: HBASE-10788
 URL: https://issues.apache.org/jira/browse/HBASE-10788
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10788-trunk-v1.diff, HBASE-10788-trunk-v2.diff


 In a production environment, the 99th percentile of latency matters more than 
 the average. The 99th percentile helps measure the influence of GC and slow 
 reads/writes in HDFS.





[jira] [Updated] (HBASE-10806) Two protos missing in hbase-protocol/pom.xml

2014-03-21 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10806:
--

Assignee: Liu Shaohui

 Two protos missing in hbase-protocol/pom.xml
 

 Key: HBASE-10806
 URL: https://issues.apache.org/jira/browse/HBASE-10806
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-10806-trunk-v1.diff


 VisibilityLabels.proto and Encryption.proto are missing from 
 hbase-protocol/pom.xml, so the corresponding classes are not regenerated by 
 the maven command:
 {code}
 mvn compile -Pcompile-protobuf
 {code} 





[jira] [Updated] (HBASE-10806) Two protos missing in hbase-protocol/pom.xml

2014-03-21 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10806:
--

Status: Patch Available  (was: Open)

 Two protos missing in hbase-protocol/pom.xml
 

 Key: HBASE-10806
 URL: https://issues.apache.org/jira/browse/HBASE-10806
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
 Attachments: HBASE-10806-trunk-v1.diff


 VisibilityLabels.proto and Encryption.proto are missing from 
 hbase-protocol/pom.xml, so the corresponding classes are not regenerated by 
 the maven command:
 {code}
 mvn compile -Pcompile-protobuf
 {code} 





[jira] [Updated] (HBASE-10782) Hadoop2 MR tests fail occasionally because of mapreduce.jobhistory.address is no set in job conf

2014-03-19 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10782:
--

   Resolution: Fixed
Fix Version/s: 0.98.2
   0.94.18
   0.99.0
   0.96.2
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

 Hadoop2 MR tests fail occasionally because of mapreduce.jobhistory.address is 
 no set in job conf
 

 Key: HBASE-10782
 URL: https://issues.apache.org/jira/browse/HBASE-10782
 Project: HBase
  Issue Type: Test
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.96.2, 0.99.0, 0.94.18, 0.98.2

 Attachments: HBASE-10782-trunk-v1.diff


 Hadoop2 MR tests fail occasionally with output like this:
 {code}
 ---
 Test set: org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1
 ---
 Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 347.57 sec 
  FAILURE!
 testScanEmptyToAPP(org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1)
   Time elapsed: 50.047 sec   ERROR!
 java.io.IOException: java.net.ConnectException: Call From 
 liushaohui-OptiPlex-990/127.0.0.1 to 0.0.0.0:10020 failed on connection 
 exception: java.net.ConnectException: Connection refused; For more details 
 see:  http://wiki.apache.org/hadoop/ConnectionRefused
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
   at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:524)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
   at java.security.AccessController.doPrivileged(Native Method)
  ...
 {code}
 The reason is that while an MR job is running, the job client pulls the job 
 status from the AppMaster. When the job completes, the AppMaster exits. At 
 that point, if the job client has not yet received the job-completed event 
 from the AppMaster, it switches to getting the job report from the history 
 server. 
 But in HBaseTestingUtility#startMiniMapReduceCluster, the config 
 mapreduce.jobhistory.address is not copied to the TestUtil's config.
  
 CRUNCH-249 reported the same problem.
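 The fix amounts to propagating that one key. A minimal standalone sketch, with 
 plain Maps standing in for Hadoop's Configuration objects (the method name is 
 illustrative, not the actual patch code):
 {code}
```java
import java.util.HashMap;
import java.util.Map;

// Sketch: copy the jobhistory address that the mini MR cluster chose into the
// test utility's conf; otherwise the job client falls back to the default
// 0.0.0.0:10020 and the connection is refused.
public class ConfCopySketch {
    static final String JOBHISTORY_KEY = "mapreduce.jobhistory.address";

    static void copyJobHistoryAddress(Map<String, String> miniClusterConf,
                                      Map<String, String> testUtilConf) {
        String addr = miniClusterConf.get(JOBHISTORY_KEY);
        if (addr != null) {
            testUtilConf.put(JOBHISTORY_KEY, addr); // propagate, don't leave the default
        }
    }

    public static void main(String[] args) {
        Map<String, String> mrConf = new HashMap<>();
        mrConf.put(JOBHISTORY_KEY, "localhost:10033"); // address picked by the mini cluster
        Map<String, String> testConf = new HashMap<>();
        copyJobHistoryAddress(mrConf, testConf);
        System.out.println(testConf.get(JOBHISTORY_KEY)); // localhost:10033
    }
}
```
 {code}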





[jira] [Commented] (HBASE-10782) Hadoop2 MR tests fail occasionally because of mapreduce.jobhistory.address is no set in job conf

2014-03-19 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941357#comment-13941357
 ] 

Liang Xie commented on HBASE-10782:
---

Integrated into trunk and the 0.94/0.96/0.98 branches. Thanks [~liushaohui] 
for your finding and patch! Thanks all for the review.

 Hadoop2 MR tests fail occasionally because of mapreduce.jobhistory.address is 
 no set in job conf
 

 Key: HBASE-10782
 URL: https://issues.apache.org/jira/browse/HBASE-10782
 Project: HBase
  Issue Type: Test
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.96.2, 0.99.0, 0.94.18, 0.98.2

 Attachments: HBASE-10782-trunk-v1.diff


 Hadoop2 MR tests fail occasionally with output like this:
 {code}
 ---
 Test set: org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1
 ---
 Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 347.57 sec 
  FAILURE!
 testScanEmptyToAPP(org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1)
   Time elapsed: 50.047 sec   ERROR!
 java.io.IOException: java.net.ConnectException: Call From 
 liushaohui-OptiPlex-990/127.0.0.1 to 0.0.0.0:10020 failed on connection 
 exception: java.net.ConnectException: Connection refused; For more details 
 see:  http://wiki.apache.org/hadoop/ConnectionRefused
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
   at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:524)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
   at java.security.AccessController.doPrivileged(Native Method)
  ...
 {code}
 The reason is that while an MR job is running, the job client pulls the job 
 status from the AppMaster. When the job completes, the AppMaster exits. At 
 that point, if the job client has not yet received the job-completed event 
 from the AppMaster, it switches to getting the job report from the history 
 server. 
 But in HBaseTestingUtility#startMiniMapReduceCluster, the config 
 mapreduce.jobhistory.address is not copied to the TestUtil's config.
  
 CRUNCH-249 reported the same problem.





[jira] [Commented] (HBASE-10774) Restore TestMultiTableInputFormat

2014-03-19 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941365#comment-13941365
 ] 

Liang Xie commented on HBASE-10774:
---

+1 for v2.
[~saint@gmail.com], any thoughts about letting this in again?  Shaohui did 
some minor improvements to reduce the test duration:
{quote}
(1) set the region num to 4, as small as possible
(2) set writeToWAL = false in loadTable
(3) remove test testScanOPPToEmpty, which duplicates testScanYZYToEmpty
{quote}

 Restore TestMultiTableInputFormat
 -

 Key: HBASE-10774
 URL: https://issues.apache.org/jira/browse/HBASE-10774
 Project: HBase
  Issue Type: Test
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10774-trunk-v2.diff, HBASE-10774-v1.diff


 TestMultiTableInputFormat was removed in HBASE-9009 because this test made the 
 CI fail. But in HBASE-10692 we need to add a new test, 
 TestSecureMultiTableInputFormat, which depends on it. So we try to restore 
 it in this issue.
 I reran the test several times and it passed.
 {code}
 Running org.apache.hadoop.hbase.mapreduce.TestMultiTableInputFormat
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 314.163 sec
 {code}
 [~stack]





[jira] [Commented] (HBASE-10796) Set default log level as INFO

2014-03-19 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941366#comment-13941366
 ] 

Liang Xie commented on HBASE-10796:
---

+1 for this idea.  We have set the INFO level in ALL production clusters and 
DEBUG in testing clusters in the past :)

 Set default log level as INFO
 -

 Key: HBASE-10796
 URL: https://issues.apache.org/jira/browse/HBASE-10796
 Project: HBase
  Issue Type: Task
Reporter: stack
Assignee: stack

 When we roll out 1.0, the log level should be INFO-level by default, not 
 DEBUG. 
 Proposed on mailing list here 
 http://search-hadoop.com/m/33P7E1GL08b/hbase+1.0subj=DISCUSSION+1+0+0 and at 
 least one other +1 with no objection.





[jira] [Updated] (HBASE-10790) make assembly:single as default in pom.xml

2014-03-19 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10790:
--

Assignee: Liu Shaohui

 make assembly:single as default in pom.xml
 --

 Key: HBASE-10790
 URL: https://issues.apache.org/jira/browse/HBASE-10790
 Project: HBase
  Issue Type: Improvement
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10790-trunk-v1.diff


 Now, to build an HBase tar release package, we have to use the command:
 {code}
 mvn clean package assembly:single
 {code}
 which is not convenient. We can make assembly:single the default by running the 
 assembly plugin in the Maven package phase. Then we can simply use the command 
 {code}mvn clean package{code} to get a release package.
 Other suggestions are welcome.
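For illustration, binding the assembly plugin's single goal to the package phase in pom.xml would look roughly like this (a sketch assuming an assembly descriptor is already configured elsewhere; the actual change is in the attached patch):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <executions>
    <execution>
      <id>make-assembly</id>
      <!-- run assembly:single automatically during "mvn package" -->
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```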





[jira] [Commented] (HBASE-10782) Hadoop2 MR tests fail occasionally because mapreduce.jobhistory.address is not set in job conf

2014-03-19 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941375#comment-13941375
 ] 

Liang Xie commented on HBASE-10782:
---

Oh, [~lhofhansl] [~saint@gmail.com], sorry, I only just saw your release RC 
mail; please feel free to change the Fix Version/s field if it's not correct...

 Hadoop2 MR tests fail occasionally because mapreduce.jobhistory.address is 
 not set in job conf
 

 Key: HBASE-10782
 URL: https://issues.apache.org/jira/browse/HBASE-10782
 Project: HBase
  Issue Type: Test
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 0.96.2, 0.99.0, 0.94.18, 0.98.2

 Attachments: HBASE-10782-trunk-v1.diff


 Hadoop2 MR tests fail occasionally with output like this:
 {code}
 ---
 Test set: org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1
 ---
 Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 347.57 sec 
  FAILURE!
 testScanEmptyToAPP(org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1)
   Time elapsed: 50.047 sec   ERROR!
 java.io.IOException: java.net.ConnectException: Call From 
 liushaohui-OptiPlex-990/127.0.0.1 to 0.0.0.0:10020 failed on connection 
 exception: java.net.ConnectException: Connection refused; For more details 
 see:  http://wiki.apache.org/hadoop/ConnectionRefused
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
   at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:524)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
   at java.security.AccessController.doPrivileged(Native Method)
  ...
 {code}
 The reason is that while the MR job is running, the job client pulls the job 
 status from the AppMaster. When the job completes, the AppMaster exits. If, at 
 that point, the job client has not yet received the job-completed event from the 
 AppMaster, it switches to fetching the job report from the history server. 
 But in HBaseTestingUtility#startMiniMapReduceCluster, the config 
 mapreduce.jobhistory.address is not copied into the TestUtil's config.
  
 CRUNCH-249 reported the same problem.





[jira] [Commented] (HBASE-10782) Hadoop2 MR tests fail occasionally because mapreduce.jobhistory.address is not set in job conf

2014-03-18 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940152#comment-13940152
 ] 

Liang Xie commented on HBASE-10782:
---

+1, will commit it tomorrow if there is no objection

 Hadoop2 MR tests fail occasionally because mapreduce.jobhistory.address is 
 not set in job conf
 

 Key: HBASE-10782
 URL: https://issues.apache.org/jira/browse/HBASE-10782
 Project: HBase
  Issue Type: Test
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10782-trunk-v1.diff


 Hadoop2 MR tests fail occasionally with output like this:
 {code}
 ---
 Test set: org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1
 ---
 Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 347.57 sec 
  FAILURE!
 testScanEmptyToAPP(org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1)
   Time elapsed: 50.047 sec   ERROR!
 java.io.IOException: java.net.ConnectException: Call From 
 liushaohui-OptiPlex-990/127.0.0.1 to 0.0.0.0:10020 failed on connection 
 exception: java.net.ConnectException: Connection refused; For more details 
 see:  http://wiki.apache.org/hadoop/ConnectionRefused
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334)
   at 
 org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
   at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:524)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
   at java.security.AccessController.doPrivileged(Native Method)
  ...
 {code}
 The reason is that while the MR job is running, the job client pulls the job 
 status from the AppMaster. When the job completes, the AppMaster exits. If, at 
 that point, the job client has not yet received the job-completed event from the 
 AppMaster, it switches to fetching the job report from the history server. 
 But in HBaseTestingUtility#startMiniMapReduceCluster, the config 
 mapreduce.jobhistory.address is not copied into the TestUtil's config.
  
 CRUNCH-249 reported the same problem.





[jira] [Commented] (HBASE-10774) Restore TestMultiTableInputFormat

2014-03-18 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940178#comment-13940178
 ] 

Liang Xie commented on HBASE-10774:
---

I assume the long test duration is not a blocker for the current JIRA; it could 
probably be tracked in another new JIRA. What do you think, Andy?
I say this because, per the description, HBASE-10692 depends on the current JIRA, 
so per my understanding we can let this in now.

 Restore TestMultiTableInputFormat
 -

 Key: HBASE-10774
 URL: https://issues.apache.org/jira/browse/HBASE-10774
 Project: HBase
  Issue Type: Test
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10774-trunk-v2.diff, HBASE-10774-v1.diff


 TestMultiTableInputFormat was removed in HBASE-9009 because the test made the CI 
 fail. But in HBASE-10692 we need to add a new test, 
 TestSecureMultiTableInputFormat, which depends on it. So we try to restore 
 it in this issue.
 I reran the test several times and it passed.
 {code}
 Running org.apache.hadoop.hbase.mapreduce.TestMultiTableInputFormat
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 314.163 sec
 {code}
 [~stack]





[jira] [Commented] (HBASE-10788) Add 99th percentile of latency in PE

2014-03-18 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940190#comment-13940190
 ] 

Liang Xie commented on HBASE-10788:
---

FYI, Nick has already done good latency work inside PE; see HBASE-10007. So is 
your plan to extend it to more operations, or something else?

 Add 99th percentile of latency in PE
 

 Key: HBASE-10788
 URL: https://issues.apache.org/jira/browse/HBASE-10788
 Project: HBase
  Issue Type: Improvement
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10788-trunk-v1.diff


 In a production environment, the 99th percentile of latency is more important 
 than the average. The 99th percentile helps measure the influence of GC and of 
 slow HDFS reads/writes.
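As an aside, here is a nearest-rank sketch of how a 99th percentile could be computed from recorded latency samples (illustrative only; PE's actual histogram-based implementation from HBASE-10007 differs):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort the samples and pick the value at
    // ceil(p/100 * n) - 1. For p=99 this is the latency that 99% of
    // operations were at or below.
    static long percentile(long[] latenciesMs, double p) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        for (int i = 0; i < 100; i++) samples[i] = i + 1;   // 1..100 ms
        System.out.println(percentile(samples, 99));        // prints 99
        System.out.println(percentile(samples, 50));        // prints 50
    }
}
```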





[jira] [Updated] (HBASE-10774) Restore TestMultiTableInputFormat

2014-03-17 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10774:
--

Assignee: Liu Shaohui

 Restore TestMultiTableInputFormat
 -

 Key: HBASE-10774
 URL: https://issues.apache.org/jira/browse/HBASE-10774
 Project: HBase
  Issue Type: Test
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10774-v1.diff


 TestMultiTableInputFormat was removed in HBASE-9009 because the test made the CI 
 fail. But in HBASE-10692 we need to add a new test, 
 TestSecureMultiTableInputFormat, which depends on it. So we try to restore 
 it in this issue.
 I reran the test several times and it passed.
 {code}
 Running org.apache.hadoop.hbase.mapreduce.TestMultiTableInputFormat
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 314.163 sec
 {code}
 [~stack]





[jira] [Updated] (HBASE-10774) Restore TestMultiTableInputFormat

2014-03-17 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10774:
--

Affects Version/s: 0.99.0
   Status: Patch Available  (was: Open)

 Restore TestMultiTableInputFormat
 -

 Key: HBASE-10774
 URL: https://issues.apache.org/jira/browse/HBASE-10774
 Project: HBase
  Issue Type: Test
Affects Versions: 0.99.0
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: HBASE-10774-v1.diff


 TestMultiTableInputFormat was removed in HBASE-9009 because the test made the CI 
 fail. But in HBASE-10692 we need to add a new test, 
 TestSecureMultiTableInputFormat, which depends on it. So we try to restore 
 it in this issue.
 I reran the test several times and it passed.
 {code}
 Running org.apache.hadoop.hbase.mapreduce.TestMultiTableInputFormat
 Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 314.163 sec
 {code}
 [~stack]





[jira] [Updated] (HBASE-8894) Forward port compressed l2 cache from 0.89fb

2014-03-01 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-8894:
-

Assignee: (was: Liang Xie)

 Forward port compressed l2 cache from 0.89fb
 

 Key: HBASE-8894
 URL: https://issues.apache.org/jira/browse/HBASE-8894
 Project: HBase
  Issue Type: New Feature
Reporter: stack
Priority: Critical
 Attachments: HBASE-8894-0.94-v1.txt, HBASE-8894-0.94-v2.txt


 Forward port Alex's improvement on hbase-7407 from 0.89-fb branch:
 {code}
 r1492797 | liyin | 2013-06-13 11:18:20 -0700 (Thu, 13 Jun 2013) | 43 lines

 [master] Implements a secondary compressed cache (L2 cache)

 Author: avf

 Summary:
 This revision implements a compressed and encoded second-level cache with
 off-heap (and optionally on-heap) storage and a bucket-allocator based on
 HBASE-7404.

 BucketCache from HBASE-7404 is extensively modified to:

 * Only handle byte arrays (i.e., no more serialization/deserialization within)
 * Remove persistence support for the time being
 * Keep an index of hfilename to blocks for efficient eviction on close

 A new interface (L2Cache) is introduced in order to separate it from the
 current implementation. The L2 cache is then integrated into the classes that
 handle reading from and writing to HFiles to allow cache-on-write as well as
 cache-on-read. Metrics for the L2 cache are integrated into RegionServerMetrics
 much in the same fashion as metrics for the existing (L1) BlockCache.

 Additionally, the CacheConfig class is refactored to configure the L2 cache,
 replace multiple constructors with a Builder, and replace static methods
 for instantiating the caches with abstract factories (with singleton
 implementations for both the existing LruBlockCache and the newly introduced
 BucketCache-based L2 cache)

 Test Plan:
 1) Additional unit tests
 2) Stress test on a single devserver
 3) Test on a single-node in shadow cluster
 4) Test on a whole shadow cluster

 Revert Plan:

 Reviewers: liyintang, aaiyer, rshroff, manukranthk, adela

 Reviewed By: liyintang

 CC: gqchen, hbase-eng@

 Differential Revision: https://phabricator.fb.com/D837264

 Task ID: 2325295

 r1492340 | liyin | 2013-06-12 11:36:03 -0700 (Wed, 12 Jun 2013) | 21 lines
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-8894) Forward port compressed l2 cache from 0.89fb

2014-03-01 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917283#comment-13917283
 ] 

Liang Xie commented on HBASE-8894:
--

[~sudarsh], sorry for the late reply; I have not been active on this JIRA lately 
because, after several rounds of raw perf testing, we saw no further GC benefit 
in our scenarios compared with the other bucket cache feature. Unassigned it 
from myself now.

 Forward port compressed l2 cache from 0.89fb
 

 Key: HBASE-8894
 URL: https://issues.apache.org/jira/browse/HBASE-8894
 Project: HBase
  Issue Type: New Feature
Reporter: stack
Priority: Critical
 Attachments: HBASE-8894-0.94-v1.txt, HBASE-8894-0.94-v2.txt


 Forward port Alex's improvement on hbase-7407 from 0.89-fb branch:
 {code}
 r1492797 | liyin | 2013-06-13 11:18:20 -0700 (Thu, 13 Jun 2013) | 43 lines

 [master] Implements a secondary compressed cache (L2 cache)

 Author: avf

 Summary:
 This revision implements a compressed and encoded second-level cache with
 off-heap (and optionally on-heap) storage and a bucket-allocator based on
 HBASE-7404.

 BucketCache from HBASE-7404 is extensively modified to:

 * Only handle byte arrays (i.e., no more serialization/deserialization within)
 * Remove persistence support for the time being
 * Keep an index of hfilename to blocks for efficient eviction on close

 A new interface (L2Cache) is introduced in order to separate it from the
 current implementation. The L2 cache is then integrated into the classes that
 handle reading from and writing to HFiles to allow cache-on-write as well as
 cache-on-read. Metrics for the L2 cache are integrated into RegionServerMetrics
 much in the same fashion as metrics for the existing (L1) BlockCache.

 Additionally, the CacheConfig class is refactored to configure the L2 cache,
 replace multiple constructors with a Builder, and replace static methods
 for instantiating the caches with abstract factories (with singleton
 implementations for both the existing LruBlockCache and the newly introduced
 BucketCache-based L2 cache)

 Test Plan:
 1) Additional unit tests
 2) Stress test on a single devserver
 3) Test on a single-node in shadow cluster
 4) Test on a whole shadow cluster

 Revert Plan:

 Reviewers: liyintang, aaiyer, rshroff, manukranthk, adela

 Reviewed By: liyintang

 CC: gqchen, hbase-eng@

 Differential Revision: https://phabricator.fb.com/D837264

 Task ID: 2325295

 r1492340 | liyin | 2013-06-12 11:36:03 -0700 (Wed, 12 Jun 2013) | 21 lines
 {code}





[jira] [Updated] (HBASE-10088) SecureClient will hang when access secure-disabled cluster

2014-02-27 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10088:
--

Assignee: cuijianwei

 SecureClient will hang when access secure-disabled cluster
 --

 Key: HBASE-10088
 URL: https://issues.apache.org/jira/browse/HBASE-10088
 Project: HBase
  Issue Type: Improvement
  Components: Client, security
Affects Versions: 0.94.14
Reporter: cuijianwei
Assignee: cuijianwei
 Attachments: HBASE-10088-0.94-v1.patch, HBASE-10088-0.94-v2.patch, 
 HBASE-10088-0.94-v2.patch


 When I misused a secure HBase client to access a secure-disabled HBase server, 
 I found that the client hangs. The reason is that the client first invokes the 
 rpc method getProtocolVersion, and the response from a secure-disabled 
 server won't contain the necessary fields processed by SecureClient. 
 SecureClient processes the response as follows (from 
 SecureClient.receiveResponse()):
 {code}
 if (state == Status.SUCCESS.state) {
   Writable value = ReflectionUtils.newInstance(valueClass, conf);
   value.readFields(in); // read value
   if (LOG.isDebugEnabled()) {
     LOG.debug("call #" + id + ", response is:\n" + value.toString());
   }
   // it's possible that this call may have been cleaned up due to a RPC
   // timeout, so check if it still exists before setting the value.
   if (call != null) {
     call.setValue(value);
   }
 } else if (state == Status.ERROR.state) {
   if (call != null) {
     call.setException(new RemoteException(WritableUtils.readString(in),
         WritableUtils.readString(in)));
   }
 } else if (state == Status.FATAL.state) {
   RemoteException exception = new RemoteException(WritableUtils.readString(in),
       WritableUtils.readString(in));
   // the call will be removed from the call map, so we must set the exception
   // here to notify the thread waiting on the call
   if (call != null) {
     call.setException(exception);
   }
   // Close the connection
   markClosed(exception);
 }
 calls.remove(id);
 {code}
 As the above code shows, SecureClient needs to read the 'state' field from the 
 response. If the response is from a secure-disabled server, there will be no 
 'state' field, so SecureClient will read an illegal 'state' and the call will 
 be removed from the cached calls without notifying the waiting thread. This 
 makes the invoker wait forever. Although we should not use a secure client to 
 access a secure-disabled server, users might run into this situation through 
 misuse or misconfiguration. If the client just hangs, users might not 
 discover the error quickly. It would be better to report an error in this 
 situation so that users know what happened right away.
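A toy sketch of the fail-fast idea (hypothetical names, not the actual patch): on an unrecognized response state, complete the pending call exceptionally instead of silently discarding it, so the caller's get() returns an error immediately rather than hanging forever. Here the pending call is modeled with a CompletableFuture:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class FailFastDemo {
    // Hypothetical response handler: state 0 stands in for Status.SUCCESS;
    // anything else models an unparseable response (e.g. from a
    // secure-disabled server).
    static String handleResponse(int state, CompletableFuture<String> call) throws Exception {
        if (state == 0) {
            call.complete("value");
        } else {
            // Fail fast: wake the waiting thread with an error instead of
            // dropping the call and leaving it blocked forever.
            call.completeExceptionally(
                new java.io.IOException("unexpected response state: " + state));
        }
        return call.get(); // would block forever if the call were just dropped
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handleResponse(0, new CompletableFuture<>()));
        try {
            handleResponse(42, new CompletableFuture<>());
        } catch (ExecutionException e) {
            System.out.println("failed fast: " + e.getCause().getMessage());
        }
    }
}
```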





[jira] [Updated] (HBASE-10088) SecureClient will hang when access secure-disabled cluster

2014-02-27 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10088:
--

Status: Patch Available  (was: Open)

Kick off the QA robot :)

 SecureClient will hang when access secure-disabled cluster
 --

 Key: HBASE-10088
 URL: https://issues.apache.org/jira/browse/HBASE-10088
 Project: HBase
  Issue Type: Improvement
  Components: Client, security
Affects Versions: 0.94.14
Reporter: cuijianwei
Assignee: cuijianwei
 Attachments: HBASE-10088-0.94-v1.patch, HBASE-10088-0.94-v2.patch, 
 HBASE-10088-0.94-v2.patch


 When I misused a secure HBase client to access a secure-disabled HBase server, 
 I found that the client hangs. The reason is that the client first invokes the 
 rpc method getProtocolVersion, and the response from a secure-disabled 
 server won't contain the necessary fields processed by SecureClient. 
 SecureClient processes the response as follows (from 
 SecureClient.receiveResponse()):
 {code}
 if (state == Status.SUCCESS.state) {
   Writable value = ReflectionUtils.newInstance(valueClass, conf);
   value.readFields(in); // read value
   if (LOG.isDebugEnabled()) {
     LOG.debug("call #" + id + ", response is:\n" + value.toString());
   }
   // it's possible that this call may have been cleaned up due to a RPC
   // timeout, so check if it still exists before setting the value.
   if (call != null) {
     call.setValue(value);
   }
 } else if (state == Status.ERROR.state) {
   if (call != null) {
     call.setException(new RemoteException(WritableUtils.readString(in),
         WritableUtils.readString(in)));
   }
 } else if (state == Status.FATAL.state) {
   RemoteException exception = new RemoteException(WritableUtils.readString(in),
       WritableUtils.readString(in));
   // the call will be removed from the call map, so we must set the exception
   // here to notify the thread waiting on the call
   if (call != null) {
     call.setException(exception);
   }
   // Close the connection
   markClosed(exception);
 }
 calls.remove(id);
 {code}
 As the above code shows, SecureClient needs to read the 'state' field from the 
 response. If the response is from a secure-disabled server, there will be no 
 'state' field, so SecureClient will read an illegal 'state' and the call will 
 be removed from the cached calls without notifying the waiting thread. This 
 makes the invoker wait forever. Although we should not use a secure client to 
 access a secure-disabled server, users might run into this situation through 
 misuse or misconfiguration. If the client just hangs, users might not 
 discover the error quickly. It would be better to report an error in this 
 situation so that users know what happened right away.





[jira] [Commented] (HBASE-10088) SecureClient will hang when access secure-disabled cluster

2014-02-27 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13914465#comment-13914465
 ] 

Liang Xie commented on HBASE-10088:
---

Oh, mind uploading a trunk patch if possible, jianwei? :)

 SecureClient will hang when access secure-disabled cluster
 --

 Key: HBASE-10088
 URL: https://issues.apache.org/jira/browse/HBASE-10088
 Project: HBase
  Issue Type: Improvement
  Components: Client, security
Affects Versions: 0.94.14
Reporter: cuijianwei
Assignee: cuijianwei
 Attachments: HBASE-10088-0.94-v1.patch, HBASE-10088-0.94-v2.patch, 
 HBASE-10088-0.94-v2.patch


 When I misused a secure HBase client to access a secure-disabled HBase server, 
 I found that the client hangs. The reason is that the client first invokes the 
 rpc method getProtocolVersion, and the response from a secure-disabled 
 server won't contain the necessary fields processed by SecureClient. 
 SecureClient processes the response as follows (from 
 SecureClient.receiveResponse()):
 {code}
 if (state == Status.SUCCESS.state) {
   Writable value = ReflectionUtils.newInstance(valueClass, conf);
   value.readFields(in); // read value
   if (LOG.isDebugEnabled()) {
     LOG.debug("call #" + id + ", response is:\n" + value.toString());
   }
   // it's possible that this call may have been cleaned up due to a RPC
   // timeout, so check if it still exists before setting the value.
   if (call != null) {
     call.setValue(value);
   }
 } else if (state == Status.ERROR.state) {
   if (call != null) {
     call.setException(new RemoteException(WritableUtils.readString(in),
         WritableUtils.readString(in)));
   }
 } else if (state == Status.FATAL.state) {
   RemoteException exception = new RemoteException(WritableUtils.readString(in),
       WritableUtils.readString(in));
   // the call will be removed from the call map, so we must set the exception
   // here to notify the thread waiting on the call
   if (call != null) {
     call.setException(exception);
   }
   // Close the connection
   markClosed(exception);
 }
 calls.remove(id);
 {code}
 As the above code shows, SecureClient needs to read the 'state' field from the 
 response. If the response is from a secure-disabled server, there will be no 
 'state' field, so SecureClient will read an illegal 'state' and the call will 
 be removed from the cached calls without notifying the waiting thread. This 
 makes the invoker wait forever. Although we should not use a secure client to 
 access a secure-disabled server, users might run into this situation through 
 misuse or misconfiguration. If the client just hangs, users might not 
 discover the error quickly. It would be better to report an error in this 
 situation so that users know what happened right away.





[jira] [Updated] (HBASE-10598) Written data can not be read out because MemStore#timeRangeTracker might be updated concurrently

2014-02-24 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10598:
--

Priority: Critical  (was: Major)

 Written data can not be read out because MemStore#timeRangeTracker might be 
 updated concurrently
 

 Key: HBASE-10598
 URL: https://issues.apache.org/jira/browse/HBASE-10598
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.16
Reporter: cuijianwei
Assignee: cuijianwei
Priority: Critical
 Attachments: HBASE-10598-0.94.v1.patch


 In our test environment, we occasionally find that written data can't be read 
 out. After debugging, we found that the maximumTimestamp/minimumTimestamp of 
 MemStore#timeRangeTracker might decrease/increase when 
 MemStore#timeRangeTracker is updated concurrently, which can cause the 
 MemStore/StoreFile to be filtered out incorrectly when reading data. Let's 
 see how a concurrent update of timeRangeTracker#maximumTimestamp causes 
 this problem.
 Imagine two threads T1 and T2 putting two KeyValues kv1 and kv2. 
 kv1 and kv2 belong to the same Store (so to the same region) but 
 contain different rowkeys; consequently, they can be applied 
 concurrently. Looking at the implementation of HRegionServer#multi, kv1 and 
 kv2 are added to the MemStore by HRegion#applyFamilyMapToMemstore in 
 HRegion#doMiniBatchMutation. Then MemStore#internalAdd is invoked and 
 MemStore#timeRangeTracker is updated by 
 TimeRangeTracker#includeTimestamp as follows:
 {code}
   private void includeTimestamp(final long timestamp) {
     ...
     else if (maximumTimestamp < timestamp) {
       maximumTimestamp = timestamp;
     }
     return;
   }
 {code}
 Imagine the current maximumTimestamp of the TimeRangeTracker is t0 before 
 includeTimestamp(...) is invoked, kv1.timestamp=t1, kv2.timestamp=t2, t1 and t2 
 are both set by the user (so the user knows the timestamps of kv1 and kv2), and 
 t1 > t2 > t0. T1 and T2 run concurrently, so both threads might find the 
 current maximumTimestamp less than the timestamp of their kv. After that, T1 
 and T2 will both set maximumTimestamp to the timestamp of their kv. If T1 sets 
 maximumTimestamp before T2 does, maximumTimestamp ends up at t2. Then, before 
 any new update with a bigger timestamp is applied to the MemStore, if we try to 
 read kv1 by HTable#get with the timestamp of the 'Get' set to t1, the 
 StoreScanner will decide whether the MemStoreScanner (imagining kv1 has not 
 been flushed) should be selected as a candidate scanner via 
 MemStoreScanner#shouldUseScanner. The MemStore won't be selected in 
 MemStoreScanner#shouldUseScanner because the maximumTimestamp of the MemStore 
 has been set to t2 (t2 < t1). Consequently, the written kv1 can't be read out 
 and kv1 is lost from the user's perspective.
 If the above analysis is right, after the maximumTimestamp of 
 MemStore#timeRangeTracker has been set to t2, users will experience data loss 
 in the following situations:
 1. Before any new write with kv.timestamp > t1 has been added to the MemStore, 
 a read request for kv1 with timestamp=t1 cannot read kv1 out.
 2. Before any new write with kv.timestamp > t1 has been added to the MemStore, 
 if a flush happens, the data of the MemStore will be flushed to a StoreFile 
 with StoreFile#maximumTimestamp set to t2. After that, any read request with 
 timestamp=t1 cannot read kv1 before the next compaction (actually, 
 kv1.timestamp might not be included in the timeRange of the StoreFile even 
 after compaction).
 The second situation is much more serious because the incorrect timeRange of 
 the MemStore has been persisted to the file. 
 Similarly, a concurrent update of TimeRangeTracker#minimumTimestamp may 
 also cause this problem.
 As a simple fix, we could add synchronized to 
 TimeRangeTracker#includeTimestamp so that the method cannot be invoked 
 concurrently.
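The proposed fix, sketched on a toy tracker (illustrative, not the actual HBase class): making includeTimestamp synchronized ensures the check and the assignment happen atomically, so a slower thread carrying a smaller timestamp can no longer overwrite a larger maximumTimestamp.

```java
public class ToyTimeRangeTracker {
    private long minimumTimestamp = Long.MAX_VALUE;
    private long maximumTimestamp = Long.MIN_VALUE;

    // synchronized makes the compare-and-update atomic; without it, two threads
    // can both pass the comparison and the one holding the smaller timestamp
    // may write last, shrinking maximumTimestamp (the bug described above).
    public synchronized void includeTimestamp(long timestamp) {
        if (timestamp < minimumTimestamp) minimumTimestamp = timestamp;
        if (timestamp > maximumTimestamp) maximumTimestamp = timestamp;
    }

    public synchronized long getMax() { return maximumTimestamp; }
    public synchronized long getMin() { return minimumTimestamp; }

    public static void main(String[] args) throws InterruptedException {
        ToyTimeRangeTracker tracker = new ToyTimeRangeTracker();
        // Two concurrent writers with t1 > t2, as in the scenario above.
        Thread t1 = new Thread(() -> tracker.includeTimestamp(200L));
        Thread t2 = new Thread(() -> tracker.includeTimestamp(100L));
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(tracker.getMax()); // prints 200 with synchronization
    }
}
```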





[jira] [Updated] (HBASE-10598) Written data can not be read out because MemStore#timeRangeTracker might be updated concurrently

2014-02-24 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HBASE-10598:
--

Assignee: cuijianwei

 Written data can not be read out because MemStore#timeRangeTracker might be 
 updated concurrently
 

 Key: HBASE-10598
 URL: https://issues.apache.org/jira/browse/HBASE-10598
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.16
Reporter: cuijianwei
Assignee: cuijianwei
 Attachments: HBASE-10598-0.94.v1.patch


 In our test environment, we occasionally find that written data can't be read 
 out. After debugging, we found that the maximumTimestamp/minimumTimestamp of 
 MemStore#timeRangeTracker might decrease/increase when 
 MemStore#timeRangeTracker is updated concurrently, which can cause the 
 MemStore/StoreFile to be filtered out incorrectly when reading data. Let's 
 see how a concurrent update of timeRangeTracker#maximumTimestamp causes 
 this problem.
 Imagine two threads T1 and T2 putting two KeyValues kv1 and kv2. 
 kv1 and kv2 belong to the same Store (so to the same region) but 
 contain different rowkeys; consequently, they can be applied 
 concurrently. Looking at the implementation of HRegionServer#multi, kv1 and 
 kv2 are added to the MemStore by HRegion#applyFamilyMapToMemstore in 
 HRegion#doMiniBatchMutation. Then MemStore#internalAdd is invoked and 
 MemStore#timeRangeTracker is updated by 
 TimeRangeTracker#includeTimestamp as follows:
 {code}
   private void includeTimestamp(final long timestamp) {
     ...
     else if (maximumTimestamp < timestamp) {
       maximumTimestamp = timestamp;
     }
     return;
   }
 {code}
 Imagine the current maximumTimestamp of TimeRangeTracker is t0 before 
 includeTimestamp(...) is invoked, kv1.timestamp=t1, kv2.timestamp=t2, t1 and 
 t2 are both set by the user (so the user knows the timestamps of kv1 and 
 kv2), and t1 > t2 > t0. T1 and T2 execute concurrently, so both threads 
 might find the current maximumTimestamp less than the timestamp of their kv. 
 After that, T1 and T2 will both set maximumTimestamp to the timestamp of 
 their kv. If T1 sets maximumTimestamp before T2 does, maximumTimestamp ends 
 up as t2. Then, before any new update with a bigger timestamp has been 
 applied to the MemStore, if we try to read kv1 via HTable#get with the 
 timestamp of the 'Get' set to t1, the StoreScanner decides whether the 
 MemStoreScanner (assuming kv1 has not been flushed) should be selected as a 
 candidate scanner via MemStoreScanner#shouldUseScanner. The MemStore won't 
 be selected in MemStoreScanner#shouldUseScanner because the 
 maximumTimestamp of the MemStore has been set to t2 (t2 < t1). Consequently, 
 the written kv1 can't be read out and kv1 is lost from the user's perspective.
 If the above analysis is right, after the maximumTimestamp of 
 MemStore#timeRangeTracker has been set to t2, the user will experience data 
 loss in the following situations:
 1. Before any new write with kv.timestamp > t1 has been added to the 
 MemStore, a read request for kv1 with timestamp=t1 cannot read kv1 out.
 2. Before any new write with kv.timestamp > t1 has been added to the 
 MemStore, if a flush happens, the data of the MemStore will be flushed to a 
 StoreFile with StoreFile#maximumTimestamp set to t2. After that, any read 
 request with timestamp=t1 cannot read kv1 until the next compaction 
 (actually, kv1.timestamp might not be included in the timeRange of the 
 StoreFile even after compaction).
 The second situation is much more serious because the incorrect timeRange of 
 the MemStore has been persisted to the file. 
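 The lost update described above can be replayed deterministically. The 
 following is an illustrative sketch only (not the actual HBase source): it 
 hand-interleaves the read-compare-write steps of an unsynchronized tracker 
 so that both threads pass the guard against t0 before either writes, T1 
 writes t1 first, and T2 then overwrites it with the smaller t2. The values 
 t0, t1, t2 follow the report.

 ```java
 // Deterministic replay of the race: both guards are evaluated while the
 // maximum is still t0, then the writes land in the order T1 -> T2.
 public class LostUpdateDemo {
     static long maximumTimestamp = 50L; // t0

     public static void main(String[] args) {
         long t1 = 200L, t2 = 100L;      // t1 > t2 > t0, as in the report

         // Step 1: both "threads" evaluate the guard while max is still t0.
         boolean t1Passes = maximumTimestamp < t1; // true
         boolean t2Passes = maximumTimestamp < t2; // true

         // Step 2: T1 writes first, T2 writes last and clobbers it.
         if (t1Passes) maximumTimestamp = t1;
         if (t2Passes) maximumTimestamp = t2;

         // The tracker now claims nothing newer than t2 exists, so a Get
         // with timestamp t1 would skip this MemStore and miss kv1.
         System.out.println(maximumTimestamp); // prints 100
     }
 }
 ```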
 Similarly, concurrent updates of TimeRangeTracker#minimumTimestamp may 
 cause the same problem.
 As a simple fix, we could make TimeRangeTracker#includeTimestamp 
 synchronized so that this method won't be invoked 
 concurrently.
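 A minimal sketch of that fix, on a simplified stand-in for the tracker (the 
 class and field names follow the description; this is an illustration, not 
 the actual HBase patch): making includeTimestamp synchronized turns the 
 read-compare-write into an atomic step, so the smaller timestamp can no 
 longer win the final write.

 ```java
 // Simplified tracker with the proposed synchronized includeTimestamp.
 public class TimeRangeTrackerSketch {
     private long minimumTimestamp = Long.MAX_VALUE;
     private long maximumTimestamp = Long.MIN_VALUE;

     // synchronized closes the window in which two threads both pass the
     // comparison and the smaller timestamp overwrites the larger one.
     public synchronized void includeTimestamp(final long timestamp) {
         if (timestamp < minimumTimestamp) {
             minimumTimestamp = timestamp;
         }
         if (timestamp > maximumTimestamp) {
             maximumTimestamp = timestamp;
         }
     }

     public synchronized long getMax() { return maximumTimestamp; }

     public static void main(String[] args) throws InterruptedException {
         final TimeRangeTrackerSketch tracker = new TimeRangeTrackerSketch();
         // T1 includes t1=200, T2 includes t2=100, concurrently.
         Thread t1 = new Thread(() -> tracker.includeTimestamp(200L));
         Thread t2 = new Thread(() -> tracker.includeTimestamp(100L));
         t1.start(); t2.start();
         t1.join(); t2.join();
         // Regardless of interleaving, the maximum cannot end up as t2.
         System.out.println(tracker.getMax()); // prints 200
     }
 }
 ```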



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10598) Written data can not be read out because MemStore#timeRangeTracker might be updated concurrently

2014-02-24 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13911125#comment-13911125
 ] 

Liang Xie commented on HBASE-10598:
---

Raising to Critical since this multi-threading issue introduces a correctness 
problem...

 Written data can not be read out because MemStore#timeRangeTracker might be 
 updated concurrently
 

 Key: HBASE-10598
 URL: https://issues.apache.org/jira/browse/HBASE-10598
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.16
Reporter: cuijianwei
Assignee: cuijianwei
Priority: Critical
 Attachments: HBASE-10598-0.94.v1.patch





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10411) [Book] Add a kerberos 'request is a replay (34)' issue at troubleshooting section

2014-02-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13902919#comment-13902919
 ] 

Liang Xie commented on HBASE-10411:
---

Yep, when we used the modified KrbApReq class via the -Xbootclasspath/p 
parameter, the "request is a replay" warnings went away entirely (our 
krb5-server side had already disabled replay caching). Just for your reference :)

 [Book] Add a kerberos 'request is a replay (34)' issue at troubleshooting 
 section
 -

 Key: HBASE-10411
 URL: https://issues.apache.org/jira/browse/HBASE-10411
 Project: HBase
  Issue Type: Improvement
  Components: documentation, security
Reporter: takeshi.miao
Assignee: takeshi.miao
Priority: Minor
 Attachments: HBASE-10411-trunk-v01.patch, HBASE-10411-v01.odt


 For kerberos 'request is a replay (34)' issue (HBASE-10379), adding it to the 
 troubleshooting section in HBase book



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)




[jira] [Commented] (HBASE-10411) [Book] Add a kerberos 'request is a replay (34)' issue at troubleshooting section

2014-02-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13902969#comment-13902969
 ] 

Liang Xie commented on HBASE-10411:
---

The krb5-server side change wasn't done by me but by our ops guys; it was 
probably done by setting the env variable directly, per the documentation.




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HBASE-10411) [Book] Add a kerberos 'request is a replay (34)' issue at troubleshooting section

2014-02-16 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13902985#comment-13902985
 ] 

Liang Xie commented on HBASE-10411:
---

np :) It's fine, I put it here just in case other folks who hit a similar 
issue can find it by searching. 
Go ahead, good job!




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

