[jira] [Commented] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039665#comment-15039665
 ] 

Yu Li commented on HBASE-14906:
---

Thanks for the review and comments Duo!

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch, 
> HBASE-14906.v3.patch, HBASE-14906.v4.patch, HBASE-14906.v4.patch
>
>
> When checking FlushLargeStoresPolicy, I found the following possible 
> improvements:
> 1. Currently in selectStoresToFlush we do the selection no matter how many 
> families the table actually has, which is unnecessary for a single family
> 2. The default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> cannot fit all cases, and requires the user to know details of the 
> implementation to set it properly. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
> <property>
>   <name>hbase.hregion.percolumnfamilyflush.size.lower.bound</name>
>   <value>16777216</value>
>   <description>
>     If FlushLargeStoresPolicy is used and there are multiple column families,
>     then every time that we hit the total memstore limit, we find out all the
>     column families whose memstores exceed a "lower bound" and only flush them
>     while retaining the others in memory. The "lower bound" will be
>     "hbase.hregion.memstore.flush.size / column_family_number" by default
>     unless value of this property is larger than that. If none of the families
>     have their memstore size more than lower bound, all the memstores will be
>     flushed (just as usual).
>   </description>
> </property>
> {noformat}
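
For illustration, a minimal sketch of the proposed selection logic (names 
are illustrative and assume the 1.x Store API; this is not the actual 
selectStoresToFlush code):

{code}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.apache.hadoop.hbase.regionserver.Store;

// Sketch: flush only families above the lower bound, which defaults to
// flushSize / numFamilies unless the configured value is larger.
List<Store> selectStoresToFlush(Collection<Store> stores, long flushSize,
    long configuredLowerBound) {
  long lowerBound = Math.max(configuredLowerBound, flushSize / stores.size());
  List<Store> toFlush = new ArrayList<Store>();
  for (Store s : stores) {
    if (s.getMemStoreSize() > lowerBound) {
      toFlush.add(s);
    }
  }
  if (toFlush.isEmpty()) {
    toFlush.addAll(stores);  // none above the bound: flush all, as usual
  }
  return toFlush;
}
{code}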



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations

2015-12-03 Thread Ted Malaska (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039668#comment-15039668
 ] 

Ted Malaska commented on HBASE-14795:
-

Can we open up a review board for this?

Thx

> Enhance the spark-hbase scan operations
> ---
>
> Key: HBASE-14795
> URL: https://issues.apache.org/jira/browse/HBASE-14795
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Malaska
>Assignee: Zhan Zhang
>Priority: Minor
> Attachments: 
> 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch
>
>
> This is a sub-jira of HBASE-14789.  This jira focuses on replacing 
> TableInputFormat with a more custom scan implementation that will make the 
> following use case more effective.
> Use case:
> When you have multiple scan ranges on a single table within a single 
> query, TableInputFormat will scan the outer range from the earliest start 
> key to the latest end key, whereas this implementation can be more pointed.
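
To make the difference concrete, a hedged sketch using the HBase client API 
(the row keys are made up; the multi-range wiring is what this jira would add):

{code}
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// TableInputFormat-style: a single scan covering the outer range, even
// though the query only needs two narrow ranges.
Scan outer = new Scan(Bytes.toBytes("aaa"), Bytes.toBytes("zzz"));

// Pointed alternative: one scan per requested range.
List<Scan> pointed = Arrays.asList(
    new Scan(Bytes.toBytes("aaa"), Bytes.toBytes("abc")),
    new Scan(Bytes.toBytes("xyz"), Bytes.toBytes("zzz")));
{code}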



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14926) Hung ThriftServer; no timeout on read from client; if client crashes, worker thread gets stuck reading

2015-12-03 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-14926:
--
Attachment: 14926.patch

Patch for thrift1 and thrift2 (of course the server implementations are 
different). The timeout seems to only work for TBoundedThreadPoolServer.

Fixed up the examples doc for thrift too (it baffled me for a while).

I'm a bit stuck on how to manufacture this circumstance in a test; I'd have 
to kill the client exactly where the server is doing a read... any ideas?
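
For reference, the underlying mechanism is just SO_TIMEOUT on the accepted 
socket; a minimal java.net sketch (the timeout value is illustrative, and 
this is not the patch itself):

{code}
import java.net.ServerSocket;
import java.net.Socket;

Socket client = serverSocket.accept();  // serverSocket: an open ServerSocket
// Without a read timeout, read() blocks forever if the client dies silently.
// With SO_TIMEOUT set, it throws SocketTimeoutException instead and the
// worker thread can be reclaimed.
client.setSoTimeout(60000);  // milliseconds
int b = client.getInputStream().read();  // now bounded
{code}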

> Hung ThriftServer; no timeout on read from client; if client crashes, worker 
> thread gets stuck reading
> --
>
> Key: HBASE-14926
> URL: https://issues.apache.org/jira/browse/HBASE-14926
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Affects Versions: 2.0.0, 1.2.0, 1.1.2, 1.3.0, 1.0.3, 0.98.16
>Reporter: stack
> Attachments: 14926.patch
>
>
> Thrift server is hung. All worker threads are doing this:
> {code}
> "thrift-worker-0" daemon prio=10 tid=0x7f0bb95c2800 nid=0xf6a7 runnable 
> [0x7f0b956e]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x00066d859490> (a java.io.BufferedInputStream)
> at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at 
> org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> at 
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at 
> org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
> at 
> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
> at 
> org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)
> at 
> org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> They never recover.
> I don't have client side logs.
> We've been here before: HBASE-4967 "connected client thrift sockets should 
> have a server side read timeout" but this patch only got applied to fb branch 
> (and thrift has changed since then).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14926) Hung ThriftServer; no timeout on read from client; if client crashes, worker thread gets stuck reading

2015-12-03 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-14926:
--
         Assignee: stack
Affects Version/s: 1.3.0
                   1.2.0
                   2.0.0
                   1.1.2
                   1.0.3
                   0.98.16
           Status: Patch Available  (was: Open)

> Hung ThriftServer; no timeout on read from client; if client crashes, worker 
> thread gets stuck reading
> --
>
> Key: HBASE-14926
> URL: https://issues.apache.org/jira/browse/HBASE-14926
> Project: HBase
>  Issue Type: Bug
>  Components: Thrift
>Affects Versions: 0.98.16, 1.0.3, 1.1.2, 2.0.0, 1.2.0, 1.3.0
>Reporter: stack
>Assignee: stack
> Attachments: 14926.patch
>
>
> Thrift server is hung. All worker threads are doing this:
> {code}
> "thrift-worker-0" daemon prio=10 tid=0x7f0bb95c2800 nid=0xf6a7 runnable 
> [0x7f0b956e]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x00066d859490> (a java.io.BufferedInputStream)
> at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at 
> org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> at 
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at 
> org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
> at 
> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
> at 
> org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)
> at 
> org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> They never recover.
> I don't have client side logs.
> We've been here before: HBASE-4967 "connected client thrift sockets should 
> have a server side read timeout" but this patch only got applied to fb branch 
> (and thrift has changed since then).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14904) Mark Base[En|De]coder LimitedPrivate and fix binary compat issue

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041146#comment-15041146
 ] 

Hudson commented on HBASE-14904:


SUCCESS: Integrated in HBase-1.3 #413 (See 
[https://builds.apache.org/job/HBase-1.3/413/])
HBASE-14904 Mark Base[En|De]coder LimitedPrivate and fix binary compat (enis: 
rev edb8edfeb3564152dfacac0e5fe71ba295df821e)
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseDecoder.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseEncoder.java


> Mark Base[En|De]coder LimitedPrivate and fix binary compat issue
> 
>
> Key: HBASE-14904
> URL: https://issues.apache.org/jira/browse/HBASE-14904
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: hbase-14904_v1.patch, hbase-14904_v2.patch
>
>
> PHOENIX-2477 revealed that the changes from HBASE-14501 break binary 
> compatibility for Phoenix compiled with earlier versions of HBase and run 
> against later versions. 
> This is one of the areas where the boundary is not clear, but it won't hurt 
> us to fix it. 
> The exception trace is: 
> {code}
> Exception in thread "main" java.lang.NoSuchFieldError: in
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$PhoenixBaseDecoder.(IndexedWALEditCodec.java:106)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueDecoder.(IndexedWALEditCodec.java:121)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec.getDecoder(IndexedWALEditCodec.java:63)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:292)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:148)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:316)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:281)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:269)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:418)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:247)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:422)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:357)
> {code}
> Although {{BaseDecoder.in}} is still there, it got changed to be a class 
> rather than an interface. BaseDecoder is marked Private, thus the binary 
> compat check is not run at all. Not sure whether it would have caught this. 
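
To make the failure mode concrete: JVM field resolution matches on both the 
name and the descriptor, so changing a field's declared type breaks binary 
compatibility even though the source still compiles. A hypothetical, 
simplified illustration (not the real class bodies; PBIS stands in for the 
new stream type):

{code}
// v1, what Phoenix was compiled against:
public abstract class BaseDecoder {
  protected final java.io.InputStream in;  // descriptor Ljava/io/InputStream;
  protected BaseDecoder(java.io.InputStream in) { this.in = in; }
}

// v2, after the field's declared type changed:
public abstract class BaseDecoder {
  protected final PBIS in;                 // different descriptor
  protected BaseDecoder(PBIS in) { this.in = in; }
}

// A subclass compiled against v1 that reads "in" links against the old
// descriptor and fails at runtime with java.lang.NoSuchFieldError: in.
{code}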



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Heng Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041145#comment-15041145
 ] 

Heng Chen commented on HBASE-14790:
---

Makes sense... 
Let's keep this part as it is originally. We can implement just the 'acked 
length' logic; that alone could fix HBASE-14004. 
As for the performance improvement work, keep going with your fan-out stream 
here. Thoughts? [~Apache9]

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of its features if we only want 
> to log the WAL. For example, we do not need pipeline recovery since we could 
> just close the old logger and open a new one. And we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> Most importantly, it is hard to handle all the corner cases needed to avoid 
> data loss or data inconsistency (such as HBASE-14004) when using the 
> original DFSOutputStream, due to its complicated logic. The complicated 
> logic also forces us to use some magical tricks to increase performance. For 
> example, we need to use multiple threads to call {{hflush}} when logging, 
> and now we use 5 threads. But why 5, not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for logging the 
> WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14822) Renewing leases of scanners doesn't work

2015-12-03 Thread Lars Hofhansl (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-14822:
--
Attachment: 14822-v4-0.98.txt

0.98 version that simply adds a new PB flag.
Master version soon.

> Renewing leases of scanners doesn't work
> 
>
> Key: HBASE-14822
> URL: https://issues.apache.org/jira/browse/HBASE-14822
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.14
>Reporter: Samarth Jain
>Assignee: Lars Hofhansl
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: 14822-0.98-v2.txt, 14822-0.98-v3.txt, 14822-0.98.txt, 
> 14822-v3-0.98.txt, 14822-v4-0.98.txt, 14822.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14904) Mark Base[En|De]coder LimitedPrivate and fix binary compat issue

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041228#comment-15041228
 ] 

Hudson commented on HBASE-14904:


FAILURE: Integrated in HBase-Trunk_matrix #530 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/530/])
HBASE-14904 Mark Base[En|De]coder LimitedPrivate and fix binary compat (enis: 
rev b3260423b1f59a0af80f5938339997569c3eb21a)
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseEncoder.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseDecoder.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java


> Mark Base[En|De]coder LimitedPrivate and fix binary compat issue
> 
>
> Key: HBASE-14904
> URL: https://issues.apache.org/jira/browse/HBASE-14904
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: hbase-14904_v1.patch, hbase-14904_v2.patch
>
>
> PHOENIX-2477 revealed that the changes from HBASE-14501 break binary 
> compatibility for Phoenix compiled with earlier versions of HBase and run 
> against later versions. 
> This is one of the areas where the boundary is not clear, but it won't hurt 
> us to fix it. 
> The exception trace is: 
> {code}
> Exception in thread "main" java.lang.NoSuchFieldError: in
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$PhoenixBaseDecoder.(IndexedWALEditCodec.java:106)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueDecoder.(IndexedWALEditCodec.java:121)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec.getDecoder(IndexedWALEditCodec.java:63)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:292)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:148)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:316)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:281)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:269)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:418)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:247)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:422)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:357)
> {code}
> Although {{BaseDecoder.in}} is still there, it got changed to be a class 
> rather than an interface. BaseDecoder is marked Private, thus the binary 
> compat check is not run at all. Not sure whether it would have caught this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041227#comment-15041227
 ] 

Hudson commented on HBASE-13082:


FAILURE: Integrated in HBase-Trunk_matrix #530 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/530/])
HBASE-13082 Coarsen StoreScanner locks to RegionScanner (Ram) (ramkrishna: rev 
8b3d1f144408e4a7a014c5ac46418c9e91b9b0db)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionMergeTransactionOnCluster.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionReplayEvents.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/cleaner/TestSnapshotFromMaster.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/backup/example/TestZooKeeperTableArchiveClient.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StripeStoreFileManager.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestEncryptionKeyRotation.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStore.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HStore.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/MockStoreFile.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestStripeStoreFileManager.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/compactions/TestCompactedHFilesDischarger.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/CompactedHFilesDischarger.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionReplicas.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileManager.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/io/TestHeapSize.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ReversedStoreScanner.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/DefaultStoreFileManager.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestHFileOutputFormat2.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/TestIOFencing.java


> Coarsen StoreScanner locks to RegionScanner
> ---
>
> Key: HBASE-13082
> URL: https://issues.apache.org/jira/browse/HBASE-13082
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: ramkrishna.s.vasudevan
> Fix For: 2.0.0
>
> Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, 
> 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1.pdf, 
> HBASE-13082_12.patch, HBASE-13082_13.patch, HBASE-13082_14.patch, 
> HBASE-13082_15.patch, HBASE-13082_16.patch, HBASE-13082_17.patch, 
> HBASE-13082_18.patch, HBASE-13082_19.patch, HBASE-13082_1_WIP.patch, 
> HBASE-13082_2.pdf, HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, 
> HBASE-13082_4.patch, HBASE-13082_9.patch, HBASE-13082_9.patch, 
> HBASE-13082_withoutpatch.jpg, HBASE-13082_withpatch.jpg, 
> LockVsSynchronized.java, gc.png, gc.png, gc.png, hits.png, next.png, next.png
>
>
> Continuing where HBASE-10015 left off.
> We can avoid locking (and memory fencing) inside StoreScanner by deferring to 
> the lock already held by the RegionScanner.
> In tests this shows quite a scan improvement and reduced CPU (the fences make 
> the cores wait for memory fetches).
> There are some drawbacks too:
> * All calls to RegionScanner need to remain synchronized
> * Implementors of coprocessors need to be diligent in following the locking 
> contract. For example Phoenix does not lock RegionScanner.nextRaw() as 
> required in the documentation (not picking on Phoenix, this one is my fault 
> as I told them it's OK)
> * possible starving of flushes and compactions under heavy read load: 
> RegionScanner operations would keep getting the locks and the 
> flushes/compactions would not be able to finalize the set of files.
> I'll have a patch soon.
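
A schematic of the coarsening (shapes only, hypothetical classes -- not the 
patch):

{code}
import java.util.List;
import org.apache.hadoop.hbase.Cell;

// Before: each StoreScanner synchronizes internally, so every next() pays
// for lock acquisition and memory fences per column family.
class StoreScanner {
  synchronized boolean next(List<Cell> out) { return false; }
}

// After: one lock held at the RegionScanner level, with StoreScanner running
// lock-free beneath it -- which is why all callers, including coprocessors
// calling nextRaw(), must keep honouring the synchronization contract.
class RegionScannerImpl {
  private final Object lock = new Object();
  private final StoreScanner storeScanner = new StoreScanner();
  boolean nextRaw(List<Cell> out) {
    synchronized (lock) {
      return storeScanner.next(out);
    }
  }
}
{code}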



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14904) Mark Base[En|De]coder LimitedPrivate and fix binary compat issue

2015-12-03 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-14904:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Pushed this to 0.98+. Thanks for looking. 

> Mark Base[En|De]coder LimitedPrivate and fix binary compat issue
> 
>
> Key: HBASE-14904
> URL: https://issues.apache.org/jira/browse/HBASE-14904
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: hbase-14904_v1.patch, hbase-14904_v2.patch
>
>
> PHOENIX-2477 revealed that the changes from HBASE-14501 break binary 
> compatibility for Phoenix compiled with earlier versions of HBase and run 
> against later versions. 
> This is one of the areas where the boundary is not clear, but it won't hurt 
> us to fix it. 
> The exception trace is: 
> {code}
> Exception in thread "main" java.lang.NoSuchFieldError: in
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$PhoenixBaseDecoder.(IndexedWALEditCodec.java:106)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueDecoder.(IndexedWALEditCodec.java:121)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec.getDecoder(IndexedWALEditCodec.java:63)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:292)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:148)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:316)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:281)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:269)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:418)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:247)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:422)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:357)
> {code}
> Although {{BaseDecoder.in}} is still there, it got changed to be a class 
> rather than an interface. BaseDecoder is marked Private, thus the binary 
> compat check is not run at all. Not sure whether it would have caught this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Vikas Vishwakarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15040776#comment-15040776
 ] 

Vikas Vishwakarma commented on HBASE-14869:
---

For the metrics I am appending _SizeRangeCount_ and _TimeRangeCount_ to each 
metric name, so it is easy to identify the range metrics by these fixed 
patterns, which differentiate them from all other metrics. Also, based on a 
/Size/ or /Time/ match, it is easy to process each metric accordingly as a 
size or time metric.
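
For example, a consumer can classify the new metrics purely on those fixed 
suffixes (hypothetical matching code):

{code}
// Range metrics are identified by the fixed patterns described above;
// Size vs Time in the name tells the consumer how to interpret the bands.
boolean isRangeMetric = name.contains("SizeRangeCount")
    || name.contains("TimeRangeCount");
boolean isSizeMetric = isRangeMetric && name.contains("Size");
{code}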

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?
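
A minimal sketch of such banded counters (band boundaries taken from the 
example above; class and method names are hypothetical):

{code}
import java.util.concurrent.atomic.AtomicLongArray;

class LatencyBands {
  // Upper bounds in ms; the implicit last band is "> 1000ms". Configurable
  // boundaries would replace this hard-coded example.
  private final long[] uppersMs = {5, 10, 20, 50, 100, 1000};
  private final AtomicLongArray counts = new AtomicLongArray(uppersMs.length + 1);

  void record(long latencyMs) {
    int i = 0;
    while (i < uppersMs.length && latencyMs > uppersMs[i]) {
      i++;
    }
    counts.incrementAndGet(i);  // e.g. Get_0-5, Get_6-10, ..., Get_>1000
  }
}
{code}

Counts per band can then be diffed over time to answer the SLA questions 
after the fact.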



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Heng Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041196#comment-15041196
 ] 

Heng Chen commented on HBASE-14790:
---

{quote}
DataStreamer#block tracks the "number of bytes acked". It is returned by 
DFSOutputStream#getBlock
{quote}
Bad news...  This method is not public [~zhz]  

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of its features if we only want 
> to log the WAL. For example, we do not need pipeline recovery since we could 
> just close the old logger and open a new one. And we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> Most importantly, it is hard to handle all the corner cases needed to avoid 
> data loss or data inconsistency (such as HBASE-14004) when using the 
> original DFSOutputStream, due to its complicated logic. The complicated 
> logic also forces us to use some magical tricks to increase performance. For 
> example, we need to use multiple threads to call {{hflush}} when logging, 
> and now we use 5 threads. But why 5, not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for logging the 
> WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14926) Hung ThriftServer; no timeout on read from client; if client crashes, worker thread gets stuck reading

2015-12-03 Thread stack (JIRA)
stack created HBASE-14926:
-

 Summary: Hung ThriftServer; no timeout on read from client; if 
client crashes, worker thread gets stuck reading
 Key: HBASE-14926
 URL: https://issues.apache.org/jira/browse/HBASE-14926
 Project: HBase
  Issue Type: Bug
  Components: Thrift
Reporter: stack


Thrift server is hung. All worker threads are doing this:

{code}
"thrift-worker-0" daemon prio=10 tid=0x7f0bb95c2800 nid=0xf6a7 runnable 
[0x7f0b956e]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0x00066d859490> (a java.io.BufferedInputStream)
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at 
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
at 
org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at 
org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)
at org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

They never recover.

I don't have client side logs.

We've been here before: HBASE-4967 "connected client thrift sockets should have 
a server side read timeout" but this patch only got applied to fb branch (and 
thrift has changed since then).





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13857) Slow WAL Append count in ServerMetricsTmpl.jamon is hardcoded to zero

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039698#comment-15039698
 ] 

Hudson commented on HBASE-13857:


FAILURE: Integrated in HBase-Trunk_matrix #529 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/529/])
HBASE-13857 Slow WAL Append count in ServerMetricsTmpl.jamon is (stack: rev 
51503efcf05be734c14200233d5f1495e4c2c3f1)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapperImpl.java
* 
hbase-server/src/main/jamon/org/apache/hadoop/hbase/tmpl/regionserver/ServerMetricsTmpl.jamon
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestMetricsWAL.java
* 
hbase-hadoop-compat/src/main/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapper.java
* 
hbase-hadoop2-compat/src/main/java/org/apache/hadoop/hbase/regionserver/wal/MetricsWALSourceImpl.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/MetricsRegionServerWrapperStub.java
* 
hbase-hadoop-compat/src/main/java/org/apache/hadoop/hbase/regionserver/wal/MetricsWALSource.java


> Slow WAL Append count in ServerMetricsTmpl.jamon is hardcoded to zero
> -
>
> Key: HBASE-13857
> URL: https://issues.apache.org/jira/browse/HBASE-13857
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, UI
>Affects Versions: 0.98.0
>Reporter: Lars George
>Assignee: Vrishal Kulkarni
>  Labels: beginner
> Fix For: 2.0.0
>
> Attachments: HBASE-13857.patch
>
>
> The template has this:
> {noformat}
> <tr>
> ...
>     <td>Slow WAL Append Count</td>
>     <td><% 0 %></td>
> </tr>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039718#comment-15039718
 ] 

Duo Zhang commented on HBASE-14790:
---

{quote}
2. dn1 received the WAL entry, and it is read by ReplicationSource and 
replicated to the slave cluster.
3. dn1 and rs both crash; dn2 and dn3 have not received this WAL entry yet, 
and rs has not bumped the GS of this block yet.
4. NameNode completes the file with a length that does not contain this WAL 
entry, since the GS of the blocks on dn2 and dn3 is correct and NameNode does 
not know there used to be a block with a longer length.
{quote}
In a fan-out implementation this problem is obvious, but in a pipelined 
implementation it is not that straightforward, and I used to think I was 
wrong and that this could not happen in a pipelined implementation. Data 
becomes visible on a datanode only after that datanode receives the 
downstream ack. So if the pipeline is dn1->dn2->dn3, then dn3 is the first 
datanode that makes data visible to the client, and usually we assume the 
data has also been written to dn1 and dn2. But perhaps for performance 
reasons, {{BlockReceiver}} sends a packet to the downstream mirror before 
writing it to local disk. So it could happen that dn3 makes the data visible 
and it is read by the client, but dn1 and dn2 crash before writing the data 
to local disk. Then let us kill the client and dn3, and restart dn1 and dn2, 
whoops...

I had a discussion with my workmate [~yangzhe1991], and we think that if we 
allow duplicate WAL entries in HBase, then the pipeline recovery part could 
also be moved to a background thread. We could just rewrite the WAL entries 
after the acked point to the new file, which could also reduce the recovery 
latency.

And for keeping an "acked length", I think we could make use of the fsync 
method in HDFS. We could call fsync asynchronously to update the length on 
the namenode. The replication source should not read beyond the length 
obtained from the namenode (do not trust the visible length read from a 
datanode). The advantage here is that when the region server crashes, we 
could still get this value from the namenode, and the file will eventually 
be closed by someone, so the length will finally be correct.

Thanks.
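
A hedged sketch of the fsync-based "acked length" idea, assuming the HDFS 
client API of that era ({{HdfsDataOutputStream.hsync}} with 
{{SyncFlag.UPDATE_LENGTH}}); an illustration, not the actual patch:

{code}
import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

// Hypothetical helper. Persists the current file length on the NameNode so
// a reader (e.g. the replication source) can trust it, rather than the
// visible length reported by a datanode. In the proposal this would run
// asynchronously, off the WAL write path.
class AckedLengthPublisher {
  static void publishAckedLength(HdfsDataOutputStream out) throws IOException {
    out.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
  }
}
{code}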

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of its features if we only want 
> to log the WAL. For example, we do not need pipeline recovery since we could 
> just close the old logger and open a new one. And we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> Most importantly, it is hard to handle all the corner cases needed to avoid 
> data loss or data inconsistency (such as HBASE-14004) when using the 
> original DFSOutputStream, due to its complicated logic. The complicated 
> logic also forces us to use some magical tricks to increase performance. For 
> example, we need to use multiple threads to call {{hflush}} when logging, 
> and now we use 5 threads. But why 5, not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for logging the 
> WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039719#comment-15039719
 ] 

Andrew Purtell commented on HBASE-14869:


Thanks Vikas. It would be a shame if we wished to tweak the naming after 
this is committed, that's all. Not worried about more than that. 

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039738#comment-15039738
 ] 

Lars Hofhansl commented on HBASE-14869:
---

Cool. The metric name's the only open issue. If nobody else chimes in, I'm 
good with committing.

Maybe [~vik.karma] can report how hard it was to make sense of these new 
metrics in the automated scripts.
In the end any naming is probably fine. The main part I wasn't sure about 
was the "greater than X" naming.
Recall this scheme: "Get_0-1", "Get_1-3", "Get_10-30", etc., and "Get_>60"

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14904) Mark Base[En|De]coder LimitedPrivate and fix binary compat issue

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15040796#comment-15040796
 ] 

Hudson commented on HBASE-14904:


SUCCESS: Integrated in HBase-1.3-IT #354 (See 
[https://builds.apache.org/job/HBase-1.3-IT/354/])
HBASE-14904 Mark Base[En|De]coder LimitedPrivate and fix binary compat (enis: 
rev edb8edfeb3564152dfacac0e5fe71ba295df821e)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseDecoder.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseEncoder.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java


> Mark Base[En|De]coder LimitedPrivate and fix binary compat issue
> 
>
> Key: HBASE-14904
> URL: https://issues.apache.org/jira/browse/HBASE-14904
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: hbase-14904_v1.patch, hbase-14904_v2.patch
>
>
> PHOENIX-2477 revealed that the changes from HBASE-14501 break binary 
> compatibility for Phoenix compiled with earlier versions of HBase and run 
> against later versions. 
> This is one of the areas where the boundary is not clear, but it won't hurt 
> us to fix it. 
> The exception trace is: 
> {code}
> Exception in thread "main" java.lang.NoSuchFieldError: in
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$PhoenixBaseDecoder.(IndexedWALEditCodec.java:106)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueDecoder.(IndexedWALEditCodec.java:121)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec.getDecoder(IndexedWALEditCodec.java:63)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:292)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:148)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:316)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:281)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:269)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:418)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:247)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:422)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:357)
> {code}
> Although {{BaseDecoder.in}} is still there, it got changed to be a class 
> rather than an interface. BaseDecoder is marked Private, thus the binary 
> compat check is not run at all. Not sure whether it would have caught this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner

2015-12-03 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-13082:
---
Fix Version/s: 2.0.0

> Coarsen StoreScanner locks to RegionScanner
> ---
>
> Key: HBASE-13082
> URL: https://issues.apache.org/jira/browse/HBASE-13082
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: ramkrishna.s.vasudevan
> Fix For: 2.0.0
>
> Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, 
> 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1.pdf, 
> HBASE-13082_12.patch, HBASE-13082_13.patch, HBASE-13082_14.patch, 
> HBASE-13082_15.patch, HBASE-13082_16.patch, HBASE-13082_17.patch, 
> HBASE-13082_18.patch, HBASE-13082_19.patch, HBASE-13082_1_WIP.patch, 
> HBASE-13082_2.pdf, HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, 
> HBASE-13082_4.patch, HBASE-13082_9.patch, HBASE-13082_9.patch, 
> HBASE-13082_withoutpatch.jpg, HBASE-13082_withpatch.jpg, 
> LockVsSynchronized.java, gc.png, gc.png, gc.png, hits.png, next.png, next.png
>
>
> Continuing where HBASE-10015 left off.
> We can avoid locking (and memory fencing) inside StoreScanner by deferring to 
> the lock already held by the RegionScanner.
> In tests this shows quite a scan improvement and reduced CPU (the fences make 
> the cores wait for memory fetches).
> There are some drawbacks too:
> * All calls to RegionScanner need to remain synchronized
> * Implementors of coprocessors need to be diligent in following the locking 
> contract. For example Phoenix does not lock RegionScanner.nextRaw() as 
> required in the documentation (not picking on Phoenix, this one is my fault 
> as I told them it's OK)
> * possible starving of flushes and compactions under heavy read load: 
> RegionScanner operations would keep getting the locks and the 
> flushes/compactions would not be able to finalize the set of files.
> I'll have a patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner

2015-12-03 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-13082:
---
Release Note: After this JIRA we will no longer reset scanners after a 
compaction during the course of a scan. The files that were compacted will 
continue to be used in the scan process. The compacted files are archived by 
a background thread that runs every 2 minutes by default, and only when 
there are no active scanners on those compacted files. The interval can be 
controlled using the knob 'hbase.hfile.compactions.cleaner.interval'. 
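
For operators, a sketch of the knob in hbase-site.xml (assuming the interval 
is in milliseconds, as with other HBase chore intervals):

{noformat}
<property>
  <name>hbase.hfile.compactions.cleaner.interval</name>
  <!-- assumed milliseconds; 120000 = the stated 2-minute default -->
  <value>120000</value>
</property>
{noformat}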

> Coarsen StoreScanner locks to RegionScanner
> ---
>
> Key: HBASE-13082
> URL: https://issues.apache.org/jira/browse/HBASE-13082
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: ramkrishna.s.vasudevan
> Fix For: 2.0.0
>
> Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, 
> 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1.pdf, 
> HBASE-13082_12.patch, HBASE-13082_13.patch, HBASE-13082_14.patch, 
> HBASE-13082_15.patch, HBASE-13082_16.patch, HBASE-13082_17.patch, 
> HBASE-13082_18.patch, HBASE-13082_19.patch, HBASE-13082_1_WIP.patch, 
> HBASE-13082_2.pdf, HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, 
> HBASE-13082_4.patch, HBASE-13082_9.patch, HBASE-13082_9.patch, 
> HBASE-13082_withoutpatch.jpg, HBASE-13082_withpatch.jpg, 
> LockVsSynchronized.java, gc.png, gc.png, gc.png, hits.png, next.png, next.png
>
>
> Continuing where HBASE-10015 left off.
> We can avoid locking (and memory fencing) inside StoreScanner by deferring to 
> the lock already held by the RegionScanner.
> In tests this shows quite a scan improvement and reduced CPU (the fences make 
> the cores wait for memory fetches).
> There are some drawbacks too:
> * All calls to RegionScanner need to remain synchronized
> * Implementors of coprocessors need to be diligent in following the locking 
> contract. For example Phoenix does not lock RegionScanner.nextRaw() as 
> required in the documentation (not picking on Phoenix, this one is my fault 
> as I told them it's OK)
> * possible starving of flushes and compactions under heavy read load: 
> RegionScanner operations would keep getting the locks and the 
> flushes/compactions would not be able to finalize the set of files.
> I'll have a patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041185#comment-15041185
 ] 

stack commented on HBASE-14790:
---

bq. ReplicationSource should ask this length first before reading and do not 
read beyond it. If we have this logic, 

Doing this would be an improvement over the current way we do replication -- 
fewer NN ops -- where we open the file, read till EOF, close, then do the 
same again to see if anything new has been added to the file.

bq. ...we could reset the acked length if needed and then move the remaining 
operations of closing the file to a background thread to reduce latency. 
Thoughts? stack

This is cleanup of a broken WAL? This is being able to ask each DN what it 
thinks the length is? While this is going on, we would be holding on to the 
hbase handlers, not letting responses go back to the clients? Would we have 
to do some weird accounting where three clients A, B, and C have each 
written an edit, and the length we get back from the existing DNs after a 
crash does not include the edit written by client C... we'd have to figure 
out how to fail client C's write (though we'd moved on from the append and 
were trying to sync/hflush it)?

bq. We could just rewrite the WAL entries after the acked point to the new 
file, which could also reduce the recovery latency.

I think we can do this currently in the multi-WAL case... I'd have to check 
(at least one implementation, which may not be the one that landed, used to 
do this). It kept the edits around because it had a standby WAL: if the 
current WAL was 'slow', we'd throw it away, add the outstanding edits to the 
new WAL, and away we go again (I can dig it up...).

bq. The replication source should not read beyond the length obtained from 
the namenode (do not trust the visible length read from a datanode). 

This would be lots of NN ops? (In a subsequent comment you say this... nvm)

bq. The advantage here is that when the region server crashes, we could 
still get this value from the namenode, and the file will eventually be 
closed by someone, so the length will finally be correct.

This would be sweet though (we could do away with keeping replication 
lengths up in zk?).

bq. There will always be some situation where we could not know there is 
data loss unless we call fsync every time to update the length on the 
namenode when writing the WAL, I think. 

Yes. This was the case before your patch too, though. We should also get 
some experience of what it's like running an fsync'd WAL...

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of its features if we only want 
> to log the WAL. For example, we do not need pipeline recovery since we could 
> just close the old logger and open a new one. And we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> Most importantly, it is hard to handle all the corner cases needed to avoid 
> data loss or data inconsistency (such as HBASE-14004) when using the 
> original DFSOutputStream, due to its complicated logic. The complicated 
> logic also forces us to use some magical tricks to increase performance. For 
> example, we need to use multiple threads to call {{hflush}} when logging, 
> and now we use 5 threads. But why 5, not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for logging the 
> WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14905) VerifyReplication does not honour versions option

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039688#comment-15039688
 ] 

Hudson commented on HBASE-14905:


FAILURE: Integrated in HBase-1.3-IT #353 (See 
[https://builds.apache.org/job/HBase-1.3-IT/353/])
HBASE-14905 VerifyReplication does not honour versions option (Vishal 
Khandelwal) (tedyu: rev b001019d9bca43586de13bd7df72235d56d36503)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java


> VerifyReplication does not honour versions option
> -
>
> Key: HBASE-14905
> URL: https://issues.apache.org/jira/browse/HBASE-14905
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: 14905-v2.txt, HBASE-14905.patch, HBASE-14905_v3.patch, 
> HBASE-14905_v4.patch, test.patch
>
>
> source:
> hbase(main):001:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW                  COLUMN+CELL
>  r1                  column=f1:, timestamp=1449030102091, value=value1112
>  r1                  column=f1:, timestamp=1449029774173, value=value1001
>  r1                  column=f1:, timestamp=1449029709974, value=value1002
> target:
> hbase(main):023:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW                  COLUMN+CELL
>  r1                  column=f1:, timestamp=1449030102091, value=value1112
>  r1                  column=f1:, timestamp=1449030090758, value=value1112
>  r1                  column=f1:, timestamp=1449029984282, value=value
>  r1                  column=f1:, timestamp=1449029774173, value=value1001
>  r1                  column=f1:, timestamp=1449029709974, value=value1002
> /bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
> --versions=100 1 t1
> org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
>   GOODROWS=1
> This does not show any mismatch, but ideally it should. This is because in 
> the VerifyReplication class maxVersions is not set correctly.
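
The fix amounts to propagating the option onto the verification scans; a 
hedged sketch against the 0.98-era client API:

{code}
import org.apache.hadoop.hbase.client.Scan;

// Sketch: honour --versions instead of the Scan default of one version.
Scan scan = new Scan();
if (versions >= 0) {
  scan.setMaxVersions(versions);  // Scan.setMaxVersions(int) in 0.98/1.x
}
{code}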



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner

2015-12-03 Thread ramkrishna.s.vasudevan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-13082:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Pushed to master. Thanks for the reviews [~stack] and [~anoop.hbase] and others 
for providing feedback on the patch.

> Coarsen StoreScanner locks to RegionScanner
> ---
>
> Key: HBASE-13082
> URL: https://issues.apache.org/jira/browse/HBASE-13082
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: ramkrishna.s.vasudevan
> Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, 
> 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1.pdf, 
> HBASE-13082_12.patch, HBASE-13082_13.patch, HBASE-13082_14.patch, 
> HBASE-13082_15.patch, HBASE-13082_16.patch, HBASE-13082_17.patch, 
> HBASE-13082_18.patch, HBASE-13082_19.patch, HBASE-13082_1_WIP.patch, 
> HBASE-13082_2.pdf, HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, 
> HBASE-13082_4.patch, HBASE-13082_9.patch, HBASE-13082_9.patch, 
> HBASE-13082_withoutpatch.jpg, HBASE-13082_withpatch.jpg, 
> LockVsSynchronized.java, gc.png, gc.png, gc.png, hits.png, next.png, next.png
>
>
> Continuing where HBASE-10015 left off.
> We can avoid locking (and memory fencing) inside StoreScanner by deferring to 
> the lock already held by the RegionScanner.
> In tests this shows quite a scan improvement and reduced CPU (the fences make 
> the cores wait for memory fetches).
> There are some drawbacks too:
> * All calls to RegionScanner need to remain synchronized
> * Implementors of coprocessors need to be diligent in following the locking 
> contract. For example Phoenix does not lock RegionScanner.nextRaw() as 
> required in the documentation (not picking on Phoenix, this one is my fault 
> as I told them it's OK)
> * possible starving of flushes and compactions under heavy read load: 
> RegionScanner operations would keep getting the locks and the 
> flushes/compactions would not be able to finalize the set of files.
> I'll have a patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Heng Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040781#comment-15040781
 ] 

Heng Chen commented on HBASE-14790:
---

{quote}
And for keeping an "acked length", I think we could make use of the fsync 
method in HDFS. We could call fsync asynchronously to update the length on the 
namenode. The replication source should not read beyond the length gotten from 
the namenode (do not trust the visible length read from a datanode). 
{quote}

So if we cannot avoid calling fsync every time, maybe the approach [~Apache9] 
mentioned is the best solution? Shall we begin?
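For what it's worth, a minimal sketch of the asynchronous-fsync idea; the 
class and field names here are illustrative assumptions, not HBase code:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.fs.FSDataOutputStream;

// Sketch only: hsync in the background and publish the durably-synced length.
class AckedLengthUpdater {
  private final FSDataOutputStream out;
  private final AtomicLong ackedLength = new AtomicLong(0);
  private final ExecutorService pool = Executors.newSingleThreadExecutor();

  AckedLengthUpdater(FSDataOutputStream out) { this.out = out; }

  void syncAsync(final long lengthAtSubmit) {
    pool.execute(() -> {
      try {
        out.hsync();  // everything up to lengthAtSubmit is now durable
        ackedLength.accumulateAndGet(lengthAtSubmit, Math::max);
      } catch (Exception e) {
        // on failure: close this writer and roll a new WAL file
      }
    });
  }

  long getAckedLength() { return ackedLength.get(); }  // replication reads only up to here
}
{code}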

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14822) Renewing leases of scanners doesn't work

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041167#comment-15041167
 ] 

Hadoop QA commented on HBASE-14822:
---

{color:red}-1 overall{color}.  

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16763//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16763//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16763//artifact/patchprocess/checkstyle-aggregate.html

  Javadoc warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16763//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16763//console

This message is automatically generated.

> Renewing leases of scanners doesn't work
> 
>
> Key: HBASE-14822
> URL: https://issues.apache.org/jira/browse/HBASE-14822
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.14
>Reporter: Samarth Jain
>Assignee: Lars Hofhansl
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: 14822-0.98-v2.txt, 14822-0.98-v3.txt, 14822-0.98.txt, 
> 14822-v3-0.98.txt, 14822-v4-0.98.txt, 14822.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Phil Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041234#comment-15041234
 ] 

Phil Yang commented on HBASE-14790:
---

Considering these facts:
hflush is much faster than hsync, especially in pipeline mode, so we have to 
use hflush for HBase writes.
Data on a DN that has been hflushed but not hsynced may exist only in memory, 
not on disk, yet it can already be read by clients.

So if we hflush data to the DNs, it is read by the ReplicationSource and 
transferred to the slave cluster, and then the three DNs and the RS in the 
master cluster crash, then after replaying WALs the slave will have data that 
the master has lost...

The only way to prevent any data loss is to hsync every time, but that is too 
slow. I think most users can tolerate some data loss in exchange for faster 
writes, but cannot tolerate the slave having more data than the master.

Therefore, I think we can do the following:
hflush every time, not hsync;
hsync periodically, for example every 1000ms by default? It can be configured 
by users, and users can also configure hsync on every write, so there will not 
be any data loss unless the disks of all DNs fail...
The RS tells the "acked length" to the ReplicationSource, which covers the data 
we have hsynced, not merely hflushed. 
The ReplicationSource only transfers data not beyond the acked length, so the 
slave cluster will never have an inconsistency.
WAL reading can handle duplicate entries.
During WAL logging, if we get an error on hflush, we open a new file and 
rewrite this entry, and recover/hsync/close the old file asynchronously.
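A rough sketch of the periodic-hsync-plus-acked-length scheme above; every 
name here is made up for illustration and is not actual WAL or 
ReplicationSource code:

{code}
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for the WAL output stream.
interface WalWriter {
  long getCurrentLength();
  void hsync() throws IOException;
  void setAckedLength(long len);  // published to the ReplicationSource
}

class PeriodicSyncer {
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  void start(final WalWriter wal, long syncIntervalMs) {  // e.g. 1000ms default
    timer.scheduleAtFixedRate(() -> {
      try {
        long len = wal.getCurrentLength();
        wal.hsync();              // data durable up to len
        wal.setAckedLength(len);  // replication ships only entries below this offset
      } catch (IOException e) {
        // on error: roll to a new file; recover/hsync/close the old one asynchronously
      }
    }, syncIntervalMs, syncIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}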

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Heng Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040781#comment-15040781
 ] 

Heng Chen edited comment on HBASE-14790 at 12/4/15 6:20 AM:


{quote}
And for keeping an "acked length", I think we could make use of the fsync 
method in HDFS. We could call fsync asynchronously to update the length on the 
namenode. The replication source should not read beyond the length gotten from 
the namenode (do not trust the visible length read from a datanode). 
{quote}

So if we cannot avoid calling fsync every time, maybe the approach [~Apache9] 
mentioned is the best solution? Of course, we should keep the 'acked length' 
in the RS. Let's begin?


was (Author: chenheng):
{quote}
And for keeping an "acked length", I think we could make use of the fsync 
method in HDFS. We could call fsync asynchronously to update the length on the 
namenode. The replication source should not read beyond the length gotten from 
the namenode (do not trust the visible length read from a datanode). 
{quote}

So if we cannot avoid calling fsync every time, maybe the approach [~Apache9] 
mentioned is the best solution? Shall we begin?

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Vikas Vishwakarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039678#comment-15039678
 ] 

Vikas Vishwakarma commented on HBASE-14869:
---

[~apurtell] thanks for the review. We do not have splunk forwarders for the 
test env, but we already have daily automation scripts running on production 
logs that extract operation latencies from the periodic hbase metrics dump, 
such as Mutate_mean and Mutate_95th_percentile. Since this is just an addition 
to that metric list, we can easily collect the new metrics with the same 
script. So far I have tested this only locally on a dev setup, but I will set 
it up on a full cluster, run some long-running, high-load tests to check for 
perf impact, cpu usage etc., and update the test results. Sounds ok? 
If the naming convention or the range values used for these metrics need to 
be changed, I can do so based on suggestions and update the patch.

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?
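To make the band idea concrete, a minimal sketch of configurable latency 
bands; the boundaries and helper names are illustrative, not from any patch 
here:

{code}
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch only: count requests per latency band for after-the-fact SLA math.
class LatencyBands {
  private final long[] boundsMs;        // upper bounds, e.g. 5, 10, 20, 50, 100, 1000
  private final AtomicLongArray counts; // one extra bucket for "> last bound"

  LatencyBands(long... boundsMs) {
    this.boundsMs = boundsMs;
    this.counts = new AtomicLongArray(boundsMs.length + 1);
  }

  void record(long latencyMs) {
    int i = 0;
    while (i < boundsMs.length && latencyMs > boundsMs[i]) i++;
    counts.incrementAndGet(i);
  }

  // e.g. "how many requests missed a 100ms SLA": sum the buckets above it
  long countAtOrAbove(int band) {
    long n = 0;
    for (int i = band; i < counts.length(); i++) n += counts.get(i);
    return n;
  }
}
{code}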



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14772) Improve zombie detector; be more discerning

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039695#comment-15039695
 ] 

Hudson commented on HBASE-14772:


FAILURE: Integrated in HBase-Trunk_matrix #529 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/529/])
 HBASE-14772 Improve zombie detector; be more discerning; part2; (stack: rev 
5e430837d3e4a7d159e84964357297c8ab42430d)
* dev-support/test-patch.sh
* dev-support/zombie-detector.sh
 HBASE-14772 Improve zombie detector; be more discerning; part2; (stack: rev 
7117a2e35d42ef4e3f17b0a8f891fc5200cd0890)
* dev-support/zombie-detector.sh


> Improve zombie detector; be more discerning
> ---
>
> Key: HBASE-14772
> URL: https://issues.apache.org/jira/browse/HBASE-14772
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: stack
>Assignee: stack
> Fix For: 2.0.0
>
> Attachments: 14772v3.patch, zombie.patch, zombiev2.patch
>
>
> Currently, any surefire process with the hbase flag is a potential zombie. 
> Our zombie check currently takes a reading and if it finds candidate zombies, 
> it waits 30 seconds and then does another reading. If a concurrent build 
> going on, in both cases the zombie detector will come up positive though the 
> adjacent test run may be making progress; i.e. the cast of surefire processes 
> may have changed between readings but our detector just sees presence of  
> hbase surefire processes.
> Here is example:
> {code}
> Suspicious java process found - waiting 30s to see if there are just slow to 
> stop
> There appear to be 5 zombie tests, they should have been killed by surefire 
> but survived
> 12823 surefirebooter852180186418035480.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 7653 surefirebooter8579074445899448699.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 12614 surefirebooter136529596936417090.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 7836 surefirebooter3217047564606450448.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 13566 surefirebooter2084039411151963494.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
>  BEGIN zombies jstack extract
>  END  zombies jstack extract
> {code}
> 5 is the number of forked processes we allow when doing medium and large 
> tests so an adjacent build will always show as '5 zombies'.
> Need to add discerning if list of processes changes between readings.
> Can I also add a tag per build run that all forked processes pick up so I can 
> look at the current builds progeny only?
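The "discerning" check amounts to intersecting the two readings; a sketch of 
the idea in Java for illustration (the detector itself is a shell script, 
dev-support/zombie-detector.sh):

{code}
import java.util.HashSet;
import java.util.Set;

// Sketch only: a process is a zombie candidate only if the same PID appears
// in both readings taken 30 seconds apart; an adjacent build's processes
// churn between readings and drop out of the intersection.
class ZombieCheck {
  static Set<String> persistentPids(Set<String> firstReading, Set<String> secondReading) {
    Set<String> persistent = new HashSet<>(firstReading);
    persistent.retainAll(secondReading);
    return persistent;  // empty means no true zombies, just concurrent builds
  }
}
{code}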



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14866) VerifyReplication should use peer configuration in peer connection

2015-12-03 Thread Heng Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039694#comment-15039694
 ] 

Heng Chen commented on HBASE-14866:
---

{quote}
This could even be ZKConfig, moved from hbase-client, since it's private. It 
would be an expansion of its current responsibilities, but doesn't seem too bad.
{quote}

I like this idea. Moving ZKConfig into hbase-common sounds more reasonable.

> VerifyReplication should use peer configuration in peer connection
> --
>
> Key: HBASE-14866
> URL: https://issues.apache.org/jira/browse/HBASE-14866
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-14866.patch, HBASE-14866_v1.patch, 
> hbase-14866-v4.patch, hbase-14866_v2.patch, hbase-14866_v3.patch
>
>
> VerifyReplication uses the replication peer's configuration to construct the 
> ZooKeeper quorum address for the peer connection.  However, other 
> configuration properties in the peer's configuration are dropped.  It should 
> merge all configuration properties from the {{ReplicationPeerConfig}} when 
> creating the peer connection and obtaining credentials for the peer cluster.
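The merge could look roughly like the following sketch, assuming the peer's 
extra properties are available as a plain map (e.g. from 
{{ReplicationPeerConfig}}); this is not the committed patch:

{code}
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch only: overlay every peer property instead of copying just the
// ZooKeeper quorum address.
class PeerConfUtil {
  static Configuration buildPeerConf(Configuration base, Map<String, String> peerProps) {
    Configuration peerConf = HBaseConfiguration.create(base);
    for (Map.Entry<String, String> e : peerProps.entrySet()) {
      peerConf.set(e.getKey(), e.getValue());
    }
    return peerConf;  // use this when connecting to and authenticating against the peer
  }
}
{code}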



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14223) Meta WALs are not cleared if meta region was closed and RS aborts

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039696#comment-15039696
 ] 

Hudson commented on HBASE-14223:


FAILURE: Integrated in HBase-Trunk_matrix #529 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/529/])
Revert "HBASE-14223 Meta WALs are not cleared if meta region was closed (enis: 
rev bbd53b846ef6d78740f54f5cea3c73bd992dde09)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerServices.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/LogRoller.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseMetaHandler.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALFactory.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/MockRegionServerServices.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/MockRegionServer.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java


> Meta WALs are not cleared if meta region was closed and RS aborts
> -
>
> Key: HBASE-14223
> URL: https://issues.apache.org/jira/browse/HBASE-14223
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 1.0.4
>
> Attachments: HBASE-14223logs, hbase-14223_v0.patch, 
> hbase-14223_v1-branch-1.patch, hbase-14223_v2-branch-1.patch, 
> hbase-14223_v3-branch-1.patch, hbase-14223_v3-branch-1.patch, 
> hbase-14223_v3-master.patch
>
>
> When an RS opens meta, and later closes it, the WAL(FSHlog) is not closed. 
> The last WAL file just sits there in the RS WAL directory. If RS stops 
> gracefully, the WAL file for meta is deleted. Otherwise if RS aborts, WAL for 
> meta is not cleaned. It is also not split (which is correct) since master 
> determines that the RS no longer hosts meta at the time of RS abort. 
> From a cluster after running ITBLL with CM, I see a lot of {{-splitting}} 
> directories left uncleaned: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls 
> /apps/hbase/data/WALs
> Found 31 items
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 01:14 
> /apps/hbase/data/WALs/hregion-58203265
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 07:54 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433489308745-splitting
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 09:28 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433494382959-splitting
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 10:01 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433498252205-splitting
> ...
> {code}
> The directories contain WALs from meta: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> Found 2 items
> -rw-r--r--   3 hbase hadoop 201608 2015-06-05 03:15 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
> -rw-r--r--   3 hbase hadoop  44420 2015-06-05 04:36 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> The RS hosted the meta region for some time: 
> {code}
> 2015-06-05 03:14:28,692 INFO  [PostOpenDeployTasks:1588230740] 
> zookeeper.MetaTableLocator: Setting hbase:meta region location in ZooKeeper 
> as os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285
> ...
> 2015-06-05 03:15:17,302 INFO  
> [RS_CLOSE_META-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: Closed 
> hbase:meta,,1.1588230740
> {code}
> In between, a WAL is created: 
> {code}
> 2015-06-05 03:15:11,707 INFO  
> [RS_OPEN_META-os-enis-dal-test-jun-4-5:16020-0-MetaLogRoller] wal.FSHLog: 
> Rolled WAL 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
>  with entries=385, filesize=196.88 KB; new WAL 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> When CM killed the region server later master did not see these WAL files: 
> {code}
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:46,075 
> INFO  

[jira] [Commented] (HBASE-14905) VerifyReplication does not honour versions option

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039697#comment-15039697
 ] 

Hudson commented on HBASE-14905:


FAILURE: Integrated in HBase-Trunk_matrix #529 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/529/])
HBASE-14905 VerifyReplication does not honour versions option (Vishal (tedyu: 
rev 67ba6598b1be167409a31c4e210b7218823b7beb)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java


> VerifyReplication does not honour versions option
> -
>
> Key: HBASE-14905
> URL: https://issues.apache.org/jira/browse/HBASE-14905
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: 14905-v2.txt, HBASE-14905.patch, HBASE-14905_v3.patch, 
> HBASE-14905_v4.patch, test.patch
>
>
> source:
> hbase(main):001:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
>
> target:
> hbase(main):023:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449030090758, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029984282, 
> value=value   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
> /bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
> --versions=100 1 t1
> org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
>   GOODROWS=1
> Does not show any mismatch. Ideally it should show. This is because in 
> VerifyReplication Class maxVersion is not correctly set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14795) Enhance the spark-hbase scan operations

2015-12-03 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039729#comment-15039729
 ] 

Zhan Zhang commented on HBASE-14795:


Sure. I cannot submit a review on the review board, and will consult other 
people on how to do this.

> Enhance the spark-hbase scan operations
> ---
>
> Key: HBASE-14795
> URL: https://issues.apache.org/jira/browse/HBASE-14795
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Malaska
>Assignee: Zhan Zhang
>Priority: Minor
> Attachments: 
> 0001-HBASE-14795-Enhance-the-spark-hbase-scan-operations.patch
>
>
> This is a sub-jira of HBASE-14789.  This jira is to focus on the replacement 
> of TableInputFormat for a more custom scan implementation that will make the 
> following use case more effective.
> Use case:
> In the case you have multiple scan ranges on a single table with in a single 
> query.  TableInputFormat will scan the the outer range of the scan start and 
> end range where this implementation can be more pointed.
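To make the use case concrete, a sketch of the "pointed" approach, issuing one 
scan per requested range (range values made up; {{setStartRow}}/{{setStopRow}} 
per the 1.x client API):

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: three disjoint ranges from one query become three scans,
// instead of one TableInputFormat scan over the outer range ["a","z").
class MultiRangeScans {
  static List<Scan> pointedScans() {
    String[][] ranges = { {"a", "c"}, {"m", "n"}, {"x", "z"} };
    List<Scan> scans = new ArrayList<>();
    for (String[] r : ranges) {
      Scan s = new Scan();
      s.setStartRow(Bytes.toBytes(r[0]));
      s.setStopRow(Bytes.toBytes(r[1]));
      scans.add(s);
    }
    return scans;
  }
}
{code}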



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Vikas Vishwakarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040770#comment-15040770
 ] 

Vikas Vishwakarma commented on HBASE-14869:
---

[~lhofhansl] I looked at splunk, where we have GC logs indexed with statements 
like the one below that also include the "greater than" symbol between the 
before-GC and after-GC sizes:

ParNew: 218868K->9270K(235968K), 0.0077550 secs] 255143K->45545K(1520064K)

I ran rex queries to parse it and verified that it works fine; it was able to 
extract the proper field, so that should be ok.

splunk query:
"logline" | rex "->(?<to_gc>[^(]+)" | table _time to_gc

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040809#comment-15040809
 ] 

Duo Zhang commented on HBASE-14790:
---

[~chenheng] We should make a trade-off here. I do not think calling fsync every 
time is acceptable, since it means the namenode would face the same write 
pressure as the whole HBase cluster...

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14904) Mark Base[En|De]coder LimitedPrivate and fix binary compat issue

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041229#comment-15041229
 ] 

Hudson commented on HBASE-14904:


FAILURE: Integrated in HBase-1.1-JDK7 #1612 (See 
[https://builds.apache.org/job/HBase-1.1-JDK7/1612/])
HBASE-14904 Mark Base[En|De]coder LimitedPrivate and fix binary compat (enis: 
rev f3d3bd9d3b8eca176166391ef078391816b34bed)
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseEncoder.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseDecoder.java


> Mark Base[En|De]coder LimitedPrivate and fix binary compat issue
> 
>
> Key: HBASE-14904
> URL: https://issues.apache.org/jira/browse/HBASE-14904
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: hbase-14904_v1.patch, hbase-14904_v2.patch
>
>
> PHOENIX-2477 revealed that the changes from HBASE-14501 breaks binary 
> compatibility in Phoenix compiled with earlier versions of HBase and run 
> agains later versions. 
> This is one of the areas that the boundary is not clear, but it won't hurt us 
> to fix it. 
> The exception trace is: 
> {code}
> Exception in thread "main" java.lang.NoSuchFieldError: in
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$PhoenixBaseDecoder.(IndexedWALEditCodec.java:106)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueDecoder.(IndexedWALEditCodec.java:121)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec.getDecoder(IndexedWALEditCodec.java:63)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:292)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:148)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:316)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:281)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:269)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:418)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:247)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:422)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:357)
> {code}
> Although {{BaseDecoder.in}} is still there, it got changed to be a class 
> rather than an interface. BaseDecoder is marked Private, thus the binary 
> compat check is not run at all. Not sure whether it would have caught this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14905) VerifyReplication does not honour versions option

2015-12-03 Thread Vishal Khandelwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Khandelwal updated HBASE-14905:
--
Attachment: HBASE-14905_v4.patch

> VerifyReplication does not honour versions option
> -
>
> Key: HBASE-14905
> URL: https://issues.apache.org/jira/browse/HBASE-14905
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 2.0.0
>
> Attachments: 14905-v2.txt, HBASE-14905.patch, HBASE-14905_v3.patch, 
> HBASE-14905_v4.patch, test.patch
>
>
> source:
> hbase(main):001:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
>
> target:
> hbase(main):023:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449030090758, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029984282, 
> value=value   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
> /bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
> --versions=100 1 t1
> org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
>   GOODROWS=1
> Does not show any mismatch. Ideally it should show. This is because in 
> VerifyReplication Class maxVersion is not correctly set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14923) VerifyReplication should not mask the exception during result comaprision

2015-12-03 Thread Vishal Khandelwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Khandelwal updated HBASE-14923:
--
Status: Patch Available  (was: Open)

> VerifyReplication should not mask the exception during result comaprision 
> --
>
> Key: HBASE-14923
> URL: https://issues.apache.org/jira/browse/HBASE-14923
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 0.98.16, 2.0.0
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
>Priority: Minor
> Fix For: 2.0.0, 0.98.16
>
>
> hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
> Line:154
>  } catch (Exception e) {
> logFailRowAndIncreaseCounter(context, 
> Counters.CONTENT_DIFFERENT_ROWS, value);
>   }
> Just LOG.error needs to be added for more information for the failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14923) VerifyReplication should not mask the exception during result comaprision

2015-12-03 Thread Vishal Khandelwal (JIRA)
Vishal Khandelwal created HBASE-14923:
-

 Summary: VerifyReplication should not mask the exception during 
result comaprision 
 Key: HBASE-14923
 URL: https://issues.apache.org/jira/browse/HBASE-14923
 Project: HBase
  Issue Type: Bug
  Components: tooling
Affects Versions: 0.98.16, 2.0.0
Reporter: Vishal Khandelwal
Assignee: Vishal Khandelwal
Priority: Minor
 Fix For: 2.0.0, 0.98.16


hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java

Line:154
 } catch (Exception e) {
logFailRowAndIncreaseCounter(context, 
Counters.CONTENT_DIFFERENT_ROWS, value);
  }

Just LOG.error needs to be added for more information for the failure.
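In other words, the catch block would become something like the sketch below; 
the logger and the row accessor are assumed from context, not verified against 
the patch:

{code}
} catch (Exception e) {
  // Surface the underlying cause instead of silently counting a different row.
  LOG.error("Exception while comparing row "
      + Bytes.toStringBinary(value.getRow()), e);
  logFailRowAndIncreaseCounter(context, Counters.CONTENT_DIFFERENT_ROWS, value);
}
{code}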



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037731#comment-15037731
 ] 

Hadoop QA commented on HBASE-14906:
---

{color:green}+1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16750//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16750//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16750//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16750//console

This message is automatically generated.

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch, 
> HBASE-14906.v3.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   
> hbase.hregion.percolumnfamilyflush.size.lower.bound
> 16777216
> 
> If FlushLargeStoresPolicy is used and there are multiple column families,
> then every time that we hit the total memstore limit, we find out all the
> column families whose memstores exceed a "lower bound" and only flush them
> while retaining the others in memory. The "lower bound" will be
> "hbase.hregion.memstore.flush.size / column_family_number" by default
> unless value of this property is larger than that. If none of the families
> have their memstore size more than lower bound, all the memstores will be
> flushed (just as usual).
> 
>   
> {noformat}
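The proposed default boils down to a simple computation; a sketch using the 
property names from the snippet above (the method and its callers are 
illustrative):

{code}
import org.apache.hadoop.conf.Configuration;

// Sketch only: the per-family lower bound is flush.size / #families,
// unless the configured property value is larger than that.
class FlushPolicyMath {
  static long perFamilyLowerBound(Configuration conf, int columnFamilyNumber) {
    long flushSize = conf.getLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
    long configured = conf.getLong(
        "hbase.hregion.percolumnfamilyflush.size.lower.bound", 16L * 1024 * 1024);
    return Math.max(configured, flushSize / columnFamilyNumber);
  }
}
{code}

With a single column family the selection can be skipped entirely, which is 
the first improvement point listed above.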



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14905) VerifyReplication does not honour versions option

2015-12-03 Thread Vishal Khandelwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037692#comment-15037692
 ] 

Vishal Khandelwal commented on HBASE-14905:
---

[~ted_yu] and [~chenheng]: added a patch with another test along with the test 
provided by [~chenheng]. Please review the changes.

Thanks [~chenheng] for incorporating the log comment.

> VerifyReplication does not honour versions option
> -
>
> Key: HBASE-14905
> URL: https://issues.apache.org/jira/browse/HBASE-14905
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 2.0.0
>
> Attachments: 14905-v2.txt, HBASE-14905.patch, HBASE-14905_v3.patch, 
> HBASE-14905_v4.patch, test.patch
>
>
> source:
> hbase(main):001:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
>
> target:
> hbase(main):023:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449030090758, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029984282, 
> value=value   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
> /bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
> --versions=100 1 t1
> org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
>   GOODROWS=1
> Does not show any mismatch. Ideally it should show. This is because in 
> VerifyReplication Class maxVersion is not correctly set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14923) VerifyReplication should not mask the exception during result comaprision

2015-12-03 Thread Vishal Khandelwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Khandelwal updated HBASE-14923:
--
Attachment: HBASE-14923_v1.patch

> VerifyReplication should not mask the exception during result comaprision 
> --
>
> Key: HBASE-14923
> URL: https://issues.apache.org/jira/browse/HBASE-14923
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
>Priority: Minor
> Fix For: 2.0.0, 0.98.16
>
> Attachments: HBASE-14923_v1.patch
>
>
> hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
> Line:154
>  } catch (Exception e) {
> logFailRowAndIncreaseCounter(context, 
> Counters.CONTENT_DIFFERENT_ROWS, value);
>   }
> Just LOG.error needs to be added for more information for the failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14923) VerifyReplication should not mask the exception during result comparison

2015-12-03 Thread Vishal Khandelwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vishal Khandelwal updated HBASE-14923:
--
Summary: VerifyReplication should not mask the exception during result 
comparison   (was: VerifyReplication should not mask the exception during 
result comaprision )

> VerifyReplication should not mask the exception during result comparison 
> -
>
> Key: HBASE-14923
> URL: https://issues.apache.org/jira/browse/HBASE-14923
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
>Priority: Minor
> Fix For: 2.0.0, 0.98.16
>
> Attachments: HBASE-14923_v1.patch
>
>
> hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
> Line:154
>  } catch (Exception e) {
> logFailRowAndIncreaseCounter(context, 
> Counters.CONTENT_DIFFERENT_ROWS, value);
>   }
> Just LOG.error needs to be added for more information for the failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Vikas Vishwakarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikas Vishwakarma updated HBASE-14869:
--
Attachment: 14869-v5-0.98.txt

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v3-0.98.txt, 14869-v4-0.98.txt, 
> 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14895) Seek only to the newly flushed file on scanner reset on flush

2015-12-03 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037735#comment-15037735
 ] 

ramkrishna.s.vasudevan commented on HBASE-14895:


I have a patch ready for this. Once HBASE-13082 is checked in, I will rebase 
the patch on top of it. Found some interesting things while doing this with 
respect to the new shipped() call that we make.

> Seek only to the newly flushed file on scanner reset on flush
> -
>
> Key: HBASE-14895
> URL: https://issues.apache.org/jira/browse/HBASE-14895
> Project: HBase
>  Issue Type: Sub-task
>Reporter: ramkrishna.s.vasudevan
>Assignee: ramkrishna.s.vasudevan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Vikas Vishwakarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikas Vishwakarma updated HBASE-14869:
--
Attachment: 14869-v2-2.0.txt

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Vikas Vishwakarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037746#comment-15037746
 ] 

Vikas Vishwakarma commented on HBASE-14869:
---

The core test failure does not look related; it shows the following issue: 
"java.net.BindException: Address already in use"
Fixed the lineLengths issue and added a unit test in the attached patch.

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc, histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example, should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? Percentage of requests that missed an 
> SLA, etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14915) Hanging test : org.apache.hadoop.hbase.mapreduce.TestImportExport

2015-12-03 Thread Heng Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heng Chen updated HBASE-14915:
--
Attachment: HBASE-14915-branch-1.2.patch

Try this patch, [~stack] :)

We should wait for the puts to complete before we check the data.
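A sketch of the kind of wait meant here, using a {{BufferedMutator}} flush as 
one way to ensure the buffered puts have landed (the actual test may do it 
differently):

{code}
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;

// Sketch only: make sure all puts are actually in the table before the
// export/import round-trip is verified.
class PutAndWait {
  static void putAndWait(Connection conn, TableName table, List<Put> puts) throws Exception {
    try (BufferedMutator mutator = conn.getBufferedMutator(table)) {
      mutator.mutate(puts);
      mutator.flush();  // blocks until buffered mutations have been sent
    }
    // only now run the Export/Import job and check the data
  }
}
{code}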

> Hanging test : org.apache.hadoop.hbase.mapreduce.TestImportExport
> -
>
> Key: HBASE-14915
> URL: https://issues.apache.org/jira/browse/HBASE-14915
> Project: HBase
>  Issue Type: Sub-task
>  Components: hangingTests
>Reporter: stack
> Attachments: HBASE-14915-branch-1.2.patch
>
>
> This test hangs a bunch:
> Here is latest:
> https://builds.apache.org/job/HBase-1.2/418/jdk=latest1.7,label=Hadoop/consoleText



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14918) In-Memory MemStore Flush and Compaction

2015-12-03 Thread Eshcar Hillel (JIRA)
Eshcar Hillel created HBASE-14918:
-

 Summary: In-Memory MemStore Flush and Compaction
 Key: HBASE-14918
 URL: https://issues.apache.org/jira/browse/HBASE-14918
 Project: HBase
  Issue Type: Umbrella
Affects Versions: 2.0.0
Reporter: Eshcar Hillel


A memstore serves as the in-memory component of a store unit, absorbing all 
updates to the store. From time to time these updates are flushed to a file on 
disk, where they are compacted (by eliminating redundancies) and compressed 
(i.e., written in a compressed format to reduce their storage size).

We aim to speed up data access, and therefore suggest applying an in-memory 
memstore flush: flushing the active in-memory segment into an intermediate 
buffer where it can still be accessed by the application. Data in the buffer 
is subject to compaction and can be stored in any format that allows it to 
take up less space in RAM. The less space the buffer consumes, the longer it 
can reside in memory before data is flushed to disk, resulting in better 
performance.
Specifically, the optimization is beneficial for workloads with medium-to-high 
key churn which incur many redundant cells, like persistent messaging. 

We suggest structuring the solution as 3 subtasks (respectively, patches): 
(1) Infrastructure - refactoring of the MemStore hierarchy, introducing the 
segment (StoreSegment) as a first-class citizen, and decoupling the memstore 
scanner from the memstore implementation;
(2) Implementation of a new memstore (CompactingMemstore) with a non-optimized 
immutable segment representation, and 
(3) Memory optimization, including a compressed format representation and 
offheap allocations.

This Jira continues the discussion in HBASE-13408.
Design documents, evaluation results and previous patches can be found in 
HBASE-13408. 
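A very rough structural sketch of how the three subtasks could fit together; 
every name below is illustrative, not taken from the eventual patches:

{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Subtask 1: segments as first-class citizens.
interface StoreSegment {
  StoreSegment compact();  // eliminate redundant cells
  long heapSize();
}

// Subtask 2: a memstore that flushes in memory instead of straight to disk.
class CompactingMemstore {
  private StoreSegment active;
  private final Deque<StoreSegment> pipeline = new ArrayDeque<>();

  void inMemoryFlush() {
    // Move the active segment into the in-memory pipeline; subtask 3 would
    // additionally compress it and/or allocate it offheap.
    pipeline.addFirst(active.compact());
    active = newActiveSegment();
  }

  private StoreSegment newActiveSegment() { return null; /* elided */ }
}
{code}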



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14921) Memory optimizations

2015-12-03 Thread Eshcar Hillel (JIRA)
Eshcar Hillel created HBASE-14921:
-

 Summary: Memory optimizations
 Key: HBASE-14921
 URL: https://issues.apache.org/jira/browse/HBASE-14921
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.0
Reporter: Eshcar Hillel


Memory optimizations including compressed format representation and offheap 
allocations



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14920) Compacting Memstore

2015-12-03 Thread Eshcar Hillel (JIRA)
Eshcar Hillel created HBASE-14920:
-

 Summary: Compacting Memstore
 Key: HBASE-14920
 URL: https://issues.apache.org/jira/browse/HBASE-14920
 Project: HBase
  Issue Type: Sub-task
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel


Implementation of a new compacting memstore with non-optimized immutable 
segment representation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14772) Improve zombie detector; be more discerning

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037629#comment-15037629
 ] 

Hudson commented on HBASE-14772:


FAILURE: Integrated in HBase-Trunk_matrix #527 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/527/])
 HBASE-14772 Improve zombie detector; be more discerning; part2; (stack: rev 
69658ea4a916c8ea5e6dd7d056a548e8dce4e96d)
* dev-support/test-patch.sh


> Improve zombie detector; be more discerning
> ---
>
> Key: HBASE-14772
> URL: https://issues.apache.org/jira/browse/HBASE-14772
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: stack
>Assignee: stack
> Attachments: 14772v3.patch, zombie.patch, zombiev2.patch
>
>
> Currently, any surefire process with the hbase flag is a potential zombie. 
> Our zombie check currently takes a reading and if it finds candidate zombies, 
> it waits 30 seconds and then does another reading. If a concurrent build 
> going on, in both cases the zombie detector will come up positive though the 
> adjacent test run may be making progress; i.e. the cast of surefire processes 
> may have changed between readings but our detector just sees presence of  
> hbase surefire processes.
> Here is example:
> {code}
> Suspicious java process found - waiting 30s to see if there are just slow to 
> stop
> There appear to be 5 zombie tests, they should have been killed by surefire 
> but survived
> 12823 surefirebooter852180186418035480.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 7653 surefirebooter8579074445899448699.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 12614 surefirebooter136529596936417090.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 7836 surefirebooter3217047564606450448.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 13566 surefirebooter2084039411151963494.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
>  BEGIN zombies jstack extract
>  END  zombies jstack extract
> {code}
> 5 is the number of forked processes we allow when doing medium and large 
> tests so an adjacent build will always show as '5 zombies'.
> Need to add discerning if list of processes changes between readings.
> Can I also add a tag per build run that all forked processes pick up so I can 
> look at the current builds progeny only?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14749) Make changes to region_mover.rb to use RegionMover Java tool

2015-12-03 Thread Abhishek Singh Chouhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037455#comment-15037455
 ] 

Abhishek Singh Chouhan commented on HBASE-14749:


[~stack] [~apurtell] Should I backport these changes to branch-1 and 0.98?

> Make changes to region_mover.rb to use RegionMover Java tool
> 
>
> Key: HBASE-14749
> URL: https://issues.apache.org/jira/browse/HBASE-14749
> Project: HBase
>  Issue Type: Improvement
>Reporter: Abhishek Singh Chouhan
>Assignee: Abhishek Singh Chouhan
> Fix For: 2.0.0
>
> Attachments: HBASE-14749-v2.patch, HBASE-14749-v3.patch, 
> HBASE-14749-v3.patch, HBASE-14749-v4.patch, HBASE-14749-v5.patch, 
> HBASE-14749.patch, HBASE-14749.patch
>
>
> With HBASE-13014 in, we can now replace the ruby script such that it invokes 
> the Java Tool. Also expose timeout and no-ack mode which were added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14917) Log in console if individual tests in test-patch.sh fail or pass.

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037507#comment-15037507
 ] 

Hadoop QA commented on HBASE-14917:
---

{color:green}+1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16748//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16748//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16748//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16748//console

This message is automatically generated.

> Log in console if individual tests in test-patch.sh fail or pass.
> -
>
> Key: HBASE-14917
> URL: https://issues.apache.org/jira/browse/HBASE-14917
> Project: HBase
>  Issue Type: Bug
>Reporter: Appy
>Assignee: Appy
>Priority: Minor
> Attachments: HBASE-14917.patch
>
>
> Got 2 runs like 
> https://issues.apache.org/jira/browse/HBASE-14865?focusedCommentId=15037056=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15037056
>  where can't figure out what went wrong in patch testing.
> Logging results from individual tests to console as discussed 
> [here|https://mail-archives.apache.org/mod_mbox/hbase-dev/201512.mbox/%3CCAAjhxrrL4-qty562%3DcMyBJ2xyhGqHi3MFAgf9ygrzQf1%2BZmHtw%40mail.gmail.com%3E]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Yu Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated HBASE-14906:
--
Attachment: HBASE-14906.v3.patch

Updated the patch to resolve the UT failure; retrying HadoopQA.

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch, 
> HBASE-14906.v3.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   
> hbase.hregion.percolumnfamilyflush.size.lower.bound
> 16777216
> 
> If FlushLargeStoresPolicy is used and there are multiple column families,
> then every time that we hit the total memstore limit, we find out all the
> column families whose memstores exceed a "lower bound" and only flush them
> while retaining the others in memory. The "lower bound" will be
> "hbase.hregion.memstore.flush.size / column_family_number" by default
> unless value of this property is larger than that. If none of the families
> have their memstore size more than lower bound, all the memstores will be
> flushed (just as usual).
> 
>   
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14916) Add checkstyle_report.py to other branches

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037457#comment-15037457
 ] 

Hadoop QA commented on HBASE-14916:
---

{color:red}-1 overall{color}.  

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16747//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16747//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16747//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16747//console

This message is automatically generated.

> Add checkstyle_report.py to other branches
> --
>
> Key: HBASE-14916
> URL: https://issues.apache.org/jira/browse/HBASE-14916
> Project: HBase
>  Issue Type: Bug
>Reporter: Appy
>Assignee: Appy
> Attachments: HBASE-14916-branch-1.patch
>
>
> Given test-patch.sh is always run from master, and that it now uses 
> checkstyle_report.py, we should pull back the script to other branches too.
> Otherwise we see error like: 
> /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/jenkins.build/dev-support/test-patch.sh:
>  line 662: 
> /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase/dev-support/checkstyle_report.py:
>  No such file or directory
> [reference|https://builds.apache.org/job/PreCommit-HBASE-Build/16734//consoleFull]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14919) Infrastructure refactoring

2015-12-03 Thread Eshcar Hillel (JIRA)
Eshcar Hillel created HBASE-14919:
-

 Summary: Infrastructure refactoring
 Key: HBASE-14919
 URL: https://issues.apache.org/jira/browse/HBASE-14919
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.0
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel


Refactoring the MemStore hierarchy, introducing segment (StoreSegment) as 
first-class citizen and decoupling memstore scanner from the memstore 
implementation.
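
For illustration only, here is a hypothetical shape such a segment abstraction could take; all names below are my assumptions, not the attached patch:
{code}
// Purely illustrative, not the attached patch; names are assumptions.
// A segment as a first-class citizen exposes its own scanner, so memstore
// scanners no longer depend on a concrete MemStore implementation.
interface StoreSegment {
  /** Scanner over this segment only, at the given read point. */
  KeyValueScanner getScanner(long readPoint);

  /** Approximate heap size held by this segment. */
  long getSize();
}

// A MemStore then composes segments: one mutable active segment plus zero
// or more immutable segments awaiting in-memory compaction or flush.
{code}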



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037472#comment-15037472
 ] 

Duo Zhang commented on HBASE-14790:
---

Oh, I think we could not fix HBASE-14004 without changing the replication 
module of HBase. No matter how we implement the DFSOutputStream, think of this 
scenario:

1. An RS flushes a WAL entry to dn1, dn2 and dn3.
2. dn1 receives the WAL entry, and it is read by ReplicationSource and 
replicated to the slave cluster.
3. dn1 and the RS both crash; dn2 and dn3 have not received this WAL entry yet, 
and the RS has not bumped the GS of this block yet.
4. The NameNode completes the file with a length that does not contain this WAL 
entry, since the GS of the blocks on dn2 and dn3 is correct and the NameNode 
does not know there used to be a block with a longer length.
5. Whoops...

So I think every RS should keep an "acked length" of the WAL file currently 
being written, and when doing replication, ReplicationSource should ask for 
this length first before reading, and never read beyond it. With this logic, 
the implementation of the new "DFSOutputStream" becomes much simpler. We could 
just truncate the file to our "acked length" if writing the WAL failed on some 
datanode, and fail all the entries after the "acked length". This keeps 
everything consistent.
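
To make the proposal concrete, a minimal sketch of how the read side could 
honour the acked length; {{getAckedLength()}} and {{shipToPeerCluster()}} are 
hypothetical helpers, not existing APIs:
{code}
// Illustrative sketch of the proposed protocol, not an existing API. The
// ReplicationSource first asks the owning RS for the "acked length" of the
// WAL file still being written, and never ships entries beyond that offset.
long ackedLength = walProvider.getAckedLength(currentWalPath); // hypothetical
try (WAL.Reader reader = walFactory.createReader(fs, currentWalPath)) {
  WAL.Entry entry;
  while (reader.getPosition() < ackedLength && (entry = reader.next()) != null) {
    if (reader.getPosition() > ackedLength) {
      break; // entry crosses the acked boundary and may still be lost
    }
    shipToPeerCluster(entry); // hypothetical helper
  }
}
{code}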

Thanks.

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14749) Make changes to region_mover.rb to use RegionMover Java tool

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037491#comment-15037491
 ] 

Hudson commented on HBASE-14749:


FAILURE: Integrated in HBase-Trunk_matrix #526 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/526/])
HBASE-14749 Make changes to region_mover.rb to use RegionMover Java tool 
(stack: rev 91945d7f490cbb855a1d737d1979bf9931b0f2bd)
* bin/rolling-restart.sh
* bin/thread-pool.rb
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionMover.java
* bin/graceful_stop.sh
* bin/region_mover.rb


> Make changes to region_mover.rb to use RegionMover Java tool
> 
>
> Key: HBASE-14749
> URL: https://issues.apache.org/jira/browse/HBASE-14749
> Project: HBase
>  Issue Type: Improvement
>Reporter: Abhishek Singh Chouhan
>Assignee: Abhishek Singh Chouhan
> Fix For: 2.0.0
>
> Attachments: HBASE-14749-v2.patch, HBASE-14749-v3.patch, 
> HBASE-14749-v3.patch, HBASE-14749-v4.patch, HBASE-14749-v5.patch, 
> HBASE-14749.patch, HBASE-14749.patch
>
>
> With HBASE-13014 in, we can now replace the ruby script such that it invokes 
> the Java Tool. Also expose timeout and no-ack mode which were added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14772) Improve zombie detector; be more discerning

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037492#comment-15037492
 ] 

Hudson commented on HBASE-14772:


FAILURE: Integrated in HBase-Trunk_matrix #526 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/526/])
HBASE-14772 Improve zombie detector; be more discerning; part2 (stack: rev 
cf8d3bd641ef9f69dabecec1b9e87272493fe825)
* dev-support/zombie-detector.sh
* dev-support/test-patch.sh


> Improve zombie detector; be more discerning
> ---
>
> Key: HBASE-14772
> URL: https://issues.apache.org/jira/browse/HBASE-14772
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: stack
>Assignee: stack
> Attachments: 14772v3.patch, zombie.patch, zombiev2.patch
>
>
> Currently, any surefire process with the hbase flag is a potential zombie. 
> Our zombie check currently takes a reading and if it finds candidate zombies, 
> it waits 30 seconds and then does another reading. If a concurrent build 
> going on, in both cases the zombie detector will come up positive though the 
> adjacent test run may be making progress; i.e. the cast of surefire processes 
> may have changed between readings but our detector just sees presence of  
> hbase surefire processes.
> Here is example:
> {code}
> Suspicious java process found - waiting 30s to see if there are just slow to 
> stop
> There appear to be 5 zombie tests, they should have been killed by surefire 
> but survived
> 12823 surefirebooter852180186418035480.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 7653 surefirebooter8579074445899448699.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 12614 surefirebooter136529596936417090.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 7836 surefirebooter3217047564606450448.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
> 13566 surefirebooter2084039411151963494.jar -enableassertions -Dhbase.test 
> -Xmx2800m -XX:MaxPermSize=256m -Djava.security.egd=file:/dev/./urandom 
> -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true
>  BEGIN zombies jstack extract
>  END  zombies jstack extract
> {code}
> 5 is the number of forked processes we allow when doing medium and large 
> tests so an adjacent build will always show as '5 zombies'.
> Need to add discerning if list of processes changes between readings.
> Can I also add a tag per build run that all forked processes pick up so I can 
> look at the current builds progeny only?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037519#comment-15037519
 ] 

Hadoop QA commented on HBASE-14906:
---

{color:green}+1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16749//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16749//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16749//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16749//console

This message is automatically generated.

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   
> hbase.hregion.percolumnfamilyflush.size.lower.bound
> 16777216
> 
> If FlushLargeStoresPolicy is used and there are multiple column families,
> then every time that we hit the total memstore limit, we find out all the
> column families whose memstores exceed a "lower bound" and only flush them
> while retaining the others in memory. The "lower bound" will be
> "hbase.hregion.memstore.flush.size / column_family_number" by default
> unless value of this property is larger than that. If none of the families
> have their memstore size more than lower bound, all the memstores will be
> flushed (just as usual).
> 
>   
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037561#comment-15037561
 ] 

Yu Li commented on HBASE-14906:
---

From the HadoopQA report, we can observe the below failures (errors) although it says +1:
{noformat}
Tests in error: 
org.apache.hadoop.hbase.regionserver.TestBulkLoad.bulkHLogShouldThrowErrorWhenFamilySpecifiedAndHFileExistsButNotInTableDescriptor(org.apache.hadoop.hbase.regionserver.TestBulkLoad)
  Run 1: 
TestBulkLoad.bulkHLogShouldThrowErrorWhenFamilySpecifiedAndHFileExistsButNotInTableDescriptor
 � 
  Run 2: 
TestBulkLoad.bulkHLogShouldThrowErrorWhenFamilySpecifiedAndHFileExistsButNotInTableDescriptor
 � 
  Run 3: 
TestBulkLoad.bulkHLogShouldThrowErrorWhenFamilySpecifiedAndHFileExistsButNotInTableDescriptor
 � 

org.apache.hadoop.hbase.regionserver.TestBulkLoad.shouldThrowErrorIfBadFamilySpecifiedAsFamilyPath(org.apache.hadoop.hbase.regionserver.TestBulkLoad)
  Run 1: TestBulkLoad.shouldThrowErrorIfBadFamilySpecifiedAsFamilyPath �  
Unexpected ex...
  Run 2: TestBulkLoad.shouldThrowErrorIfBadFamilySpecifiedAsFamilyPath �  
Unexpected ex...
  Run 3: TestBulkLoad.shouldThrowErrorIfBadFamilySpecifiedAsFamilyPath �  
Unexpected ex...
{noformat}

And detailed exception:
{noformat}
shouldThrowErrorIfBadFamilySpecifiedAsFamilyPath(org.apache.hadoop.hbase.regionserver.TestBulkLoad)
  Time elapsed: 0.043 sec  <<< ERROR!
java.lang.Exception: Unexpected exception, 
expected but 
was
at 
org.apache.hadoop.hbase.regionserver.FlushLargeStoresPolicy.configureForRegion(FlushLargeStoresPolicy.java:59)
at 
org.apache.hadoop.hbase.regionserver.FlushPolicyFactory.create(FlushPolicyFactory.java:52)
at 
org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:845)
at 
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:786)
at 
org.apache.hadoop.hbase.regionserver.HRegion.createHRegion(HRegion.java:6195)
at 
org.apache.hadoop.hbase.regionserver.HRegion.createHRegion(HRegion.java:6204)
at 
org.apache.hadoop.hbase.regionserver.TestBulkLoad.testRegionWithFamiliesAndSpecifiedTableName(TestBulkLoad.java:239)
at 
org.apache.hadoop.hbase.regionserver.TestBulkLoad.testRegionWithFamilies(TestBulkLoad.java:249)
at 
org.apache.hadoop.hbase.regionserver.TestBulkLoad.shouldThrowErrorIfBadFamilySpecifiedAsFamilyPath(TestBulkLoad.java:207)
{noformat}

Since these are Errors rather than Failures, the tests stopped in the middle of execution.

The issue is caused by the patch here, since it doesn't handle the case where 
the column family number is zero. Although this won't happen in the real world, 
it is possible in unit test cases like TestBulkLoad.
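
For illustration, a rough sketch of the kind of guard needed, assuming the 
lower bound is derived in FlushLargeStoresPolicy#configureForRegion (per the 
stack trace above); field and accessor names here are assumptions, not the 
actual patch:
{code}
// Rough sketch (not necessarily the committed fix): when deriving the lower
// bound as "hbase.hregion.memstore.flush.size / column_family_number", a zero
// family count -- possible in tests like TestBulkLoad -- must never reach the
// division.
int familyNumber = region.getTableDesc().getFamilies().size();
if (familyNumber <= 1) {
  // Selection is pointless with zero or one family; behave like a plain
  // flush-all policy instead of dividing by zero.
  this.flushSizeLowerBound = Long.MAX_VALUE;
} else {
  this.flushSizeLowerBound = region.getMemstoreFlushSize() / familyNumber;
}
{code}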

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   
> hbase.hregion.percolumnfamilyflush.size.lower.bound
> 16777216
> 
> If FlushLargeStoresPolicy is used and there are multiple column families,
> then every time that we hit the total memstore limit, we find out all the
> column families whose memstores exceed a "lower bound" and only flush them
> while retaining the others in memory. The "lower bound" will be
> "hbase.hregion.memstore.flush.size / column_family_number" by default
> unless value of this property is larger than that. If none of the families
> have their memstore size more than lower bound, all the memstores will be
> flushed (just as usual).
> 
>   
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14701) Fix flakey Failed tests: TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199 null

2015-12-03 Thread Jingcheng Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037570#comment-15037570
 ] 

Jingcheng Du commented on HBASE-14701:
--

The cause of this issue has been found.
testSkipFlushTableSnapshot tries to check whether the snapshot family (which 
must have store files) contains a provided family; this check goes wrong when 
there are no store files under the snapshot family.
In this case, if the memstore flush finishes later than the start of 
admin.snapshot, the issue comes up.
We need to make sure the flush has finished before the snapshot starts. I will 
provide a patch to fix this.
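
A minimal sketch of the wait-for-flush idea, assuming the standard 
HBaseTestingUtility/Waiter test APIs; polling the region memstore size is my 
assumption of how "flush finished" could be observed, not necessarily what the 
attached patch does:
{code}
// Sketch: block until the flush completes before asking for the snapshot.
admin.flush(tableName);
TEST_UTIL.waitFor(30000, new Waiter.Predicate<Exception>() {
  @Override
  public boolean evaluate() throws Exception {
    for (HRegion region : TEST_UTIL.getHBaseCluster().getRegions(tableName)) {
      if (region.getMemstoreSize() > 0) {
        return false; // still unflushed data, keep waiting
      }
    }
    return true; // every region flushed; safe to snapshot now
  }
});
admin.snapshot(snapshotName, tableName);
{code}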

> Fix flakey Failed tests:
> TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199
>  null
> --
>
> Key: HBASE-14701
> URL: https://issues.apache.org/jira/browse/HBASE-14701
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: stack
>Assignee: Jingcheng Du
> Attachments: disable.txt
>
>
> This test has failed twice in last 24 hours. I removed it from master for now 
> over in HBASE-14678.  It fails a lot. See here: 
> https://builds.apache.org/job/HBase-TRUNK/6962/testReport/history/  It 
> recently got refactored to remove a bunch of duplicated code.  Assigning to 
> [~jingcheng...@intel.com] to take a look if you have a chance please. 
> Otherwise, unassign. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14701) Fix flakey Failed tests: TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199 null

2015-12-03 Thread Jingcheng Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingcheng Du updated HBASE-14701:
-
Status: Patch Available  (was: Open)

> Fix flakey Failed tests:
> TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199
>  null
> --
>
> Key: HBASE-14701
> URL: https://issues.apache.org/jira/browse/HBASE-14701
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: stack
>Assignee: Jingcheng Du
> Attachments: HBASE-14701.patch, disable.txt
>
>
> This test has failed twice in last 24 hours. I removed it from master for now 
> over in HBASE-14678.  It fails a lot. See here: 
> https://builds.apache.org/job/HBase-TRUNK/6962/testReport/history/  It 
> recently got refactored to remove a bunch of duplicated code.  Assigning to 
> [~jingcheng...@intel.com] to take a look if you have a chance please. 
> Otherwise, unassign. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14922) Delayed flush doesn't work causing flush storms.

2015-12-03 Thread Elliott Clark (JIRA)
Elliott Clark created HBASE-14922:
-

 Summary: Delayed flush doesn't work causing flush storms.
 Key: HBASE-14922
 URL: https://issues.apache.org/jira/browse/HBASE-14922
 Project: HBase
  Issue Type: Bug
Reporter: Elliott Clark
Assignee: Elliott Clark


Starting all regionservers at the same time means that most 
PeriodicMemstoreFlushers will be running at the same time, so all of these 
threads will queue flushes at about the same time.

This was supposed to be mitigated by Delayed; however, that isn't used at all. 
The result is that the flush queues fill up immediately and then drain, every 
hour.
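
For illustration, a minimal sketch of what honouring the delay could look like; 
{{RANDOM_DELAY_MAX_MS}} and {{requestDelayedFlush()}} are assumptions here, not 
the actual patch:
{code}
// Sketch of the intended behavior: spread the periodic flush requests with a
// random delay so regionservers started together do not all queue flushes in
// the same instant.
if (shouldPeriodicallyFlush(region)) { // hypothetical predicate
  long jitterMs = ThreadLocalRandom.current().nextLong(RANDOM_DELAY_MAX_MS);
  flushRequester.requestDelayedFlush(region, jitterMs); // honoured via Delayed
}
{code}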



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037630#comment-15037630
 ] 

Yu Li commented on HBASE-14790:
---

Agree that we may not fix HBASE-14004 by simply implementing a new 
DFSOutputStream, but I think the FanoutOutputStream is still useful for 
reducing WAL sync latency. A pipeline is good for throughput but not for 
latency.
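
As a toy illustration of the latency point (not HDFS code; 
{{writeToDataNode()}} is a hypothetical async helper):
{code}
// A pipeline pays roughly the sum of the per-hop latencies (dn1 -> dn2 ->
// dn3), while a fan-out writer pays only the maximum, since it writes to all
// datanodes concurrently. writeToDataNode() is a hypothetical helper whose
// future completes when that datanode acks.
List<CompletableFuture<Void>> acks = new ArrayList<>();
for (DatanodeInfo dn : replicas) {
  acks.add(writeToDataNode(dn, walEntryBytes));
}
// The sync completes once the slowest replica acks, rather than after the
// entry has traversed the whole chain sequentially.
CompletableFuture.allOf(acks.toArray(new CompletableFuture[0])).join();
{code}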

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14701) Fix flakey Failed tests: TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199 null

2015-12-03 Thread Jingcheng Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingcheng Du updated HBASE-14701:
-
Attachment: HBASE-14701.patch

Uploaded the patch that adds TestMobFlushSnapshotFromClient and 
TestFlushSnapshotFromClient.testSkipFlushTableSnapshot back.

> Fix flakey Failed tests:
> TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199
>  null
> --
>
> Key: HBASE-14701
> URL: https://issues.apache.org/jira/browse/HBASE-14701
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: stack
>Assignee: Jingcheng Du
> Attachments: HBASE-14701.patch, disable.txt
>
>
> This test has failed twice in last 24 hours. I removed it from master for now 
> over in HBASE-14678.  It fails a lot. See here: 
> https://builds.apache.org/job/HBase-TRUNK/6962/testReport/history/  It 
> recently got refactored to remove a bunch of duplicated code.  Assigning to 
> [~jingcheng...@intel.com] to take a look if you have a chance please. 
> Otherwise, unassign. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14904) Mark Base[En|De]coder LimitedPrivate and fix binary compat issue

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039724#comment-15039724
 ] 

Hudson commented on HBASE-14904:


SUCCESS: Integrated in HBase-1.2-IT #325 (See 
[https://builds.apache.org/job/HBase-1.2-IT/325/])
HBASE-14904 Mark Base[En|De]coder LimitedPrivate and fix binary compat (enis: 
rev a75a93f98ca003a172cc966464308b013b1769e4)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseDecoder.java
* hbase-common/src/main/java/org/apache/hadoop/hbase/codec/BaseEncoder.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALPrettyPrinter.java


> Mark Base[En|De]coder LimitedPrivate and fix binary compat issue
> 
>
> Key: HBASE-14904
> URL: https://issues.apache.org/jira/browse/HBASE-14904
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: hbase-14904_v1.patch, hbase-14904_v2.patch
>
>
> PHOENIX-2477 revealed that the changes from HBASE-14501 breaks binary 
> compatibility in Phoenix compiled with earlier versions of HBase and run 
> agains later versions. 
> This is one of the areas that the boundary is not clear, but it won't hurt us 
> to fix it. 
> The exception trace is: 
> {code}
> Exception in thread "main" java.lang.NoSuchFieldError: in
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$PhoenixBaseDecoder.(IndexedWALEditCodec.java:106)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueDecoder.(IndexedWALEditCodec.java:121)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec.getDecoder(IndexedWALEditCodec.java:63)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:292)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
>   at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:148)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:316)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:281)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:269)
>   at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:418)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.processFile(WALPrettyPrinter.java:247)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.run(WALPrettyPrinter.java:422)
>   at 
> org.apache.hadoop.hbase.wal.WALPrettyPrinter.main(WALPrettyPrinter.java:357)
> {code}
> Although {{BaseDecoder.in}} is still there, it got changed to be a class 
> rather than an interface. BaseDecoder is marked Private, thus the binary 
> compat check is not run at all. Not sure whether it would have caught this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14905) VerifyReplication does not honour versions option

2015-12-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039723#comment-15039723
 ] 

Hudson commented on HBASE-14905:


SUCCESS: Integrated in HBase-1.2-IT #325 (See 
[https://builds.apache.org/job/HBase-1.2-IT/325/])
HBASE-14905 VerifyReplication does not honour versions option (Vishal (tedyu: 
rev eb777ef289827ae385c0cae71ea64cd6618e14af)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSmallTests.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java


> VerifyReplication does not honour versions option
> -
>
> Key: HBASE-14905
> URL: https://issues.apache.org/jira/browse/HBASE-14905
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: 14905-v2.txt, HBASE-14905.patch, HBASE-14905_v3.patch, 
> HBASE-14905_v4.patch, test.patch
>
>
> source:
> hbase(main):001:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
>
> target:
> hbase(main):023:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449030090758, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029984282, 
> value=value   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
> /bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
> --versions=100 1 t1
> org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
>   GOODROWS=1
> Does not show any mismatch. Ideally it should show. This is because in 
> VerifyReplication Class maxVersion is not correctly set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039770#comment-15039770
 ] 

Duo Zhang commented on HBASE-14790:
---

https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf

The design doc of hflush and append already says that if all datanodes 
restart, then no guarantee can be provided. I think this is reasonable. Even 
if hflush succeeded, we could kill all the datanodes in the pipeline plus the 
client and restart them, and the file after lease recovery could be shorter 
than the acked length. There will always be some situation where we cannot 
know there is data loss, unless we call fsync every time to update the length 
on the namenode when writing the WAL, I think. :(

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041147#comment-15041147
 ] 

Duo Zhang commented on HBASE-14790:
---

And I found that {{hsync}} and {{hflush}} have different ack flows. {{hsync}} 
only sends the ack back once the data is successfully synced to local disk, so 
I think using {{hsync}} is enough to detect whether there is data loss (forget 
the {{fsync}}).
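
For reference, a tiny sketch of the two calls on the public 
{{FSDataOutputStream}} API; the path and variable names here are made up:
{code}
// hflush() only guarantees the data has reached the datanodes (their
// buffers), while hsync() acks only after each datanode has persisted the
// data to its local disk.
FSDataOutputStream out = fs.create(new Path("/hbase/WALs/example-wal"));
out.write(walEntryBytes);
out.hflush(); // visible to readers, but can be lost if all DNs restart
out.hsync();  // durable on the local disks of the pipeline datanodes
out.close();
{code}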

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14919) Infrastructure refactoring

2015-12-03 Thread Eshcar Hillel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eshcar Hillel updated HBASE-14919:
--
Attachment: HBASE-14919-V01.patch

patch attached

> Infrastructure refactoring
> --
>
> Key: HBASE-14919
> URL: https://issues.apache.org/jira/browse/HBASE-14919
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.0
>Reporter: Eshcar Hillel
>Assignee: Eshcar Hillel
> Attachments: HBASE-14919-V01.patch
>
>
> Refactoring the MemStore hierarchy, introducing segment (StoreSegment) as 
> first-class citizen and decoupling memstore scanner from the memstore 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14919) Infrastructure refactoring

2015-12-03 Thread Eshcar Hillel (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eshcar Hillel updated HBASE-14919:
--
Status: Patch Available  (was: Open)

> Infrastructure refactoring
> --
>
> Key: HBASE-14919
> URL: https://issues.apache.org/jira/browse/HBASE-14919
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.0
>Reporter: Eshcar Hillel
>Assignee: Eshcar Hillel
> Attachments: HBASE-14919-V01.patch
>
>
> Refactoring the MemStore hierarchy, introducing segment (StoreSegment) as 
> first-class citizen and decoupling memstore scanner from the memstore 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14701) Fix flakey Failed tests: TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199 null

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037863#comment-15037863
 ] 

Hadoop QA commented on HBASE-14701:
---

{color:green}+1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16751//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16751//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16751//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16751//console

This message is automatically generated.

> Fix flakey Failed tests:
> TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199
>  null
> --
>
> Key: HBASE-14701
> URL: https://issues.apache.org/jira/browse/HBASE-14701
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: stack
>Assignee: Jingcheng Du
> Attachments: HBASE-14701.patch, disable.txt
>
>
> This test has failed twice in last 24 hours. I removed it from master for now 
> over in HBASE-14678.  It fails a lot. See here: 
> https://builds.apache.org/job/HBase-TRUNK/6962/testReport/history/  It 
> recently got refactored to remove a bunch of duplicated code.  Assigning to 
> [~jingcheng...@intel.com] to take a look if you have a chance please. 
> Otherwise, unassign. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14905) VerifyReplication does not honour versions option

2015-12-03 Thread Heng Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037784#comment-15037784
 ] 

Heng Chen commented on HBASE-14905:
---

Please fix the indentation errors in patch v4.

> VerifyReplication does not honour versions option
> -
>
> Key: HBASE-14905
> URL: https://issues.apache.org/jira/browse/HBASE-14905
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 2.0.0
>
> Attachments: 14905-v2.txt, HBASE-14905.patch, HBASE-14905_v3.patch, 
> HBASE-14905_v4.patch, test.patch
>
>
> source:
> hbase(main):001:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
>
> target:
> hbase(main):023:0> scan 't1', {RAW => true, VERSIONS => 100}
> ROW  COLUMN+CELL  
>   
>
>  r1  column=f1:, timestamp=1449030102091, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449030090758, 
> value=value1112   
>
>  r1  column=f1:, timestamp=1449029984282, 
> value=value   
>
>  r1  column=f1:, timestamp=1449029774173, 
> value=value1001   
>
>  r1  column=f1:, timestamp=1449029709974, 
> value=value1002   
> /bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication 
> --versions=100 1 t1
> org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication$Verifier$Counters
>   GOODROWS=1
> Does not show any mismatch. Ideally it should show. This is because in 
> VerifyReplication Class maxVersion is not correctly set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction

2015-12-03 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037831#comment-15037831
 ] 

Eshcar Hillel commented on HBASE-13408:
---

This is an umbrella Jira continuing the current issue.

> HBase In-Memory Memstore Compaction
> ---
>
> Key: HBASE-13408
> URL: https://issues.apache.org/jira/browse/HBASE-13408
> Project: HBase
>  Issue Type: New Feature
>Reporter: Eshcar Hillel
>Assignee: Eshcar Hillel
> Fix For: 2.0.0
>
> Attachments: HBASE-13408-trunk-v01.patch, 
> HBASE-13408-trunk-v02.patch, HBASE-13408-trunk-v03.patch, 
> HBASE-13408-trunk-v04.patch, HBASE-13408-trunk-v05.patch, 
> HBASE-13408-trunk-v06.patch, HBASE-13408-trunk-v07.patch, 
> HBASE-13408-trunk-v08.patch, HBASE-13408-trunk-v09.patch, 
> HBASE-13408-trunk-v10.patch, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver02.pdf, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver03.pdf, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver04.pdf, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf, 
> InMemoryMemstoreCompactionEvaluationResults.pdf, 
> InMemoryMemstoreCompactionMasterEvaluationResults.pdf, 
> InMemoryMemstoreCompactionScansEvaluationResults.pdf, 
> StoreSegmentandStoreSegmentScannerClassHierarchies.pdf
>
>
> A store unit holds a column family in a region, where the memstore is its 
> in-memory component. The memstore absorbs all updates to the store; from time 
> to time these updates are flushed to a file on disk, where they are 
> compacted. Unlike disk components, the memstore is not compacted until it is 
> written to the filesystem and optionally to block-cache. This may result in 
> underutilization of the memory due to duplicate entries per row, for example, 
> when hot data is continuously updated. 
> Generally, the faster the data is accumulated in memory, more flushes are 
> triggered, the data sinks to disk more frequently, slowing down retrieval of 
> data, even if very recent.
> In high-churn workloads, compacting the memstore can help maintain the data 
> in memory, and thereby speed up data retrieval. 
> We suggest a new compacted memstore with the following principles:
> 1.The data is kept in memory for as long as possible
> 2.Memstore data is either compacted or in process of being compacted 
> 3.Allow a panic mode, which may interrupt an in-progress compaction and 
> force a flush of part of the memstore.
> We suggest applying this optimization only to in-memory column families.
> A design document is attached.
> This feature was previously discussed in HBASE-5311.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Yu Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated HBASE-14906:
--
Attachment: HBASE-14906.v4.patch

Another patch to resolve a regression UT failure caused by not updating the 
property name in TestPerColumnFamilyFlush after renaming the global config 
{{hbase.hregion.percolumnfamilyflush.size.lower.bound}}.

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch, 
> HBASE-14906.v3.patch, HBASE-14906.v4.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   
> hbase.hregion.percolumnfamilyflush.size.lower.bound
> 16777216
> 
> If FlushLargeStoresPolicy is used and there are multiple column families,
> then every time that we hit the total memstore limit, we find out all the
> column families whose memstores exceed a "lower bound" and only flush them
> while retaining the others in memory. The "lower bound" will be
> "hbase.hregion.memstore.flush.size / column_family_number" by default
> unless value of this property is larger than that. If none of the families
> have their memstore size more than lower bound, all the memstores will be
> flushed (just as usual).
> 
>   
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14918) In-Memory MemStore Flush and Compaction

2015-12-03 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037817#comment-15037817
 ] 

Eshcar Hillel commented on HBASE-14918:
---

Submitted patch for first sub-task 

> In-Memory MemStore Flush and Compaction
> ---
>
> Key: HBASE-14918
> URL: https://issues.apache.org/jira/browse/HBASE-14918
> Project: HBase
>  Issue Type: Umbrella
>Affects Versions: 2.0.0
>Reporter: Eshcar Hillel
>
> A memstore serves as the in-memory component of a store unit, absorbing all 
> updates to the store. From time to time these updates are flushed to a file 
> on disk, where they are compacted (by eliminating redundancies) and 
> compressed (i.e., written in a compressed format to reduce their storage 
> size).
> We aim to speed up data access, and therefore suggest to apply in-memory 
> memstore flush. That is to flush the active in-memory segment into an 
> intermediate buffer where it can be accessed by the application. Data in the 
> buffer is subject to compaction and can be stored in any format that allows 
> it to take up smaller space in RAM. The less space the buffer consumes the 
> longer it can reside in memory before data is flushed to disk, resulting in 
> better performance.
> Specifically, the optimization is beneficial for workloads with 
> medium-to-high key churn which incur many redundant cells, like persistent 
> messaging. 
> We suggest to structure the solution as 3 subtasks (respectively, patches). 
> (1) Infrastructure - refactoring of the MemStore hierarchy, introducing 
> segment (StoreSegment) as first-class citizen, and decoupling memstore 
> scanner from the memstore implementation;
> (2) Implementation of a new memstore (CompactingMemstore) with non-optimized 
> immutable segment representation, and 
> (3) Memory optimization including compressed format representation and 
> offheap allocations.
> This Jira continues the discussion in HBASE-13408.
> Design documents, evaluation results and previous patches can be found in 
> HBASE-13408. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction

2015-12-03 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037825#comment-15037825
 ] 

Eshcar Hillel commented on HBASE-13408:
---

Created new Jira HBASE-14918 with three sub-tasks.
Submitted a patch for the first refactoring task.
This Jira is EOL; if you wish to continue following this issue, please start 
watching HBASE-14918 (and/or HBASE-14919/HBASE-14920/HBASE-14921).

> HBase In-Memory Memstore Compaction
> ---
>
> Key: HBASE-13408
> URL: https://issues.apache.org/jira/browse/HBASE-13408
> Project: HBase
>  Issue Type: New Feature
>Reporter: Eshcar Hillel
>Assignee: Eshcar Hillel
> Fix For: 2.0.0
>
> Attachments: HBASE-13408-trunk-v01.patch, 
> HBASE-13408-trunk-v02.patch, HBASE-13408-trunk-v03.patch, 
> HBASE-13408-trunk-v04.patch, HBASE-13408-trunk-v05.patch, 
> HBASE-13408-trunk-v06.patch, HBASE-13408-trunk-v07.patch, 
> HBASE-13408-trunk-v08.patch, HBASE-13408-trunk-v09.patch, 
> HBASE-13408-trunk-v10.patch, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver02.pdf, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver03.pdf, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver04.pdf, 
> HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf, 
> InMemoryMemstoreCompactionEvaluationResults.pdf, 
> InMemoryMemstoreCompactionMasterEvaluationResults.pdf, 
> InMemoryMemstoreCompactionScansEvaluationResults.pdf, 
> StoreSegmentandStoreSegmentScannerClassHierarchies.pdf
>
>
> A store unit holds a column family in a region, where the memstore is its 
> in-memory component. The memstore absorbs all updates to the store; from time 
> to time these updates are flushed to a file on disk, where they are 
> compacted. Unlike disk components, the memstore is not compacted until it is 
> written to the filesystem and optionally to block-cache. This may result in 
> underutilization of the memory due to duplicate entries per row, for example, 
> when hot data is continuously updated. 
> Generally, the faster the data is accumulated in memory, more flushes are 
> triggered, the data sinks to disk more frequently, slowing down retrieval of 
> data, even if very recent.
> In high-churn workloads, compacting the memstore can help maintain the data 
> in memory, and thereby speed up data retrieval. 
> We suggest a new compacted memstore with the following principles:
> 1.The data is kept in memory for as long as possible
> 2.Memstore data is either compacted or in process of being compacted 
> 3.Allow a panic mode, which may interrupt an in-progress compaction and 
> force a flush of part of the memstore.
> We suggest applying this optimization only to in-memory column families.
> A design document is attached.
> This feature was previously discussed in HBASE-5311.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-14924) Slow response from HBASE REStful interface

2015-12-03 Thread Moulay Amine Jaidi (JIRA)
Moulay Amine Jaidi created HBASE-14924:
--

 Summary: Slow response from HBASE REStful interface
 Key: HBASE-14924
 URL: https://issues.apache.org/jira/browse/HBASE-14924
 Project: HBase
  Issue Type: Brainstorming
  Components: REST
Affects Versions: 1.1.1
 Environment: IBM Biginsights 4.1
Reporter: Moulay Amine Jaidi
Priority: Blocker



We are currently experiencing an issue with HBase through the REST interface. 
Previously we were on version 0.96 and were able to run the following REST 
command successfully and very quickly:

http://10.92.211.22:60800/tableName/RAWKEY.*

At the moment, after upgrading to 1.1.1, this request takes a lot longer to 
retrieve results (the count is 12 items to return).

Are there any configurations or known issues that may affect this?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038063#comment-15038063
 ] 

Hadoop QA commented on HBASE-14906:
---

{color:green}+1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16755//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16755//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16755//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16755//console

This message is automatically generated.

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch, 
> HBASE-14906.v3.patch, HBASE-14906.v4.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   
> hbase.hregion.percolumnfamilyflush.size.lower.bound
> 16777216
> 
> If FlushLargeStoresPolicy is used and there are multiple column families,
> then every time that we hit the total memstore limit, we find out all the
> column families whose memstores exceed a "lower bound" and only flush them
> while retaining the others in memory. The "lower bound" will be
> "hbase.hregion.memstore.flush.size / column_family_number" by default
> unless value of this property is larger than that. If none of the families
> have their memstore size more than lower bound, all the memstores will be
> flushed (just as usual).
> 
>   
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038167#comment-15038167
 ] 

stack commented on HBASE-14790:
---

bq. So I think every RS should keep an "acked length" of the WAL file 
currently being written, and when doing replication

HBase owning this fact is the way to go. There is no way to get this info from 
the current dfsclient, right? It would be a new piece of metadata that this 
new work would reveal?

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of the features if we only want to 
> log WAL. For example, we do not need pipeline recovery since we could just 
> close the old logger and open a new one. And also, we do not need to write 
> multiple blocks since we could also open a new logger if the old file is too 
> large.
> And the most important thing is that, it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency(such as HBASE-14004) when 
> using original DFSOutputStream due to its complicated logic. And the 
> complicated logic also force us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when 
> logging WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14906) Improvements on FlushLargeStoresPolicy

2015-12-03 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038139#comment-15038139
 ] 

Yu Li commented on HBASE-14906:
---

Confirmed there are no more UT failures in the testReport. However, the report 
still looks strange: the summary about core tests, javadoc, etc. seems to have 
disappeared. [~stack] could you please take a look here sir?

> Improvements on FlushLargeStoresPolicy
> --
>
> Key: HBASE-14906
> URL: https://issues.apache.org/jira/browse/HBASE-14906
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yu Li
>Assignee: Yu Li
> Fix For: 2.0.0
>
> Attachments: HBASE-14906.patch, HBASE-14906.v2.patch, 
> HBASE-14906.v3.patch, HBASE-14906.v4.patch
>
>
> When checking FlushLargeStoragePolicy, found below possible improving points:
> 1. Currently in selectStoresToFlush, we will do the selection no matter how 
> many actual families, which is not necessary for one single family
> 2. Default value for hbase.hregion.percolumnfamilyflush.size.lower.bound 
> could not fit in all cases, and requires user to know details of the 
> implementation to properly set it. We propose to use 
> "hbase.hregion.memstore.flush.size/column_family_number" instead:
> {noformat}
>   <property>
>     <name>hbase.hregion.percolumnfamilyflush.size.lower.bound</name>
>     <value>16777216</value>
>     <description>
>     If FlushLargeStoresPolicy is used and there are multiple column families,
>     then every time that we hit the total memstore limit, we find out all the
>     column families whose memstores exceed a "lower bound" and only flush them
>     while retaining the others in memory. The "lower bound" will be
>     "hbase.hregion.memstore.flush.size / column_family_number" by default
>     unless value of this property is larger than that. If none of the families
>     have their memstore size more than lower bound, all the memstores will be
>     flushed (just as usual).
>     </description>
>   </property>
> {noformat}
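
To make the proposed default concrete, a minimal sketch of how the lower 
bound would be chosen (128 MB below is the usual default of 
hbase.hregion.memstore.flush.size; the helper structure is illustrative only):

{noformat}
import org.apache.hadoop.conf.Configuration;

// Sketch: use flushSize / numColumnFamilies as the lower bound by default,
// unless the configured property value is larger than that.
final class FlushLowerBound {
  static long lowerBound(Configuration conf, int numColumnFamilies) {
    long flushSize = conf.getLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
    long configured = conf.getLong("hbase.hregion.percolumnfamilyflush.size.lower.bound", 0L);
    return Math.max(configured, flushSize / numColumnFamilies);
  }
}
// e.g. with a 128 MB flush size and 4 families, the default lower bound is
// 32 MB, so only memstores above 32 MB get flushed at the total memstore limit.
{noformat}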



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only

2015-12-03 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038178#comment-15038178
 ] 

Zhe Zhang commented on HBASE-14790:
---

[~stack] {{DataStreamer#block}} tracks the "number of bytes acked". It is 
returned by {{DFSOutputStream#getBlock}}.

[~Apache9] I'm still reading your analysis; I will get back shortly.
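
For illustration only, reading that acked length could look roughly like the 
following; note that {{DFSOutputStream#getBlock}} is HDFS-internal, so its 
visibility and exact signature vary across Hadoop versions:

{noformat}
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.hdfs.DFSOutputStream;

// Rough sketch, under the assumption that getBlock() is reachable from the
// caller; it is not a stable public API.
final class AckedLength {
  static long ackedBytes(FSDataOutputStream out) {
    DFSOutputStream dfsOut = (DFSOutputStream) out.getWrappedStream();
    // getBlock() returns the block being written; getNumBytes() reflects
    // the number of bytes acked, per the comment above.
    return dfsOut.getBlock().getNumBytes();
  }
}
{noformat}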

> Implement a new DFSOutputStream for logging WAL only
> 
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of its features if we only want 
> to log the WAL. For example, we do not need pipeline recovery, since we could 
> just close the old logger and open a new one. We also do not need to write 
> multiple blocks, since we could likewise open a new logger if the old file is 
> too large.
> And the most important thing is that it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency (such as HBASE-14004) when 
> using the original DFSOutputStream, due to its complicated logic. The 
> complicated logic also forces us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and we currently use 5 threads. But why 5, and not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for logging the 
> WAL, for correctness and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14919) Infrastructure refactoring

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038145#comment-15038145
 ] 

Hadoop QA commented on HBASE-14919:
---

{color:red}-1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16756//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16756//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16756//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16756//console

This message is automatically generated.

> Infrastructure refactoring
> --
>
> Key: HBASE-14919
> URL: https://issues.apache.org/jira/browse/HBASE-14919
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.0
>Reporter: Eshcar Hillel
>Assignee: Eshcar Hillel
> Attachments: HBASE-14919-V01.patch
>
>
> Refactoring the MemStore hierarchy, introducing the segment (StoreSegment) as 
> a first-class citizen and decoupling the memstore scanner from the memstore 
> implementation.
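
To make the idea concrete, a rough sketch of a first-class segment 
abstraction; the name StoreSegment comes from the description, but the method 
set below is purely an assumption, not taken from the patch:

{noformat}
import java.util.Iterator;
import org.apache.hadoop.hbase.Cell;

// Hypothetical sketch: the memstore is composed of segments, and scanners
// iterate over segments independently of the memstore implementation.
interface StoreSegment {
  long getSize();                              // heap size occupied by this segment
  Iterator<Cell> iterator();                   // scan this segment's cells in order
  boolean shouldSeek(long oldestUnexpiredTs);  // can this segment be skipped entirely?
}
{noformat}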



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14701) Fix flakey Failed tests: TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199 null

2015-12-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038172#comment-15038172
 ] 

stack commented on HBASE-14701:
---

Thanks [~jingcheng...@intel.com]. I've broken hadoopqa for the moment, hence 
the odd-looking report. Will retry your patch when things are put together 
again.

> Fix flakey Failed tests:
> TestMobFlushSnapshotFromClient>TestFlushSnapshotFromClient.testSkipFlushTableSnapshot:199
>  null
> --
>
> Key: HBASE-14701
> URL: https://issues.apache.org/jira/browse/HBASE-14701
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Reporter: stack
>Assignee: Jingcheng Du
> Attachments: HBASE-14701.patch, disable.txt
>
>
> This test has failed twice in the last 24 hours. I removed it from master for 
> now over in HBASE-14678.  It fails a lot. See here: 
> https://builds.apache.org/job/HBase-TRUNK/6962/testReport/history/  It 
> recently got refactored to remove a bunch of duplicated code.  Assigning to 
> [~jingcheng...@intel.com] to take a look if you have a chance, please. 
> Otherwise, unassign. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14923) VerifyReplication should not mask the exception during result comparison

2015-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037897#comment-15037897
 ] 

Hadoop QA commented on HBASE-14923:
---

{color:red}-1 overall{color}.  
{color:green}+1 core zombie tests -- no zombies!{color}.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16753//testReport/
Release Findbugs (version 2.0.3) warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16753//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16753//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/16753//console

This message is automatically generated.

> VerifyReplication should not mask the exception during result comparison 
> -
>
> Key: HBASE-14923
> URL: https://issues.apache.org/jira/browse/HBASE-14923
> Project: HBase
>  Issue Type: Bug
>  Components: tooling
>Affects Versions: 2.0.0, 0.98.16
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
>Priority: Minor
> Fix For: 2.0.0, 0.98.16
>
> Attachments: HBASE-14923_v1.patch
>
>
> hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java
> Line 154:
>   } catch (Exception e) {
>     logFailRowAndIncreaseCounter(context, Counters.CONTENT_DIFFERENT_ROWS, value);
>   }
> A LOG.error just needs to be added so that more information about the 
> failure is available.
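
For illustration, the fix would look roughly like this (the exact log message 
is made up here; LOG and Bytes are assumed to be in scope in that class):

{noformat}
  } catch (Exception e) {
    // Surface the underlying comparison failure instead of silently
    // swallowing it; the counter alone hides the root cause.
    LOG.error("Exception while comparing row "
        + Bytes.toStringBinary(value.getRow()), e);
    logFailRowAndIncreaseCounter(context, Counters.CONTENT_DIFFERENT_ROWS, value);
  }
{noformat}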



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-14869:
---
Fix Version/s: 0.98.17
   1.3.0
   2.0.0

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc. histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve, of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example, the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example; it should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? What percentage of requests missed an 
> SLA? Etc.
> Comments?
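
As an illustration of the banded counting proposed above, a minimal sketch 
(band edges are hard-coded here, whereas the description asks for them to be 
configurable):

{noformat}
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch: count requests per latency band so SLA questions can be answered
// after the fact. Upper bounds are in ms; anything above the last bound
// lands in the overflow bucket.
final class LatencyBands {
  private final long[] upperBoundsMs = {5, 10, 20, 50, 100, 1000};
  private final AtomicLongArray counts = new AtomicLongArray(upperBoundsMs.length + 1);

  void record(long latencyMs) {
    int band = 0;
    while (band < upperBoundsMs.length && latencyMs > upperBoundsMs[band]) {
      band++;
    }
    counts.incrementAndGet(band);
  }

  long countInBand(int band) {
    return counts.get(band);
  }
}
{noformat}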



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14924) Slow response from HBASE REStful interface

2015-12-03 Thread Moulay Amine Jaidi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038494#comment-15038494
 ] 

Moulay Amine Jaidi commented on HBASE-14924:


Thanks Andrew

> Slow response from HBASE REStful interface
> --
>
> Key: HBASE-14924
> URL: https://issues.apache.org/jira/browse/HBASE-14924
> Project: HBase
>  Issue Type: Brainstorming
>  Components: REST
>Affects Versions: 1.1.1
> Environment: IBM Biginsights 4.1
>Reporter: Moulay Amine Jaidi
>Priority: Blocker
>  Labels: REST, hbase-rest, slow-scan
>
> We are currently experiencing an issue with HBase through the REST interface. 
> Previously we were on version 0.96 and were able to run the following REST 
> command successfully and very quickly:
> http://10.92.211.22:60800/tableName/RAWKEY.*
> At the moment, after an upgrade to 1.1.1, this request takes a lot longer to 
> retrieve results (the count is 12 items to return).
> Are there any configurations or known issues that may affect this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14907) NPE of MobUtils.hasMobColumns in Build failed in Jenkins: HBase-Trunk_matrix » latest1.8,Hadoop #513

2015-12-03 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038390#comment-15038390
 ] 

Ted Yu commented on HBASE-14907:


lgtm

> NPE of MobUtils.hasMobColumns in Build failed in Jenkins: HBase-Trunk_matrix 
> » latest1.8,Hadoop #513
> 
>
> Key: HBASE-14907
> URL: https://issues.apache.org/jira/browse/HBASE-14907
> Project: HBase
>  Issue Type: Bug
>  Components: mob
>Reporter: Jingcheng Du
>Assignee: Jingcheng Du
> Attachments: HBASE-14907-V2.patch, HBASE-14907.patch
>
>
> An NPE is thrown when rolling back a failed table creation.
> 1. A table is being created and hits issues when creating the fs layout.
> 2. The creation is rolled back, which tries to delete the data from the fs. 
> It tries to delete the mob dir and needs to ask HMaster for the 
> HTableDescriptor, but by that time the table dir has been deleted and no 
> HTableDescriptor can be found.
> In this patch, it directly checks whether the mob directory exists instead 
> of checking the HTableDescriptor.
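
The fix direction can be sketched as follows; the filesystem handle and the 
derivation of the mob path are assumptions for illustration:

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: during rollback, test the mob directory's existence directly
// rather than fetching the HTableDescriptor (which is already gone).
final class MobRollbackCheck {
  static boolean hasMobDir(FileSystem fs, Path mobTableDir) throws IOException {
    // mobTableDir would be derived from the table name; derivation omitted.
    return fs.exists(mobTableDir);
  }
}
{noformat}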



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14866) VerifyReplication should use peer configuration in peer connection

2015-12-03 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038429#comment-15038429
 ] 

Gary Helmling commented on HBASE-14866:
---

bq. Should transformClusterKey just use standardizeZKQuorumServerString rather 
than having two different cases?

Sure, I can clean that up along the way.

bq. Seems like the entry for shouldbemissing will have two prefixes. Is that 
intended?

Thanks, nice catch! I'll fix that.

bq. why not move buildZKQuorumServerString and standardizeZKQuorumServerString 
into ZKUtil? 

ZKUtil deals mostly with ZK related operations (adding, watching znodes).  
Building a quorum string is how we handle multiple ZK configs in a 
Configuration.  So it seems more configuration related than ZK operation 
related to me.  In addition ZKUtil is in hbase-client so we can't use anything 
in it from hbase-common.  We could move everything, including 
createClusterConf(), to ZKUtil, but that seems a weird home for it to me, since 
the original problem was a failure to handle additional configuration 
properties (beyond ZK quorum config) for target/destination clusters.

bq. Most of these changes to HBaseConfiguration seem to be very replication 
specific. Should we have a different class for replication based configuration, 
so that HBaseConfiguration doesn't get too unwieldy?

bq. Agreed. Maybe we need something like ReplicationUtils ?

These changes go beyond replication usage.  This is a common problem wherever a 
program needs to handle talking to two clusters with a single Configuration.  
This applies to CopyTable, SyncTable and TableOutputFormat, none of which 
assumes any replication configuration.  By comparison, the replication code 
actually abstracts the usage of these ZK configuration utilities pretty well, 
except for the couple of problems in ReplicationAdmin and VerifyReplication.  
In the non-replication cases, we need to be able to handle: applying different 
ZK quorum configurations for the different clusters, and overriding other 
configuration properties (for example security-related config) for the other 
clusters.  {{HBaseConfiguration.createClusterConf()}} is the cleanest way I can 
see of abstracting this, especially for all of the non-replication usage.  This 
also seems like clearly a configuration problem to me, so HBaseConfiguration 
seems like the right home.  That is how we handle creating new HBase 
configurations everywhere (via {{HBaseConfiguration.create()}}), so this seems 
analogous.

If we're worried about bloating HBaseConfiguration with the additions moved 
from ZKUtil, then I could create a new util class in hbase-common to hold them, 
but I think we already have a proliferation of config related methods spread 
across multiple utility classes:
* ConfigurationUtils in hbase-server -- I would put the methods there, but we 
need access to them from hbase-client and hbase-server, so hbase-common seems 
like the right home.  ConfigurationUtils is annotated public, so we can't just 
move it without compatibility concerns.
* ZKUtil in hbase-client -- this class deals mostly with operations on 
ZooKeeper (adding, watching znodes), so I think removing all the config methods 
actually made for a cleaner separation of ZK operations vs. configuration 
related manipulations.  Since ZKUtil is in hbase-client we also can't depend on 
it from hbase-common.  We could move it to hbase-common, but that would 
introduce a new dependency on ZooKeeper in hbase-common that is not currently 
there.
* ZKConfig in hbase-client -- this currently deals with creating a ZK 
properties based configuration for HQuorumPeer.  So again moving the methods 
there would be expanding what it currently handles and has the additional 
problem of being in hbase-client, so createClusterConf() would have to move 
there as well.

It seems to me like we have two best options:
* move the ZK related config options to a new private util class in 
hbase-common.  This could even be ZKConfig, moved from hbase-client, since it's 
private.  It would be an expansion of its current responsibilities, but doesn't 
seem too bad.
* go back to the original targeted fixes to ReplicationAdmin and 
VerifyReplication, since those are the actual problems I'm trying to solve.

What do you guys think? I'll hold off on further changes to this until we get 
some consensus.
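
For reference, the usage pattern being argued for looks roughly like the 
following; it is a sketch against the API as proposed in this patch, so the 
exact signature may still change:

{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: derive a per-cluster Configuration from a base conf plus a cluster
// key, so the ZK quorum and any overridden properties (e.g. security config)
// apply only to the peer connection.
final class PeerConfExample {
  static Configuration peerConf(Configuration base, String peerClusterKey)
      throws IOException {
    return HBaseConfiguration.createClusterConf(base, peerClusterKey);
  }
}
{noformat}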


> VerifyReplication should use peer configuration in peer connection
> --
>
> Key: HBASE-14866
> URL: https://issues.apache.org/jira/browse/HBASE-14866
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Gary Helmling
>Assignee: Gary Helmling
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-14866.patch, HBASE-14866_v1.patch, 
> hbase-14866-v4.patch, 

[jira] [Commented] (HBASE-14900) Make global config option for ReplicationEndpoint

2015-12-03 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038343#comment-15038343
 ] 

Andrew Purtell commented on HBASE-14900:


Sounds good as a generic mechanism. Do you have a patch we could look at?

> Make global config option for ReplicationEndpoint
> -
>
> Key: HBASE-14900
> URL: https://issues.apache.org/jira/browse/HBASE-14900
> Project: HBase
>  Issue Type: Sub-task
>  Components: Replication
>Affects Versions: 2.0.0
>Reporter: Cody Marcel
>Assignee: Cody Marcel
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently ReplicationEndpoint implementations can only be configured through 
> the HBase shell. We should be able to use a property in hbase-site.xml to 
> globally set an alternate default. 
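
For example, something along these lines in hbase-site.xml; the property name 
below is purely hypothetical, since no such key exists yet:

{noformat}
  <property>
    <!-- Hypothetical key: a cluster-wide default ReplicationEndpoint -->
    <name>hbase.replication.endpoint.impl</name>
    <value>org.example.CustomReplicationEndpoint</value>
  </property>
{noformat}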



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HBASE-14822) Renewing leases of scanners doesn't work

2015-12-03 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038340#comment-15038340
 ] 

Andrew Purtell edited comment on HBASE-14822 at 12/3/15 7:03 PM:
-

bq. There's something with region replicas and the specific meaning of 
requesting 0 rows.
Ah, I should have tested more than the 0.98 patch.

bq. Seems like I should regroup and just add another flag to the scan PB 
request.
Seems so


was (Author: apurtell):
bq. There's something with region replicas and the specific meaning of 
requesting 0 rows.
Should have tested more than the 0.98 patch.

bq. Seems like I should regroup and just add another flag to the scan PB 
request.
Seems so

> Renewing leases of scanners doesn't work
> 
>
> Key: HBASE-14822
> URL: https://issues.apache.org/jira/browse/HBASE-14822
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.14
>Reporter: Samarth Jain
>Assignee: Lars Hofhansl
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: 14822-0.98-v2.txt, 14822-0.98-v3.txt, 14822-0.98.txt, 
> 14822-v3-0.98.txt, 14822.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HBASE-14924) Slow response from HBASE REStful interface

2015-12-03 Thread Andrew Purtell (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell resolved HBASE-14924.

Resolution: Invalid

This is the project development tracker.

For user help and troubleshooting advice, please write to 
u...@hbase.apache.org. 


> Slow response from HBASE REStful interface
> --
>
> Key: HBASE-14924
> URL: https://issues.apache.org/jira/browse/HBASE-14924
> Project: HBase
>  Issue Type: Brainstorming
>  Components: REST
>Affects Versions: 1.1.1
> Environment: IBM Biginsights 4.1
>Reporter: Moulay Amine Jaidi
>Priority: Blocker
>  Labels: REST, hbase-rest, slow-scan
>
> We are currently experiencing an issue with HBase through the REST interface. 
> Previously we were on version 0.96 and were able to run the following REST 
> command successfully and very quickly:
> http://10.92.211.22:60800/tableName/RAWKEY.*
> At the moment, after an upgrade to 1.1.1, this request takes a lot longer to 
> retrieve results (the count is 12 items to return).
> Are there any configurations or known issues that may affect this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14907) NPE of MobUtils.hasMobColumns in Build failed in Jenkins: HBase-Trunk_matrix » latest1.8,Hadoop #513

2015-12-03 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038407#comment-15038407
 ] 

Matteo Bertozzi commented on HBASE-14907:
-

+1 on v2

> NPE of MobUtils.hasMobColumns in Build failed in Jenkins: HBase-Trunk_matrix 
> » latest1.8,Hadoop #513
> 
>
> Key: HBASE-14907
> URL: https://issues.apache.org/jira/browse/HBASE-14907
> Project: HBase
>  Issue Type: Bug
>  Components: mob
>Reporter: Jingcheng Du
>Assignee: Jingcheng Du
> Attachments: HBASE-14907-V2.patch, HBASE-14907.patch
>
>
> An NPE is thrown when rolling back a failed table creation.
> 1. A table is being created and hits issues when creating the fs layout.
> 2. The creation is rolled back, which tries to delete the data from the fs. 
> It tries to delete the mob dir and needs to ask HMaster for the 
> HTableDescriptor, but by that time the table dir has been deleted and no 
> HTableDescriptor can be found.
> In this patch, it directly checks whether the mob directory exists instead 
> of checking the HTableDescriptor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14922) Delayed flush doesn't work causing flush storms.

2015-12-03 Thread Elliott Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elliott Clark updated HBASE-14922:
--
Fix Version/s: 1.3.0
   1.2.0
   2.0.0
Affects Version/s: 1.2.0
   2.0.0
   1.1.2
   Status: Patch Available  (was: Open)

> Delayed flush doesn't work causing flush storms.
> 
>
> Key: HBASE-14922
> URL: https://issues.apache.org/jira/browse/HBASE-14922
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 1.1.2, 2.0.0, 1.2.0
>Reporter: Elliott Clark
>Assignee: Elliott Clark
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-14922-v1.patch, HBASE-14922.patch
>
>
> Starting all regionservers at the same time means that most 
> PeriodicMemstoreFlushers will be running at the same time, so all of these 
> threads will queue flushes at about the same time.
> This was supposed to be mitigated by Delayed. However, that isn't nearly 
> enough. The result is that the flush queues immediately fill up and then 
> drain every hour.
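
One plausible mitigation, sketched here purely as an illustration (the actual 
patch may take a different approach), is to jitter the periodic delay so the 
flushers drift apart:

{noformat}
import java.util.concurrent.ThreadLocalRandom;

// Sketch: randomize the periodic flush delay so regionservers started
// together do not all queue flushes at the same instant.
final class FlushDelay {
  static long delayMs(long baseDelayMs, double jitterFraction) {
    long range = (long) (baseDelayMs * jitterFraction);
    long jitter = ThreadLocalRandom.current().nextLong(-range, range + 1);
    return baseDelayMs + jitter;
  }
}
// e.g. delayMs(3600000L, 0.2) spreads hourly flushes across +/- 12 minutes.
{noformat}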



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-14922) Delayed flush doesn't work causing flush storms.

2015-12-03 Thread Elliott Clark (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elliott Clark updated HBASE-14922:
--
Component/s: regionserver

> Delayed flush doesn't work causing flush storms.
> 
>
> Key: HBASE-14922
> URL: https://issues.apache.org/jira/browse/HBASE-14922
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.0.0, 1.2.0, 1.1.2
>Reporter: Elliott Clark
>Assignee: Elliott Clark
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-14922-v1.patch, HBASE-14922.patch
>
>
> Starting all regionservers at the same time means that most 
> PeriodicMemstoreFlushers will be running at the same time, so all of these 
> threads will queue flushes at about the same time.
> This was supposed to be mitigated by Delayed. However, that isn't nearly 
> enough. The result is that the flush queues immediately fill up and then 
> drain every hour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14869) Better request latency histograms

2015-12-03 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038333#comment-15038333
 ] 

Andrew Purtell commented on HBASE-14869:


The latest patches lgtm. 
Are we sure the new output is consumable and useful for the intended purpose, 
[~lhofhansl] [~vik.karma]? Maybe try this in a test environment (for our 
purposes, with Splunk)?

> Better request latency histograms
> -
>
> Key: HBASE-14869
> URL: https://issues.apache.org/jira/browse/HBASE-14869
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Vikas Vishwakarma
> Fix For: 2.0.0, 1.3.0, 0.98.17
>
> Attachments: 14869-test-0.98.txt, 14869-v1-0.98.txt, 
> 14869-v1-2.0.txt, 14869-v2-0.98.txt, 14869-v2-2.0.txt, 14869-v3-0.98.txt, 
> 14869-v4-0.98.txt, 14869-v5-0.98.txt, AppendSizeTime.png, Get.png
>
>
> I just discussed this with a colleague.
> The get, put, etc. histograms that each region server keeps are somewhat 
> useless (depending on what you want to achieve, of course), as they are 
> aggregated and calculated by each region server.
> It would be better to record the number of requests in certain latency 
> bands in addition to what we do now.
> For example, the number of gets that took 0-5ms, 6-10ms, 10-20ms, 20-50ms, 
> 50-100ms, 100-1000ms, > 1000ms, etc. (just as an example; it should be 
> configurable).
> That way we can do further calculations after the fact, and answer questions 
> like: How often did we miss our SLA? What percentage of requests missed an 
> SLA? Etc.
> Comments?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14822) Renewing leases of scanners doesn't work

2015-12-03 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038340#comment-15038340
 ] 

Andrew Purtell commented on HBASE-14822:


bq. There's something with region replicas and the specific meaning of 
requesting 0 rows.
Should have tested more than the 0.98 patch.

bq. Seems like I should regroup and just add another flag to the scan PB 
request.
Seems so

> Renewing leases of scanners doesn't work
> 
>
> Key: HBASE-14822
> URL: https://issues.apache.org/jira/browse/HBASE-14822
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.98.14
>Reporter: Samarth Jain
>Assignee: Lars Hofhansl
> Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.3, 0.98.17, 1.0.4
>
> Attachments: 14822-0.98-v2.txt, 14822-0.98-v3.txt, 14822-0.98.txt, 
> 14822-v3-0.98.txt, 14822.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-14903) Table Or Region?

2015-12-03 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038402#comment-15038402
 ] 

Andrew Purtell commented on HBASE-14903:


bq. I think this sentence "When a table is in the process of splitting," should 
be "When a Region is in the process of splitting," in chapter 62.2, hbase:meta

Yes


bq. By the way, is this document the latest? 
[http://hbase.apache.org/book.html#arch.overview] I will translate it!

Yes, thanks!


> Table Or Region?
> 
>
> Key: HBASE-14903
> URL: https://issues.apache.org/jira/browse/HBASE-14903
> Project: HBase
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.0.0
>Reporter: 胡托
>Priority: Blocker
>
>  I've been reading the latest Reference Guide and trying to translate it 
> into Chinese!
>  I think this sentence "When a table is in the process of splitting," 
> should be "When a Region is in the process of splitting," in chapter 62.2, 
> hbase:meta.
>  By the way, is this document the latest? 
> [http://hbase.apache.org/book.html#arch.overview] I will translate it!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

