[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553312#comment-14553312
 ] 

Hudson commented on HBASE-13071:


FAILURE: Integrated in HBase-TRUNK #6499 (See 
[https://builds.apache.org/job/HBase-TRUNK/6499/])
HBASE-13071 synchronous scanner -- cache size-in-bytes bug fix (stack: rev 
7f2b33dbbf90474a8f73e4d38ea8f6817ee3dcdb)
* 
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientAsyncPrefetchScanner.java


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071-0_98.patch, 
 HBASE-13071-BRANCH-1.patch, HBASE-13071-trunk-bug-fix.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548148#comment-14548148
 ] 

stack commented on HBASE-13071:
---

[~eshcar] Thanks for finding issue. Please open new issue. This one is dense 
enough already. Thank you (FYI, you do not need to clean up old patches -- 
thanks).

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071-0_98.patch, 
 HBASE-13071-BRANCH-1.patch, HBASE-13071-trunk-bug-fix.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-18 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547627#comment-14547627
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Hi ~stack,

Attached 2 new patches for branch-1 and 0.98.
While preparing these patches I discovered that in asynchronous scanner the 
cache byte-size variable is not updated in one of the places where polling item 
from the cache. Therefore I also attach a patch to fix this bug in trunk - it 
is a small local fix in ClientAsyncPrefetchScanner.java (this is already fixed 
in the patches for branch-1 and 0.98).

Will you be able to apply the patches?

Also do we need to open a new Jira for the refugee patch or is it ok to post it 
here?

Thanks,
Eshcar

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071-0_98.patch, 
 HBASE-13071-BRANCH-1.patch, HBASE-13071-trunk-bug-fix.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-14 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543715#comment-14543715
 ] 

Edward Bortnikov commented on HBASE-13071:
--

We'll be happy to get guidance as per how to contribute to refguide. 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-14 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543714#comment-14543714
 ] 

Edward Bortnikov commented on HBASE-13071:
--

We'll be happy to get guidance as per how to contribute to refguide. 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544043#comment-14544043
 ] 

stack commented on HBASE-13071:
---

bq. We'll be happy to get guidance as per how to contribute to refguide.

Make a patch for the refguide -- it is at src/main/asciidoc/ -- in a new issue? 
You'll have to figure where you think it sits best (perf, scan?) Copy/paste 
your release note would make good candidate text.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-13 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541789#comment-14541789
 ] 

Edward Bortnikov commented on HBASE-13071:
--

Release note attached, please advise if some different format is expected. 
We are working on the blog - will complete next week, hopefully should not 
preclude commit. 

Thanks [~stack] for volunteering to commit. Which release will this feature 
become candidate for - 1.1, 2.0, or both? 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541822#comment-14541822
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12732546/Releasenote-13071.txt
  against master branch at commit 220ac141bfcea7798faa5f73295ec61d8b173af9.
  ATTACHMENT ID: 12732546

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+0 tests included{color}.  The patch appears to be a 
documentation, build,
or dev-support patch that doesn't require tests.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/14035//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542396#comment-14542396
 ] 

Hudson commented on HBASE-13071:


SUCCESS: Integrated in HBase-TRUNK #6481 (See 
[https://builds.apache.org/job/HBase-TRUNK/6481/])
HBASE-13071 Hbase Streaming Scan Feature (stack: rev 
86b91997d0590fcf00634e9e90216e77da607fd2)
* 
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSimpleScanner.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/Scan.java
* 
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientAsyncPrefetchScanner.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* 
hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestClientSmallScanner.java
* 
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ReversedClientScanner.java
* 
hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSmallScanner.java
* 
hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestClientScanner.java
* 
hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestClientSmallReversedScanner.java
* 
hbase-client/src/main/java/org/apache/hadoop/hbase/client/TableConfiguration.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542120#comment-14542120
 ] 

stack commented on HBASE-13071:
---

Hopefully this will go into refguide when HBASE-13681 gets attention.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Fix For: 2.0.0

 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542093#comment-14542093
 ] 

stack commented on HBASE-13071:
---

Very nice release note. I took the contents and inserted them in the release 
note section in this JIRA (For future, see how when you hit 'edit', and if you 
scroll down, there is a 'release note' textbox). I added a sentence on the end 
about more load on server and YMMV.

Is there a place in the refguide where we should shove your release note? Just 
say and I will take care of it.

I committed to master, so 2.0. I tried the branch-1 patch but it failed apply. 
If you update it, I'll apply it to branch-1.

Thank you for the persistence.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, 
 gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, 
 hits.png, latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-11 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538419#comment-14538419
 ] 

Eshcar Hillel commented on HBASE-13071:
---

2 check styles error added in this patch: (1) forgot to remove redundant import 
in ClientSimpleScanner, (2) added a line to the method loadCache() in 
ClientScanner which caused it to overflow (151 lines).

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-11 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538418#comment-14538418
 ] 

Eshcar Hillel commented on HBASE-13071:
---

2 check styles error added in this patch: (1) forgot to remove redundant import 
in ClientSimpleScanner, (2) added a line to the method loadCache() in 
ClientScanner which caused it to overflow (151 lines).

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-11 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538227#comment-14538227
 ] 

Edward Bortnikov commented on HBASE-13071:
--

Thanks [~stack]. 

We'll post release notes to the jira tomorrow (is this the right destination?), 
and a blog post a tad later (probably, early next week), including the perf 
results. 


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537523#comment-14537523
 ] 

stack commented on HBASE-13071:
---

+1 on this last patch. At a minimum, its an nice illustration of what is 
possible. I'll commit in a day or so. Anyone else want to have a look?

A few questions [~eshcar]. 

Do the changes in table-scoped configuration -- the changes in 
TableConfiguration -- make sense? Having Scan defaults -- a client-side op -- 
in the Configuration seems a little overbroad. I seem no harm done since it off 
by default.

Is the checkstyle error from your report? No harm, I can check on commit so 
don't worry about it.

I suggest you write up a fat release note. Release note is probably how folks 
will learn of this feature (unless you do a blog post or something -- which 
might make sense since you have those nice perf findings -- have you redone 
them for this patch that is now size-base?).  If you have done th size-base 
perf analysis, suggest you link to that in the release notes too.

Nice work.



 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537161#comment-14537161
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12731786/HBASE-13071_trunk_rebase_2.0.patch
  against master branch at commit 5a2ca43fa16a95d8db67e5a3d8b48e4d3f3a9aeb.
  ATTACHMENT ID: 12731786

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1898 checkstyle errors (more than the master's current 1896 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13996//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13996//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13996//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13996//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-10 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537322#comment-14537322
 ] 

Edward Bortnikov commented on HBASE-13071:
--

Community - please review the last patch (covers all the previous requests).

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-06 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531326#comment-14531326
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Aligning with the size-in-bytes basis for scan requests -
here is a snippet of the code to set the cache capacity and to determine 
whether or not to invoke prefetch when next() is called in 
ClientAsyncPrefetchScanner

{code}
  // double buffer - double cache size
  private int calcCacheCapacity() {
int capacity = Integer.MAX_VALUE;
if(caching = 0  caching  (Integer.MAX_VALUE /2)) {
  capacity = caching * 2 + 1;
}
if(capacity == Integer.MAX_VALUE){
  capacity = (int) (maxScannerResultSize / ESTIMATED_SINGLE_RESULT_SIZE);
}
return capacity;
  }

  private boolean prefetchCondition() {
return
(getCacheCount()  getCountThreshold()) 
(getCacheSizeInBytes()  getSizeThreshold()) ;
  }

  private int getCountThreshold() {
return cacheCapacity / 2 ;
  }

  private long getSizeThreshold() {
return maxScannerResultSize / 2 ;
  }
{code}

where cacheSizeInBytes is an AtomicInteger that is updated whenever the cache 
is (increased when adding results to cache, decreased when removing them).

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526155#comment-14526155
 ] 

stack commented on HBASE-13071:
---

bq. 1. No problem having a per-scan parameter. The assumption is that scans 
should be big in order for the feature to be efficient. 

Good. Can say in javadoc that scan needs to be big to get the benefit.

bq. 2. No problem moving to the size-in-bytes parameter. The API should be 
identical for synchronous and asynchronous clients.

Good.

Bytes is what we do now rather than rows, since this work 
https://blogs.apache.org/hbase/

bq. In the optimistic interpretation, the client would directly relay the API 
parameter to the server. 

What parameter and why go to the server?

Whats wrong w/ optimistic other than client carrying extra data? I'd say go for 
optimistic.

I'd be fine that it 'costs' more on the server as long as tangible benefit.



 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-03 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525825#comment-14525825
 ] 

Edward Bortnikov commented on HBASE-13071:
--


1. No problem having a per-scan parameter. The assumption is that scans should 
be big in order for the feature to be efficient. 
2. No problem moving to the size-in-bytes parameter. The API should be 
identical for synchronous and asynchronous clients. 

Let's agree on the upper-bound parameter semantics (whether rows or bytes). 
Should it be conservative or optimistic? In the optimistic interpretation, the 
client would directly relay the API parameter to the server. A new prefetch 
request is issued when 50% of the old buffer consumed, so when the new buffer 
arrives the old one might not be released yet. This overlap should be short but 
the bound semantics are soft (best-effort). In the conservative interpretation, 
the client would adapt the API parameters, and issue requests for less data, to 
prevent any overflow. For legacy scans, there was no difference because the 
prefetch and computation parts did not overlap. 

Which approach would be better? 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-05-01 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523816#comment-14523816
 ] 

stack commented on HBASE-13071:
---

bq. I'd suggest leaving the use of this feature manual rather than expecting 
the system to auto-tune.

Ok. But you would have to turn it on globally for the client, right?  You can't 
do it on a per-scan basis.  How hard to add enabling this facility on a 
per-scan basis. It would make it easier to commit this feature if it was not a 
choice between being globally on or off.

This patch also does sizing using (row) caching count. Caching is going to go 
away as first class attribute of Scan in hbase 1.1+ as we have moved to a 
size-in-bytes basis for our scan requests; size-in-bytes would make more sense 
sizing the client-size cache too I'd say. Any plans for moving off the row 
caching basis?

Thanks.



 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-27 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513647#comment-14513647
 ] 

Edward Bortnikov commented on HBASE-13071:
--

I'd suggest leaving the use of this feature manual rather than expecting the 
system to auto-tune. It is often hard to know whether the application requires 
aggressive caching at the client side. For example, consider an application 
that does some tricky aggregation of the scanned data, in which the compute 
part is considerable. There is no way for HBase to know that in advance. The 
optimization does not come for free (up to 2x caching at the client side), so 
IMHO it's up to the application to decide whether to use it.  

Dear community - could you please review and vote on the last patch before it 
becomes obsolete again? The JIRA is still not assigned to any committer. 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-20 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503549#comment-14503549
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Done rebase.
Thanks to HBASE-13090 next and loadCache methods are separated so this rebase 
wasn't too painful (thanks [~jonathan.lawlor]).
I also changed some new scanner tests to account for the change in scanner 
cache interface (it is now a Queue).

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503897#comment-14503897
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12726649/HBASE-13071_trunk_rebase_1.0.patch
  against master branch at commit 702aea5b38ed6ad0942b0c59c3accca476b46873.
  ATTACHMENT ID: 12726649

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 protoc{color}.  The applied patch does not increase the 
total number of protoc compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1902 checkstyle errors (more than the master's current 1898 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any  new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13744//testReport/
Release Findbugs (version 2.0.3)warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13744//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13744//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13744//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBASE-13071_trunk_rebase_1.0.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-16 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497714#comment-14497714
 ] 

Eshcar Hillel commented on HBASE-13071:
---

ClientScanner is an abstract class that bares the code shared by the sync and 
async scanner classes, like the prefetch method.
#prefetch does not replace #next, it is invoked from #next in 
ClientSimpleScanner (the sync scanner) thereby preserving the same sync 
behavior as before. In ClientAsyncPrefetchScanner the prefetch method is 
invoked in the run method of a background thread when the buffer at the client 
side is half full.
I hope this makes sense.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498201#comment-14498201
 ] 

stack commented on HBASE-13071:
---

[~eshcar] I was talking about patch. Your patch no longer applies. Trunk has 
changed (the bit that does not apply is overwrite of next by prefetch...) Sorry 
I was not clear. Would you mind rebasing your patch? Thank you. Pardon my 
letting it rot.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-15 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496916#comment-14496916
 ] 

stack commented on HBASE-13071:
---

Pardon me [~eshcar] but the patch has rotted. I can't make sense of what is 
supposed to be happening in ClientScanner where we remove #next and replace it 
with #prefetch. Help me out. Thanks.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-14 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494748#comment-14494748
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I looked into the PerformanceEvaluation tool, the code is easy to read and 
maintain.
I believe the changes that are required in the implementation of testRow() in 
ScanTest:
  * set caching to 100 (or even to DEFAULT_HBASE_CLIENT_SCANNER_CACHING) 
instead of 30
  * add timeout before calling testScanner.next() [I think you already added 
this one]
  * make sure setFilter(FilterAllFilter) is not invoked
and optionally, add a scanRange10 class to do really big scans

[~stack], do you have by any chance the results of the client latency 
distribution collected by the tool in your previous experiments?

BTW, 30 is not the default value for prefetch size. 
DEFAULT_HBASE_CLIENT_SCANNER_CACHING is set to 100 in 0.98 and to 
Integer.MAX_VALUE in master.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-12 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491898#comment-14491898
 ] 

stack commented on HBASE-13071:
---

bq. the scans should be meaty, with large prefetches (we used 100-1000 
records), and the per-record processing at the client side should be 
non-negligible

What would you suggest then [~ebortnik] and [~eshcar]?  Defaults in hbase are 
30 rows at a time, not 1000. Would it make sense if this facility could be 
turned on by enabling a property on a Scan object?

bq. We are not familiar with the PerformanceEvaluation tool

Np. It is a coarse tool we've been using since early days to run loadings on 
hbase. See bin/hbase pe

bq. Re/ auto-tuning, I believe this is a bit premature. Let's keep the code 
simple, and let the client control. The optimization does not necessarily need 
to be a default.

I suggest auto-tune so the feature is useful more often than not. Regards it 
not needing to be the default, would be cool if user didn't have to go figure 
an opaque option to get this benefit.

Let me try and repro the benefit seen in posted graphs.

Thanks.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-09 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487621#comment-14487621
 ] 

Edward Bortnikov commented on HBASE-13071:
--

Chiming in ... The discussion is becoming loaded, let me summarize up to this 
point, so that we can figure out what's missing. Apologies about the possible 
duplication of what was said before, and might sound obvious. 

The feature is 100% client-side. The metrics we've been measuring are 
client-side as well. Ycsb is the workload generator; [~eshcar] provided the 
source. The network and server hardware are pretty much standard. In order for 
the optimization results to be observable, the scans should be meaty, with 
large prefetches (we used 100-1000 records), and the per-record processing at 
the client side should be non-negligible. In this context, it makes sense to 
mask the network delay by prefetching in the background.

We are not familiar with the PerformanceEvaluation tool. Does it measure 
server-side metrics? If so, it can definitely happen that the server side is 
more congested (and consequently, a bit slower) because many clients move 
faster. Still, the elimination of the stop-and-wait pattern is significant to 
boost the client throughput metrics, as our results suggest. We did not measure 
network congestion, but it's hard to believe that the 1G backbone gets 
congested in this context. 

Re/ auto-tuning, I believe this is a bit premature. Let's keep the code simple, 
and let the client control. The optimization does not necessarily need to be a 
default.

Thanks. 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-07 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482738#comment-14482738
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Thanks [~stack] for running this rig tests.
I believe the right way to see the benefit of this feature is to measure the 
scan.next() latency at the client side, there you should see the latency going 
down as you increase the delays.
Obviously, an async scanner puts more pressure on the server since the rate it 
is asking for records is higher. Since you are already stress testing the 
server with 50 (heavy scanners) clients, it could be that the extra pressure 
the async clients put on the server push it beyond its peak point.
Other than that, what is the prefetch size you are using? I assume it is less 
than 100. The scenarios in which async scanner would have maximum gain is when 
the client side processing (i.e., delays) are equal to the server side I/O time 
+ network delays. If the prefetch size is too small the network delays are more 
pronounced, and therefore the delays should be longer.

Finally, [~stack]  could you please share the client code you use for your 
tests, either via this Jira or send it directly to me, so I can take a closer 
look, and try it out myself.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-04-07 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483600#comment-14483600
 ] 

stack commented on HBASE-13071:
---

bq. Thanks stack for running this rig tests.

It is my pleasure.

bq. I believe the right way to see the benefit of this feature is to measure 
the scan.next() latency at the client side, there you should see the latency 
going down as you increase the delays.

Let me do this. I will do it with a single process of ten clients only so 
server is not near capacity.

I am using default of 30.  I will up it.

I am using PerformanceEvaluation tool with the scan1000 option. Above I 
describe the dataset I am scanning.

So, [~eshcar], it would seem that this feature would need to be self tuning to 
add general benefit given size of prefetch, client processing time, and other 
factors, all hinder its ability to shine?



 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, 
 gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, 
 latency.delay.png, latency.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-26 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381587#comment-14381587
 ] 

Edward Bortnikov commented on HBASE-13071:
--

+1 on this feature

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-19 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368689#comment-14368689
 ] 

Edward Bortnikov commented on HBASE-13071:
--

I second [~eshcar]. This is not a huge feature, and everybody seems to benefit. 
If there is anything else we should do about the code review - let's do it, and 
race to commit :)

Thanks,
Edward

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367141#comment-14367141
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12705332/HBASE-13071_trunk_10.patch
  against master branch at commit f9a17edc252a88c5a1a2c7764e3f9f65623e0ced.
  ATTACHMENT ID: 12705332

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13294//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-18 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367166#comment-14367166
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Hi everyone,

What would be the next thing to do to get this patch in (now that all the 
lights are green ;) )?

Thanks,
Eshcar

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-18 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366767#comment-14366767
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Yes it's all about setting the delays, but I don't want to change  them to make 
the results look better.They are there just to make the point.

  From: Edward Bortnikov (JIRA) j...@apache.org
 To: esh...@yahoo-inc.com 
 Sent: Monday, March 16, 2015 7:52 AM
 Subject: [jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
   

    [ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362777#comment-14362777
 ] 

Edward Bortnikov commented on HBASE-13071:
--

Eshcar,
Do you have an idea why there are still steps in the async graph? This probably 
means that our delays are not long enough. 
Eddie 


    On Monday, March 16, 2015 1:14 AM, Eshcar Hillel (JIRA) j...@apache.org 
wrote:
  

 
    [ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eshcar Hillel updated HBASE-13071:
--
    Attachment: HBASE-13071_trunk_10.patch




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362424#comment-14362424
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12704663/HBASE-13071_trunk_9.patch
  against master branch at commit 01bc979ea29e9282786de13c1cb8cbc107e92e9f.
  ATTACHMENT ID: 12704663

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1918 checkstyle errors (more than the master's current 1917 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13254//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, 
 HBASE-13071_trunk_7.patch, HBASE-13071_trunk_8.patch, 
 HBASE-13071_trunk_9.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-15 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362606#comment-14362606
 ] 

Eshcar Hillel commented on HBASE-13071:
---

New patch is attached.

Also attached the evaluation results for multiple parallel  scanners.
Bottom line, on client side results show similar latency improvement trends for 
multiple async scanners as for a single scanner thread.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-15 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362659#comment-14362659
 ] 

Ted Yu commented on HBASE-13071:


Results shown in the pdf are impressive.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362649#comment-14362649
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12704698/HBASE-13071_trunk_10.patch
  against master branch at commit 0505b7941e175d86004daf9a31ef5ce240d4570f.
  ATTACHMENT ID: 12704698

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 

 {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):   
at 
org.apache.hadoop.hbase.client.TestHTableMultiplexerFlushCache.testOnRegionChange(TestHTableMultiplexerFlushCache.java:114)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13256//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-15 Thread Edward Bortnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362777#comment-14362777
 ] 

Edward Bortnikov commented on HBASE-13071:
--

Eshcar,
Do you have an idea why there are still steps in the async graph? This probably 
means that our delays are not long enough. 
Eddie 


 On Monday, March 16, 2015 1:14 AM, Eshcar Hillel (JIRA) j...@apache.org 
wrote:
   

 
    [ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eshcar Hillel updated HBASE-13071:
--
    Attachment: HBASE-13071_trunk_10.patch




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, 
 HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, 
 HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.eshcar.png, 
 hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357070#comment-14357070
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12703903/HBASE-13071_trunk_8.patch
  against master branch at commit e66dca6cd1fd91bfa65a7cd4c68acb7a7f6a6c4e.
  ATTACHMENT ID: 12703903

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1926 checkstyle errors (more than the master's current 1924 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13185//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, 
 HBASE-13071_trunk_7.patch, HBASE-13071_trunk_8.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356691#comment-14356691
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12703862/HBASE-13071_trunk_7.patch
  against master branch at commit b436db7d70c8a90b0167dc0e0120f503efb37e3c.
  ATTACHMENT ID: 12703862

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1926 checkstyle errors (more than the master's current 1924 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapreduce.TestImportTSVWithTTLs
  org.apache.hadoop.hbase.client.TestHCM
  
org.apache.hadoop.hbase.mapreduce.TestImportTSVWithOperationAttributes
  
org.apache.hadoop.hbase.security.access.TestScanEarlyTermination
  
org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas
  org.apache.hadoop.hbase.regionserver.TestFSErrorsExposed
  org.apache.hadoop.hbase.client.TestAdmin1

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, 
 HBASE-13071_trunk_7.patch, HBaseStreamingScanDesign.pdf, 
 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-10 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355648#comment-14355648
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I think the best way to test this patch is to use the extended version of YCSB 
which supports measuring multi-step operations like scans (see the link to the 
code - I added the code in a separate branch).
The attached evaluation file describes the settings I used in my test.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355815#comment-14355815
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12703740/HBASE-13071_trunk_6.patch
  against master branch at commit ed93ddd94f6264ca246477bece4bf2c895706a22.
  ATTACHMENT ID: 12703740

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1926 checkstyle errors (more than the master's current 1924 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.procedure.TestProcedureManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13165//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-09 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353865#comment-14353865
 ] 

Eshcar Hillel commented on HBASE-13071:
---

A new patch is attached following the comments by [~jonathan.lawlor] and 
[~stack].

Some notes on implementation and design:
  * The default value is now set to async. (btw, this means async scanner is 
used in multiple tests, which used to have sync scan.)
  * The responsibility to invoke super.close() is now shifted to the pending 
prefetch thread, so it is not missed.
  * In case of sync scanner, the caching parameter indicates both the size of 
the buffer and the chunk size (#rows fetched). In case of async scanner, the 
parameter only indicates the later, while the buffer size is doubled. This 
should now be clear from the documentation, as well as from the new methods 
getCacheCapacity() and getThresholdSize().
  * cache and caching were members of ClientScanner even before this patch. I 
only added the abstract initCache() method. I agree that having two abstract 
classes is not the cleanest solution, but neither is having initCache() in a 
class where not all subclasses have a cache. As I said before, this hierarchy 
can benefit from some re-factoring (the right design might use composition like 
in the strategy pattern instead of inheritance, but all these decisions should 
not be in the scope of the current Jira).

Some notes on performance:
  * This feature is a client side feature and therefore should be tested in 
terms of client side latency.
  * This feature should reduce the latency, and in worse case scenario should 
not increase it (at least not significantly)
  * On the server side I would expect the same behavior as in sync scanner, 
since the same RPC calls are invoked, they only shift earlier in time to have 
the data ready at the client side before the user needs it.   
  * I cannot explain the behavior of the low humps in your test. Do you see 
this consistently? What is the exact setting? Is it a fixed number of scans or 
a fixed time?  

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354312#comment-14354312
 ] 

stack commented on HBASE-13071:
---

[~eshcar] How should I test so your patch shines?

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBASE-13071_trunk_5.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354065#comment-14354065
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12703536/HBASE-13071_trunk_5.patch
  against master branch at commit 5025d3aa91d18310fc4d738114ee2b58e48c46c2.
  ATTACHMENT ID: 12703536

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1929 checkstyle errors (more than the master's current 1927 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

{color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTSVWithTTLs
  org.apache.hadoop.hbase.regionserver.TestFSErrorsExposed
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.client.TestHCM
  
org.apache.hadoop.hbase.security.access.TestScanEarlyTermination
  
org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas
  org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient
  org.apache.hadoop.hbase.client.TestAdmin1
  org.apache.hadoop.hbase.client.TestFromClientSide
  
org.apache.hadoop.hbase.mapreduce.TestImportTSVWithOperationAttributes

 {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):   
at 
org.apache.hadoop.hbase.client.TestHTableMultiplexerFlushCache.testOnRegionChange(TestHTableMultiplexerFlushCache.java:114)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-05 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348984#comment-14348984
 ] 

stack commented on HBASE-13071:
---

bq. Makes sense?

Yes. Thank you. The test 'reads' the result but does not processing, true. Yes, 
30 per batch. What test would you like me to run that makes this feature shine? 
Thanks [~eshcar]

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-04 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348304#comment-14348304
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I will work on a new version following the comments above (will take a few 
days).

[~stack] I will get back with a full answer to your questions, first I want to 
do some additional perf tests on my side.
The cause of the behavior of the tall humps can be rooted in the way you 
performed the tests. What is the size of the prefetch? 30?
If the tests simply call next in a loop without actually processing the data 
(which is simulated with delays in my tests) then the user exhaust the cache 
very quickly even though the prefetch is done in the background, and therefore 
the behavior is equivalent to a sync scan when the app needs to wait for the 
current prefetch to complete. 
It doesn't need to wait for the prefetch thread to complete loading the cache 
at the client side but this is minor when compared to the round trip time at 
the server side.
As I mentioned before, the assumption underlying this new feature is that the 
processing time at the client side can be balanced by the network and IO at the 
server side. If the processing is short then the network+IO is still a 
bottleneck. Makes sense?

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345030#comment-14345030
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12702139/HBASE-13071_trunk_3.patch
  against master branch at commit daed00fc98167870463e77b620e9adb6ce9b204d.
  ATTACHMENT ID: 12702139

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1939 checkstyle errors (more than the master's current 1936 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

 {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s): 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13061//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345329#comment-14345329
 ] 

stack commented on HBASE-13071:
---

[~eshcar] Any reason for why your formatting is unorthodox (compared to rest of 
code base?)  There is some here on formatting if that'll help: 
http://hbase.apache.org/book.html#_ides

Please add class comments describing what the class does.

isPrefetchRunning is the name of the method you would call to find the value of 
the boolean prefetchRunning; data members shouldn't have 'is' prefix (javabean 
idiom)

Do we have to add a new executor pool? Could we take one in on construction at 
least optionally (with perhaps the default being we pass in the tables 
executor?).  This could be done in a followup patch.  In general we create too 
many threads in the client and have been trying to go on a diet (but you know 
how diet's go)... in fact you take in a pool on construction...Can you exploit 
this passed-in pool rather than make one of your own?

On close, if a prefetch outstanding, we let it continue rather than interrupt 
it?

We already have AbstractClientScanner. Rather than make ClientScanner also 
abstract, could we not push what ClientScanner has down into ACS? Or add a 
'cache' or 'prefetch' interface that subclasses of ACS could implement?

Your formatting is a little irregular (smile).

IMO this should be ON by default.

I'm trying to get you some pretty pictures to show speedup.  Will be back.

Thanks for the patch.



 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346017#comment-14346017
 ] 

stack commented on HBASE-13071:
---

Wondering why we have a pool data member though we are passing pool to the 
super class and the super class has a getPool accessor.

On your feedback, understand that you add caching to ClientScanner but why not 
AbstractClientScanner? A hierarchy that is AbstractClientScanner subclassed to 
make a ClientScanner (which is itself abstract) which is subclassed by 
ClientAsyncPrefetchScanner is a little ugly; can we cut out the ClientScanner 
tier?

bq, How would you suggest to get a hold of the thread executing the prefetch, 
so as to interrupt it on close?

You will only ever have a single prefetcher?  If so, executorpool is probably 
overkill?  Just start a single thread that you control?

Formatting irregularities are still in there... 

Pictures coming.. they are provoking interesting questions (smile)


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread Jonathan Lawlor (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346012#comment-14346012
 ] 

Jonathan Lawlor commented on HBASE-13071:
-

[~eshcar] I have added some review below to follow up on stack's comments:

When the defining the capacity for the concurrent queue as below:
{code:title=ClientAsyncPrefetchScanner.java}
...
protected void initCache() {
  // concurrent cache
  // double buffer - double cache size
  cache = new LinkedBlockingQueueResult(this.caching*2 + 1);
}
...
{code}
we need to check the size of caching first to make sure that overflow does not 
occur. For example, in the case that this.caching  Integer.Max_Value / 2, this 
will throw an IllegalArgumentException. This is important in the case that the 
user has configured Scan#caching=Integer.Max_Value and Scan#maxResultSize to be 
a nice chunk size (this configuration is used in instances where the user wants 
to receives responses of a certain heap size from the server rather than 
responses with a certain number of rows).

When close() is called and the prefetch is running we still need to end up 
calling super.close() at some point. In ClientScanner, the call to close() 
ensures that the RegionScanner is closed on the server side so it is important 
that we do not miss this call. 

The javadoc on the async ClientScanner seems to indicate that the prefetch will 
be issued when the cache is half full, but it looks like the cache size check 
is using caching rather than caching / 2. My guess is that the first two calls 
to ClientScanner#next() would both kick off RPC calls. The first would fetch 
the initial chunk containing caching number of rows, and the second call to 
next would kick off a prefetch (since one Result was consumed by first call and 
thus cache size will be caching - 1).

Some javadoc on the async parameter inside Scan.java may be helpful just to 
clarify how the parameter is used. For example, the parameter currently won't 
have any effect in the case that the user has set Scan#setSmall or 
Scan#setReversed

Looks like there may be some minor formatting issues that are still hanging 
around in the latest patch (e.g. Tabs should be 2 spaces instead of 4). You may 
have already seen it, but in the link [~stack] pointed out, there is mention of 
a plugin that can be used with IntelliJ to let eclipse formatters work with it; 
any luck with that? (having the formatter in the IDE avoids headaches :))

Looking forward to getting this one in !

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345929#comment-14345929
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Thanks [~stack] for your comments, I applied most of them.

  ** The cache is defined in the context of ClientScanner, therefore 
initializing it and the prefetch methods are defined here.
IMHO, the entire hierarchy requires major refactoring (e.g., due to code 
replication), but this should be done in the scope of a different jira :).
  ** How would you suggest to get a hold of the thread executing the prefetch, 
so as to interrupt it on close?
  ** Apologies for the formatting irregularities. I use IntelliJ which fails to 
import the eclipse formatting as suggested in the help page you referred me to.
  ** Waiting (patiently) for the pictures...

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, 
 HBASE-13071_trunk_4.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346119#comment-14346119
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12702279/HBASE-13071_trunk_4.patch
  against master branch at commit 524791bcf5d41202b5da9293896078b45067699a.
  ATTACHMENT ID: 12702279

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1940 checkstyle errors (more than the master's current 1935 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/checkstyle-aggregate.html

Javadoc warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346239#comment-14346239
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12702279/HBASE-13071_trunk_4.patch
  against master branch at commit 883d6fd8e512b14c967d2f7acf78d2b1d40e40fe.
  ATTACHMENT ID: 12702279

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1940 checkstyle errors (more than the master's current 1935 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

 {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):   
at 
org.apache.tajo.engine.query.TestJoinQuery.testLeftOuterJoinWithEmptySubquery1(TestJoinQuery.java:473)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/checkstyle-aggregate.html

Javadoc warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, 
 HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, 
 HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, 
 gc.eshcar.png, hits.eshcar.png, network.png


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342129#comment-14342129
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12701690/HBASE-13071_trunk_1.patch
  against master branch at commit dad2474f08d201d09989e36f5cf1c25d3fa4acee.
  ATTACHMENT ID: 12701690

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all 
supported hadoop versions (2.4.1 2.5.2 2.6.0)

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
1946 checkstyle errors (more than the master's current 1937 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+ClusterConnection connection, 
RpcRetryingCallerFactory rpcCallerFactory,
+
super(configuration,scan,name,connection,rpcCallerFactory,rpcControllerFactory,pool,replicaCallTimeoutMicroSecondScan);
+  public ClientSimpleScanner(Configuration configuration, Scan scan, TableName 
name, ClusterConnection connection,
+ RpcRetryingCallerFactory rpcCallerFactory, 
RpcControllerFactory rpcControllerFactory,
+ ExecutorService pool, int 
replicaCallTimeoutMicroSecondScan) throws IOException {
+
super(configuration,scan,name,connection,rpcCallerFactory,rpcControllerFactory,pool,replicaCallTimeoutMicroSecondScan);
+  public static final String HBASE_CLIENT_SCANNER_ASYNC_PREFETCH = 
hbase.client.scanner.async.prefetch;

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.TestInterfaceAudienceAnnotations
  org.apache.hadoop.hbase.client.TestOperation

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-01 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342109#comment-14342109
 ] 

Eshcar Hillel commented on HBASE-13071:
---

New patches for 0.98 and trunk are available.
Link to review board.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-01 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342842#comment-14342842
 ] 

stack commented on HBASE-13071:
---

I tried this and saw improvement. Let me come back with some pretty pictures.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-03-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342280#comment-14342280
 ] 

Ted Yu commented on HBASE-13071:


ClientAsyncPrefetchScanner.java and ClientSimpleScanner.java need audience 
annotation.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341359#comment-14341359
 ] 

Ted Yu commented on HBASE-13071:


There has been effort to stabilize trunk build.

Please prepare patch for trunk.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338439#comment-14338439
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12701068/HBASE-13071-v1.patch
  against master branch at commit 1c957b65b16a8706caee140c18b84ea48a0dc0aa.
  ATTACHMENT ID: 12701068

{color:red}-1 @author{color}.  The patch appears to contain 2 @author tags 
which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12978//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338596#comment-14338596
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I tried running the code from master, but encountered problems (even without 
the patch).
Specifically, when running commands from hbase shell wasn't able to exit the 
shell properly, therefore created the patch based on 0.98.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338479#comment-14338479
 ] 

Hadoop QA commented on HBASE-13071:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12701075/HBASE-13071-v2.patch
  against master branch at commit 1c957b65b16a8706caee140c18b84ea48a0dc0aa.
  ATTACHMENT ID: 12701075

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12979//console

This message is automatically generated.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338571#comment-14338571
 ] 

Ted Yu commented on HBASE-13071:


For ClientAsyncPrefetchScanner.java and ClientSimpleScanner.java:
please add license header
add annotation for audience

Mind putting patch on reviewboard ?

Thanks

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338581#comment-14338581
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I use the following commands to create the patches
git format-patch 0.98 --minimal --stdout  HBASE-13071-v1.patch 
git diff --no-prefix 0.98  HBASE-13071-v2.patch

Any idea why these can't be applied?

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338590#comment-14338590
 ] 

Ted Yu commented on HBASE-13071:


Applying patch v2 resulted in the following rejects:
{code}
-rw-r--r--  1 tyu  staff  1803 Feb 26 07:48 
./hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java.rej
-rw-r--r--  1 tyu  staff  730 Feb 26 07:48 
./hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSmallScanner.java.rej
-rw-r--r--  1 tyu  staff  689 Feb 26 07:48 
./hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java.rej
-rw-r--r--  1 tyu  staff  696 Feb 26 07:48 
./hbase-client/src/main/java/org/apache/hadoop/hbase/client/ReversedClientScanner.java.rej
-rw-r--r--  1 tyu  staff  1374 Feb 26 07:48 
./hbase-client/src/main/java/org/apache/hadoop/hbase/client/Scan.java.rej
-rw-r--r--  1 tyu  staff  1504 Feb 26 07:48 
./hbase-client/src/main/java/org/apache/hadoop/hbase/client/TableConfiguration.java.rej
{code}
Please update your workspace to latest master.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338599#comment-14338599
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I tried running the code from master, but encountered problems (even without 
the patch).
Specifically, when running commands from hbase shell wasn't able to exit the 
shell properly, therefore created the patch based on 0.98.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBASE-13071-v2.patch, 
 HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-26 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338438#comment-14338438
 ] 

Eshcar Hillel commented on HBASE-13071:
---

I've just attached the patch.
The default value for scanners is sync. This can be easily changed.

Addressing the issue raised by [~jonathan.lawlor]: 
The prefetch logic is the same for sync and async scanners. Therefore, async 
scanner stops RPCs if the max result size is exceeded. However, since the 
prefetch is executed in the background, it is possible that the size of the 
data inside the cache exceeds the max size set by the user (which cannot happen 
with sync scanner). 
There are ways to handle this, but this requires knowing the size of the data 
in the cache at any point and limiting the size of the data retrieved from the 
server with respect to this size. This may reduce the performance gain. 

I plan to attach a patch of the YCSB extension if anyone wants to re-run the 
experiments.


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
 Attachments: HBASE-13071-v1.patch, HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-23 Thread Jonathan Lawlor (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334019#comment-14334019
 ] 

Jonathan Lawlor commented on HBASE-13071:
-

Those performance improvements look great [~eshcar]! I agree with [~stack] that 
this configuration should be on by default.

I have been thinking about this asynchronous scanner a bit and was wondering 
how it deals with the case where the user has specified a maximum result size 
via Scan#setMaxResultSize():
* Does the prefetch mechanism attempt to prefetch a certain number of 
rows and loop until that number of rows is retrieved (i.e. does the prefetch 
only stop making prefetch RPCs once it has accumulated the expected number of 
rows)?
* If the prefetch does stop when the maxResultSize limit has been 
exceeded, when would the next prefetch be initiated? When next() is called and 
the cache is observed to have less than half the specified caching value? 
** If that's the case, I was thinking that it may be possible that a 
table with very large rows could potentially create memory issues on the client 
(and this may be a good reason to make this async behavior configurable rather 
that go all-in on async scanners). 
** The scenario that I have in mind is the case where the client is not 
able to hold X rows in memory where X is half of the specified scanner caching 
(i.e. the case where (half_caching_limit * size_of_row) exceeds the avalaible 
memory on the client). The synchronous scanners currently guard the client from 
out of memory exceptions by considering the remaining result size after each 
next RPC -- if the maxResultSize limit is exceeded the scanner stops making 
RPCs, regardless of how many rows the scanner has received from RPCs. However, 
in the case of asynch scanners (if they work how I am thinking they do) each 
call to next() may trigger a prefetch, and each prefetch would receive only Y 
rows, where Y  half_of_caching. The prefetches would be triggered on each 
call to next() until we have accumulated half_of_caching rows in our scanner 
cache and the client may OOM before that limit can be reached
*** In such a case, it may be best to instead use a synch scanner. 
Alternatively, we could keep track of the size of results in our cache and 
allow the user to specifiy maxSizeOfCache so that prefetches aren't continuosly 
fired off on calls to next() when scanning a table with large rows
*** This also raises the question about whether or not it would be 
worthwhile to have an asynchronous scanner that used the size of the Results 
(in memory) as the deciding factor for prefetching -- i.e. If the size of the 
results in the cache is less than half of maxResultSize then perform a prefetch 
for another chunk of maxResultSize worth of values... just a thought

My concern above is definitely a corner case issue, but I thought I'd raise it 
for discussion.

I like the idea that you pointed out towards the end about ramping the prefetch 
size up from a small initial size to the actual prefetch size. Looks like it 
would definitely help ease the initial latency jump seen for large prefetches. 

Really nice results :)

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 

[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-23 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334477#comment-14334477
 ] 

Lars Hofhansl commented on HBASE-13071:
---

Interesting. I had experimented with a blocking queue and a background threads 
a year ago or two and found that the synchronization overhead ate up most of 
the benefits (I did not build in any artificial delay, though).

Eager to see a patch :)

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-22 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332166#comment-14332166
 ] 

Eshcar Hillel commented on HBASE-13071:
---

Thanks all for your comments.

@stack, I wasn't aware of the discussions in the other Jiras, thanks for 
putting the links -- I am now updated.

@ Lars, the concurrent queue in the suggested modification is implemented as a 
LinkedBlockingQueue (which in addition to efficient put and get operations 
provides an efficient count operation). But we can discuss alternatives, 
including devising a dedicated data structure if it looks this can improve 
performance.

The suggested modification focuses on managing the concurrent queue at the 
client side, but still applies the pull model, where the client pulls the 
data from the server.
To support a true streaming, a push model, where the server is pushing the data 
to the client, might be better.
In both cases a concurrent queue is part of the solution. 

I am attaching some evaluation results.
Next step is to provide a patch for 0.98. 

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-22 Thread Eshcar Hillel (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332170#comment-14332170
 ] 

Eshcar Hillel commented on HBASE-13071:
---

This config setting can be easily removed. 
Our main concern was to allow backward compatibility for users, and 
specifically to maintain scan behavior, unless explicitly asked to use 
asynchronous scanner.
Since the asynchronous scanner uses a concurrent data structure which entails 
some overhead, in some cases -- like short scans -- the caller might prefer to 
use a sync scan.


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf, 
 HbaseStreamingScanEvaluation.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-20 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330005#comment-14330005
 ] 

stack commented on HBASE-13071:
---

[~eshcar] Design doc looks straightforward... no continental shifts going on 
here. One question for you is if you are 'up' on all that has gone on in this 
space previous? If not, let me dig it up for you (In particular, an experiment 
that added faux streaming via a hacked up new port on the server that allowed 
client via a new channel get a KV/Cell at a time with nice improvements in 
throughput)

On 'hbase.client.scanner.async.prefetch=true', let me suggest that default is 
that this feature is on, not off, by default. Why would you want current 
behavior if this is available. In fact, IMO, don't bother offering this config 
presuming the throughput is better after this change as I expect it wil be.





 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-20 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329342#comment-14329342
 ] 

Andrew Purtell commented on HBASE-13071:


This might also be the next incarnation of HBASE-8691

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-19 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328514#comment-14328514
 ] 

Lars Hofhansl commented on HBASE-13071:
---

Let's close one of these issues.

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-19 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328607#comment-14328607
 ] 

Lars Hofhansl commented on HBASE-13071:
---

There are many ways to do this:
# managing two buffers, one is filled by a background thread, the other used by 
the client thread, then switched.
# managing a queue on the client. The user thread polls from it, a background 
thread pushed data in as it gets it from the server. A blocking queue makes 
this simple, but comes with synchronization overhead.

In any event, unless we rewrite client and server to support true streaming, it 
means extra buffering of some form regardless of the implementation.


 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

2015-02-19 Thread Jonathan Lawlor (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327808#comment-14327808
 ] 

Jonathan Lawlor commented on HBASE-13071:
-

This sounds like a great feature. 

There is some discussion over in HBASE-11544 about the inefficiency of the way 
that the current (synchronous) scanners use the network (which lead to 
HBASE-12994) as well as discussion about how to move Scan RPC's into the realm 
of streaming. This seems like it would address both of those issues and should 
provide some nice performance gains. 

Looking forward to this

 Hbase Streaming Scan Feature
 

 Key: HBASE-13071
 URL: https://issues.apache.org/jira/browse/HBASE-13071
 Project: HBase
  Issue Type: New Feature
Reporter: Eshcar Hillel
 Attachments: HBaseStreamingScanDesign.pdf


 A scan operation iterates over all rows of a table or a subrange of the 
 table. The synchronous nature in which the data is served at the client side 
 hinders the speed the application traverses the data: it increases the 
 overall processing time, and may cause a great variance in the times the 
 application waits for the next piece of data.
 The scanner next() method at the client side invokes an RPC to the 
 regionserver and then stores the results in a cache. The application can 
 specify how many rows will be transmitted per RPC; by default this is set to 
 100 rows. 
 The cache can be considered as a producer-consumer queue, where the hbase 
 client pushes the data to the queue and the application consumes it. 
 Currently this queue is synchronous, i.e., blocking. More specifically, when 
 the application consumed all the data from the cache --- so the cache is 
 empty --- the hbase client retrieves additional data from the server and 
 re-fills the cache with new data. During this time the application is blocked.
 Under the assumption that the application processing time can be balanced by 
 the time it takes to retrieve the data, an asynchronous approach can reduce 
 the time the application is waiting for data.
 We attach a design document.
 We also have a patch that is based on a private branch, and some evaluation 
 results of this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)