[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553312#comment-14553312 ]

Hudson commented on HBASE-13071:

FAILURE: Integrated in HBase-TRUNK #6499 (See [https://builds.apache.org/job/HBase-TRUNK/6499/])
HBASE-13071 synchronous scanner -- cache size-in-bytes bug fix (stack: rev 7f2b33dbbf90474a8f73e4d38ea8f6817ee3dcdb)
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientAsyncPrefetchScanner.java

Hbase Streaming Scan Feature

Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Assignee: Eshcar Hillel
Fix For: 2.0.0
Attachments: 99.eshcar.png, HBASE-13071-0_98.patch, HBASE-13071-BRANCH-1.patch, HBASE-13071-trunk-bug-fix.patch, HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, HbaseStreamingScanEvaluationwithMultipleClients.pdf, Releasenote-13071.txt, gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, latency.delay.png, latency.png, network.png

A scan operation iterates over all rows of a table or a subrange of the table. The synchronous way in which the data is served at the client side limits the speed at which the application traverses the data: it increases the overall processing time, and may cause a great variance in the times the application waits for the next piece of data.

The scanner next() method at the client side invokes an RPC to the regionserver and then stores the results in a cache. The application can specify how many rows will be transmitted per RPC; by default this is set to 100 rows.

The cache can be considered as a producer-consumer queue, where the HBase client pushes the data to the queue and the application consumes it. Currently this queue is synchronous, i.e., blocking. More specifically, when the application has consumed all the data from the cache (so the cache is empty), the HBase client retrieves additional data from the server and refills the cache with new data. During this time the application is blocked.

Under the assumption that the application processing time can be balanced by the time it takes to retrieve the data, an asynchronous approach can reduce the time the application is waiting for data.

We attach a design document. We also have a patch that is based on a private branch, and some evaluation results of this code.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
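The producer-consumer view of the scanner cache in the issue description above can be illustrated with a small, self-contained sketch. This is not the ClientAsyncPrefetchScanner code from the patch: the class PrefetchSketch, the BatchSource interface (a stand-in for one scan RPC), the tableOf helper and the queue capacity of twice the caching value are all assumptions made for illustration. It only shows the general idea of a background thread refilling a bounded queue while the application keeps consuming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative only: a generic prefetching queue, not the HBase client code. */
final class PrefetchSketch {

  /** Hypothetical stand-in for one scan RPC: returns up to 'caching' rows, empty when done. */
  interface BatchSource {
    List<String> fetch(int caching);
  }

  /** End-of-scan marker, compared by identity so it can never collide with a real row. */
  private static final String END = new String("END");

  /** Consumes all rows while a background thread keeps the cache filled. */
  static List<String> scanWithPrefetch(BatchSource source, int caching) {
    // Bounded queue: the producer blocks when the cache is full (capacity is an assumption).
    BlockingQueue<String> cache = new LinkedBlockingQueue<>(2 * caching);

    Thread prefetcher = new Thread(() -> {
      try {
        while (true) {
          List<String> batch = source.fetch(caching); // the simulated RPC
          if (batch.isEmpty()) {
            break;
          }
          for (String row : batch) {
            cache.put(row);
          }
        }
        cache.put(END);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    prefetcher.setDaemon(true);
    prefetcher.start();

    List<String> consumed = new ArrayList<>();
    try {
      // The application blocks only if the prefetcher has fallen behind.
      for (String row = cache.take(); row != END; row = cache.take()) {
        consumed.add(row);
      }
      prefetcher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while scanning", e);
    }
    return consumed;
  }

  /** Hypothetical in-memory table with rows row-0 .. row-(total-1). */
  static BatchSource tableOf(int total) {
    int[] next = {0}; // touched only by the prefetcher thread
    return caching -> {
      List<String> batch = new ArrayList<>();
      while (batch.size() < caching && next[0] < total) {
        batch.add("row-" + next[0]++);
      }
      return batch;
    };
  }

  public static void main(String[] args) {
    List<String> rows = scanWithPrefetch(tableOf(250), 100);
    System.out.println(rows.size() + " rows, last = " + rows.get(rows.size() - 1));
  }
}
```

In the synchronous scheme the fetch would happen inside the consumer loop whenever the cache runs empty; here the fetch overlaps with consumption, which is the effect the issue is after when processing time and retrieval time are comparable.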
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548148#comment-14548148 ]

stack commented on HBASE-13071:

[~eshcar] Thanks for finding issue. Please open new issue. This one is dense enough already. Thank you (FYI, you do not need to clean up old patches -- thanks).
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547627#comment-14547627 ]

Eshcar Hillel commented on HBASE-13071:

Hi ~stack,

Attached 2 new patches for branch-1 and 0.98.

While preparing these patches I discovered that in the asynchronous scanner the cache byte-size variable is not updated in one of the places where an item is polled from the cache. Therefore I also attach a patch to fix this bug in trunk. It is a small local fix in ClientAsyncPrefetchScanner.java (this is already fixed in the patches for branch-1 and 0.98).

Will you be able to apply the patches? Also, do we need to open a new Jira for the refguide patch, or is it OK to post it here?

Thanks,
Eshcar
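The kind of bug described in the comment above (a byte-size counter that one code path forgets to update when it polls from the cache) can be sketched as follows. This is not the actual fix to ClientAsyncPrefetchScanner.java: the class SizedCacheSketch, its method names and the prefetchAllowed threshold check are hypothetical. The sketch only shows one common way to keep such a counter consistent, namely routing every removal through a single method that also adjusts the counter.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: a cache that tracks its own size in bytes. */
final class SizedCacheSketch {
  private final Queue<byte[]> cache = new ConcurrentLinkedQueue<>();
  private final AtomicLong cacheSizeInBytes = new AtomicLong();

  /** Producer side: add a result and account for its size. */
  void add(byte[] result) {
    cache.add(result);
    cacheSizeInBytes.addAndGet(result.length);
  }

  /**
   * Consumer side: the only way to remove an item, so the counter is always
   * decremented. Polling the queue directly elsewhere would leave the counter too high.
   */
  byte[] poll() {
    byte[] result = cache.poll();
    if (result != null) {
      cacheSizeInBytes.addAndGet(-result.length);
    }
    return result;
  }

  long sizeInBytes() {
    return cacheSizeInBytes.get();
  }

  /** Hypothetical policy: prefetch more only while the cache is below a byte limit. */
  boolean prefetchAllowed(long maxCacheBytes) {
    return sizeInBytes() < maxCacheBytes;
  }

  public static void main(String[] args) {
    SizedCacheSketch cache = new SizedCacheSketch();
    cache.add(new byte[10]);
    cache.add(new byte[5]);
    cache.poll();
    System.out.println("bytes still cached: " + cache.sizeInBytes());
  }
}
```

With a policy like prefetchAllowed, a counter that is never decremented on some path would make the cache look fuller than it is; whether that is the exact symptom in the trunk code is not stated in the thread.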
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543715#comment-14543715 ]

Edward Bortnikov commented on HBASE-13071:

We'll be happy to get guidance as per how to contribute to refguide.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544043#comment-14544043 ]

stack commented on HBASE-13071:

bq. We'll be happy to get guidance as per how to contribute to refguide.

Make a patch for the refguide -- it is at src/main/asciidoc/ -- in a new issue? You'll have to figure where you think it sits best (perf, scan?) Copy/paste your release note would make good candidate text.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541789#comment-14541789 ]

Edward Bortnikov commented on HBASE-13071:

Release note attached, please advise if some different format is expected. We are working on the blog - will complete next week, hopefully should not preclude commit.

Thanks [~stack] for volunteering to commit. Which release will this feature become candidate for - 1.1, 2.0, or both?
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541822#comment-14541822 ]

Hadoop QA commented on HBASE-13071:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12732546/Releasenote-13071.txt
against master branch at commit 220ac141bfcea7798faa5f73295ec61d8b173af9.
ATTACHMENT ID: 12732546

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+0 tests included{color}. The patch appears to be a documentation, build, or dev-support patch that doesn't require tests.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14035//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542396#comment-14542396 ]

Hudson commented on HBASE-13071:

SUCCESS: Integrated in HBase-TRUNK #6481 (See [https://builds.apache.org/job/HBase-TRUNK/6481/])
HBASE-13071 Hbase Streaming Scan Feature (stack: rev 86b91997d0590fcf00634e9e90216e77da607fd2)
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSimpleScanner.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/Scan.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientAsyncPrefetchScanner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestScannersFromClientSide.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestClientSmallScanner.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ReversedClientScanner.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSmallScanner.java
* hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestClientScanner.java
* hbase-client/src/test/java/org/apache/hadoop/hbase/client/TestClientSmallReversedScanner.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/TableConfiguration.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542120#comment-14542120 ]

stack commented on HBASE-13071:

Hopefully this will go into refguide when HBASE-13681 gets attention.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542093#comment-14542093 ]

stack commented on HBASE-13071:

Very nice release note. I took the contents and inserted them in the release note section in this JIRA (For future, see how when you hit 'edit', and if you scroll down, there is a 'release note' textbox). I added a sentence on the end about more load on server and YMMV.

Is there a place in the refguide where we should shove your release note? Just say and I will take care of it.

I committed to master, so 2.0. I tried the branch-1 patch but it failed apply. If you update it, I'll apply it to branch-1.

Thank you for the persistence.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538419#comment-14538419 ]

Eshcar Hillel commented on HBASE-13071:
---
This patch adds 2 checkstyle errors: (1) I forgot to remove a redundant import in ClientSimpleScanner, and (2) I added a line to the method loadCache() in ClientScanner, which pushed it over the method length limit (151 lines).

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_10.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, HBASE-13071_trunk_8.patch, HBASE-13071_trunk_9.patch, HBASE-13071_trunk_rebase_1.0.patch, HBASE-13071_trunk_rebase_2.0.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, HbaseStreamingScanEvaluationwithMultipleClients.pdf, gc.delay.png, gc.eshcar.png, gc.png, hits.delay.png, hits.eshcar.png, hits.png, latency.delay.png, latency.png, network.png

A scan operation iterates over all rows of a table or a subrange of the table. The synchronous way in which data is served at the client side limits the speed at which the application traverses the data: it increases the overall processing time, and may cause great variance in the time the application waits for the next piece of data.

The scanner next() method at the client side invokes an RPC to the regionserver and then stores the results in a cache. The application can specify how many rows will be transmitted per RPC; by default this is set to 100 rows. The cache can be considered a producer-consumer queue, where the hbase client pushes the data to the queue and the application consumes it. Currently this queue is synchronous, i.e., blocking. More specifically, when the application has consumed all the data from the cache --- so the cache is empty --- the hbase client retrieves additional data from the server and re-fills the cache with new data. During this time the application is blocked.

Under the assumption that the application processing time can be balanced by the time it takes to retrieve the data, an asynchronous approach can reduce the time the application is waiting for data.

We attach a design document. We also have a patch that is based on a private branch, and some evaluation results of this code.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
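The producer-consumer view of the scanner cache in the issue description can be illustrated with a minimal sketch. This is not the code from the patch: the class name `AsyncPrefetchSketch`, the `END` sentinel, and the use of plain integers in place of scan results are all assumptions made for illustration. A background thread stands in for the client issuing RPCs and filling a bounded queue, while the application thread drains it, so the application only blocks when the queue is actually empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only; names and structure are assumptions, not the HBase patch.
final class AsyncPrefetchSketch {
    // Sentinel marking the end of the scan; real rows here are non-negative.
    private static final Integer END = Integer.MIN_VALUE;

    // Scans rows 0..totalRows-1 in batches, prefetching in a background thread.
    static List<Integer> scanAll(int totalRows, int batchSize) {
        // Bounded queue sized for two batches (double buffering).
        BlockingQueue<Integer> cache = new LinkedBlockingQueue<>(2 * batchSize + 1);

        Thread prefetcher = new Thread(() -> {
            try {
                for (int start = 0; start < totalRows; start += batchSize) {
                    int end = Math.min(start + batchSize, totalRows);
                    // Each inner loop stands in for the results of one RPC.
                    for (int row = start; row < end; row++) {
                        cache.put(row); // blocks only when the cache is full
                    }
                }
                cache.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        prefetcher.setDaemon(true);
        prefetcher.start();

        List<Integer> consumed = new ArrayList<>();
        try {
            // The application blocks in take() only when the cache is empty.
            for (Integer row = cache.take(); !row.equals(END); row = cache.take()) {
                consumed.add(row);
            }
            prefetcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("scan interrupted", e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(scanAll(250, 100).size());
    }
}
```

In the synchronous scheme the fetch and the consumption would alternate on one thread; here they overlap, at the cost of holding up to roughly two batches in memory.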
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538227#comment-14538227 ]

Edward Bortnikov commented on HBASE-13071:
--
Thanks [~stack]. We'll post release notes to the jira tomorrow (is this the right destination?), and a blog post a tad later (probably, early next week), including the perf results.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537523#comment-14537523 ]

stack commented on HBASE-13071:
---
+1 on this last patch. At a minimum, it's a nice illustration of what is possible. I'll commit in a day or so. Anyone else want to have a look?

A few questions, [~eshcar]. Do the changes in table-scoped configuration -- the changes in TableConfiguration -- make sense? Having Scan defaults -- a client-side op -- in the Configuration seems a little overbroad. I see no harm done since it is off by default.

Is the checkstyle error from your report? No harm, I can check on commit so don't worry about it.

I suggest you write up a fat release note. The release note is probably how folks will learn of this feature (unless you do a blog post or something -- which might make sense since you have those nice perf findings -- have you redone them for this patch that is now size-based?). If you have done the size-based perf analysis, suggest you link to that in the release notes too.

Nice work.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537161#comment-14537161 ]

Hadoop QA commented on HBASE-13071:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12731786/HBASE-13071_trunk_rebase_2.0.patch
against master branch at commit 5a2ca43fa16a95d8db67e5a3d8b48e4d3f3a9aeb.
ATTACHMENT ID: 12731786

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 16 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1898 checkstyle errors (more than the master's current 1896 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13996//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13996//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13996//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13996//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537322#comment-14537322 ]

Edward Bortnikov commented on HBASE-13071:
--
Community - please review the last patch (covers all the previous requests).
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531326#comment-14531326 ]

Eshcar Hillel commented on HBASE-13071:
---
Aligning with the size-in-bytes basis for scan requests, here is a snippet of the code in ClientAsyncPrefetchScanner that sets the cache capacity and determines whether or not to invoke a prefetch when next() is called:

{code}
// double buffer - double cache size
private int calcCacheCapacity() {
  int capacity = Integer.MAX_VALUE;
  // use the row-count setting when doubling it cannot overflow an int
  if (caching >= 0 && caching < (Integer.MAX_VALUE / 2)) {
    capacity = caching * 2 + 1;
  }
  // otherwise derive the capacity from the size-in-bytes limit
  if (capacity == Integer.MAX_VALUE) {
    capacity = (int) (maxScannerResultSize / ESTIMATED_SINGLE_RESULT_SIZE);
  }
  return capacity;
}

// prefetch only while the cache is below half of both limits
private boolean prefetchCondition() {
  return (getCacheCount() < getCountThreshold())
      && (getCacheSizeInBytes() < getSizeThreshold());
}

private int getCountThreshold() {
  return cacheCapacity / 2;
}

private long getSizeThreshold() {
  return maxScannerResultSize / 2;
}
{code}

where cacheSizeInBytes is an AtomicInteger that is updated whenever the cache changes (increased when adding results to the cache, decreased when removing them).
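To make the thresholds in the snippet above concrete, here is a standalone sketch that can be run outside HBase. It is an approximation for illustration, not the committed code: the class name `PrefetchThresholdSketch`, the value chosen for `ESTIMATED_SINGLE_RESULT_SIZE`, the assumption that the bounds are `>=` and `<`, and the choice to pass state in as parameters rather than read fields are all assumptions.

```java
// Illustrative sketch only; constants and signatures are assumptions, not the HBase patch.
final class PrefetchThresholdSketch {
    // Hypothetical estimate of one result's size in bytes (value chosen for the example).
    static final long ESTIMATED_SINGLE_RESULT_SIZE = 1024L;

    // Double buffering: capacity for two batches of `caching` rows when that fits in an int,
    // otherwise a capacity derived from the size-in-bytes limit.
    static int calcCacheCapacity(int caching, long maxScannerResultSize) {
        int capacity = Integer.MAX_VALUE;
        if (caching >= 0 && caching < (Integer.MAX_VALUE / 2)) {
            capacity = caching * 2 + 1;
        }
        if (capacity == Integer.MAX_VALUE) {
            capacity = (int) (maxScannerResultSize / ESTIMATED_SINGLE_RESULT_SIZE);
        }
        return capacity;
    }

    // Prefetch only while the cache is below half of both the row-count and byte limits.
    static boolean prefetchCondition(int cacheCount, long cacheSizeInBytes,
                                     int cacheCapacity, long maxScannerResultSize) {
        int countThreshold = cacheCapacity / 2;
        long sizeThreshold = maxScannerResultSize / 2;
        return cacheCount < countThreshold && cacheSizeInBytes < sizeThreshold;
    }

    public static void main(String[] args) {
        long twoMb = 2L * 1024 * 1024;
        int capacity = calcCacheCapacity(100, twoMb);
        System.out.println("capacity = " + capacity);
        System.out.println("prefetch at 50 rows? " + prefetchCondition(50, 1000L, capacity, twoMb));
        System.out.println("prefetch at 100 rows? " + prefetchCondition(100, 1000L, capacity, twoMb));
    }
}
```

With caching set to 100 rows the capacity is 201, so the count threshold is 100: a cache holding 50 rows triggers a prefetch, while one holding 100 rows does not.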
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526155#comment-14526155 ]

stack commented on HBASE-13071:
---
bq. 1. No problem having a per-scan parameter. The assumption is that scans should be big in order for the feature to be efficient.

Good. Can say in javadoc that a scan needs to be big to get the benefit.

bq. 2. No problem moving to the size-in-bytes parameter. The API should be identical for synchronous and asynchronous clients.

Good. Bytes is what we do now rather than rows, since this work https://blogs.apache.org/hbase/

bq. In the optimistic interpretation, the client would directly relay the API parameter to the server.

What parameter, and why go to the server? What's wrong w/ optimistic other than the client carrying extra data? I'd say go for optimistic. I'd be fine that it 'costs' more on the server as long as there is tangible benefit.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525825#comment-14525825 ]

Edward Bortnikov commented on HBASE-13071:
--
1. No problem having a per-scan parameter. The assumption is that scans should be big in order for the feature to be efficient.
2. No problem moving to the size-in-bytes parameter. The API should be identical for synchronous and asynchronous clients.

Let's agree on the upper-bound parameter semantics (whether rows or bytes). Should it be conservative or optimistic?

In the optimistic interpretation, the client would directly relay the API parameter to the server. A new prefetch request is issued when 50% of the old buffer has been consumed, so when the new buffer arrives the old one might not be released yet. This overlap should be short, but the bound semantics are soft (best-effort).

In the conservative interpretation, the client would adapt the API parameters and issue requests for less data, to prevent any overflow.

For legacy scans there was no difference, because the prefetch and computation parts did not overlap. Which approach would be better?
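The difference between the two interpretations can be shown with a small worked calculation. The model below is an assumption made for illustration, not something stated in the thread: it supposes that a prefetch fires when the cache drops to half of the configured limit, that the optimistic client requests the full limit per RPC, and that the conservative client requests half of it. The class and method names are hypothetical.

```java
// Illustrative model only; the request sizes for each interpretation are assumptions.
final class BoundSemanticsSketch {
    // Upper bound on client-side occupancy: what may still be cached when the
    // prefetch fires (half the limit) plus one full response of requestBytes.
    static long worstCaseOccupancy(long limitBytes, long requestBytes) {
        long prefetchThreshold = limitBytes / 2;
        return prefetchThreshold + requestBytes;
    }

    // Optimistic: relay the configured limit to the server unchanged.
    static long optimisticRequest(long limitBytes) {
        return limitBytes;
    }

    // Conservative: ask for less so that old and new buffers together fit the limit.
    static long conservativeRequest(long limitBytes) {
        return limitBytes / 2;
    }

    public static void main(String[] args) {
        long limit = 2L * 1024 * 1024; // example limit of 2 MB
        System.out.println("optimistic bound:   "
            + worstCaseOccupancy(limit, optimisticRequest(limit)));
        System.out.println("conservative bound: "
            + worstCaseOccupancy(limit, conservativeRequest(limit)));
    }
}
```

Under these assumptions a 2 MB limit gives an optimistic bound of 3 MB (the soft, best-effort case) and a conservative bound of exactly 2 MB, at the price of more, smaller RPCs.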
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523816#comment-14523816 ]

stack commented on HBASE-13071:
---
bq. I'd suggest leaving the use of this feature manual rather than expecting the system to auto-tune.

Ok. But you would have to turn it on globally for the client, right? You can't do it on a per-scan basis. How hard would it be to enable this facility on a per-scan basis? It would make it easier to commit this feature if it was not a choice between being globally on or off.

This patch also does sizing using the (row) caching count. Caching is going to go away as a first-class attribute of Scan in hbase 1.1+ as we have moved to a size-in-bytes basis for our scan requests; size-in-bytes would make more sense for sizing the client-side cache too, I'd say. Any plans for moving off the row caching basis?

Thanks.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14513647#comment-14513647 ]

Edward Bortnikov commented on HBASE-13071:
--
I'd suggest leaving the use of this feature manual rather than expecting the system to auto-tune. It is often hard to know whether the application requires aggressive caching at the client side. For example, consider an application that does some tricky aggregation of the scanned data, in which the compute part is considerable. There is no way for HBase to know that in advance. The optimization does not come for free (up to 2x caching at the client side), so IMHO it's up to the application to decide whether to use it.

Dear community - could you please review and vote on the last patch before it becomes obsolete again? The JIRA is still not assigned to any committer.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503549#comment-14503549 ]

Eshcar Hillel commented on HBASE-13071:

Rebase done. Thanks to HBASE-13090, the next and loadCache methods are now separate, so this rebase wasn't too painful (thanks [~jonathan.lawlor]). I also changed some of the new scanner tests to account for the change in the scanner cache interface (it is now a Queue).
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503897#comment-14503897 ]

Hadoop QA commented on HBASE-13071:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12726649/HBASE-13071_trunk_rebase_1.0.patch
against master branch at commit 702aea5b38ed6ad0942b0c59c3accca476b46873.
ATTACHMENT ID: 12726649

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 16 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 protoc{color}. The applied patch does not increase the total number of protoc compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1902 checkstyle errors (more than the master's current 1898 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13744//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13744//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13744//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13744//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497714#comment-14497714 ]

Eshcar Hillel commented on HBASE-13071:

ClientScanner is an abstract class that holds the code shared by the sync and async scanner classes, such as the prefetch method. #prefetch does not replace #next; it is invoked from #next in ClientSimpleScanner (the sync scanner), thereby preserving the same sync behavior as before. In ClientAsyncPrefetchScanner, the prefetch method is invoked in the run method of a background thread when the buffer at the client side is half full. I hope this makes sense.
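The split described in the comment above (one scanner that fetches from inside next(), another that refills from a background thread once the client-side buffer is down to half) can be illustrated with a small standalone sketch. This is not the code from the patch: RowSource, SyncSketch, AsyncSketch and the counting() helper are hypothetical names invented for the example, the "RPC" is a plain in-memory function, and the exact refill threshold and buffer bound are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: none of these types are HBase classes.
final class PrefetchSketch {
    /** Stand-in for the scan RPC: returns up to maxRows rows, or an empty list at end of scan. */
    interface RowSource { List<String> fetch(int maxRows); }

    /** next() returns null once the scan is exhausted. */
    interface RowScanner { String next(); }

    /** Blocking variant: next() performs the fetch itself whenever the cache is empty. */
    static final class SyncSketch implements RowScanner {
        private final RowSource source;
        private final int caching;
        private final Deque<String> cache = new ArrayDeque<>();
        private boolean exhausted;

        SyncSketch(RowSource source, int caching) { this.source = source; this.caching = caching; }

        public String next() {
            if (cache.isEmpty() && !exhausted) {
                List<String> batch = source.fetch(caching); // the application waits here
                exhausted = batch.isEmpty();
                cache.addAll(batch);
            }
            return cache.poll();
        }
    }

    /** Prefetching variant: a background thread refills once the cache drains to half of caching. */
    static final class AsyncSketch implements RowScanner {
        private final RowSource source;
        private final int caching;
        private final Deque<String> cache = new ArrayDeque<>();
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notEmpty = lock.newCondition();
        private final Condition needsRefill = lock.newCondition();
        private boolean exhausted;

        AsyncSketch(RowSource source, int caching) {
            this.source = source;
            this.caching = caching;
            Thread prefetcher = new Thread(this::prefetchLoop, "prefetcher");
            prefetcher.setDaemon(true); // do not keep the JVM alive if the scan is abandoned
            prefetcher.start();
        }

        private void prefetchLoop() {
            while (true) {
                lock.lock();
                try {
                    while (cache.size() > caching / 2) needsRefill.awaitUninterruptibly();
                } finally {
                    lock.unlock();
                }
                List<String> batch = source.fetch(caching); // fetch without holding the lock
                lock.lock();
                try {
                    if (batch.isEmpty()) exhausted = true;
                    cache.addAll(batch);
                    notEmpty.signalAll();
                    if (exhausted) return;
                } finally {
                    lock.unlock();
                }
            }
        }

        public String next() {
            lock.lock();
            try {
                while (cache.isEmpty() && !exhausted) notEmpty.awaitUninterruptibly();
                String row = cache.poll();
                if (cache.size() <= caching / 2) needsRefill.signal(); // buffer at most half full
                return row;
            } finally {
                lock.unlock();
            }
        }
    }

    /** Test source that serves rows "row-0" .. "row-(total-1)" in batches. */
    static RowSource counting(int total) {
        int[] sent = {0};
        return maxRows -> {
            List<String> batch = new ArrayList<>();
            while (batch.size() < maxRows && sent[0] < total) batch.add("row-" + sent[0]++);
            return batch;
        };
    }

    static List<String> drain(RowScanner scanner) {
        List<String> rows = new ArrayList<>();
        for (String row = scanner.next(); row != null; row = scanner.next()) rows.add(row);
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(drain(new SyncSketch(counting(10), 4)));
        System.out.println(drain(new AsyncSketch(counting(10), 4)));
    }
}
```

Both variants return the same rows in the same order; the difference is only in who waits for the fetch. In this sketch the async buffer can briefly hold about 1.5x the caching value, which is in line with the extra client-side memory mentioned elsewhere in the thread, though the actual bound in the patch may differ.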
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498201#comment-14498201 ]

stack commented on HBASE-13071:

[~eshcar] I was talking about the patch. Your patch no longer applies. Trunk has changed (the bit that does not apply is the overwrite of next by prefetch...). Sorry I was not clear. Would you mind rebasing your patch? Thank you. Pardon my letting it rot.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496916#comment-14496916 ]

stack commented on HBASE-13071:

Pardon me [~eshcar] but the patch has rotted. I can't make sense of what is supposed to be happening in ClientScanner where we remove #next and replace it with #prefetch. Help me out. Thanks.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14494748#comment-14494748 ]

Eshcar Hillel commented on HBASE-13071:

I looked into the PerformanceEvaluation tool; the code is easy to read and maintain. I believe the following changes are required in the implementation of testRow() in ScanTest:
* set caching to 100 (or even to DEFAULT_HBASE_CLIENT_SCANNER_CACHING) instead of 30
* add timeout before calling testScanner.next() [I think you already added this one]
* make sure setFilter(FilterAllFilter) is not invoked

Optionally, add a scanRange10 class to do really big scans.

[~stack], do you by any chance have the results of the client latency distribution collected by the tool in your previous experiments?

BTW, 30 is not the default value for the prefetch size. DEFAULT_HBASE_CLIENT_SCANNER_CACHING is set to 100 in 0.98 and to Integer.MAX_VALUE in master.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491898#comment-14491898 ]

stack commented on HBASE-13071:

bq. the scans should be meaty, with large prefetches (we used 100-1000 records), and the per-record processing at the client side should be non-negligible

What would you suggest then [~ebortnik] and [~eshcar]? Defaults in hbase are 30 rows at a time, not 1000. Would it make sense if this facility could be turned on by enabling a property on a Scan object?

bq. We are not familiar with the PerformanceEvaluation tool

Np. It is a coarse tool we've been using since early days to run loadings on hbase. See bin/hbase pe

bq. Re/ auto-tuning, I believe this is a bit premature. Let's keep the code simple, and let the client control. The optimization does not necessarily need to be a default.

I suggest auto-tune so the feature is useful more often than not. Regarding it not needing to be the default, it would be cool if the user didn't have to go figure out an opaque option to get this benefit.

Let me try and repro the benefit seen in the posted graphs. Thanks.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487621#comment-14487621 ]

Edward Bortnikov commented on HBASE-13071:

Chiming in ... The discussion is becoming loaded, so let me summarize up to this point, so that we can figure out what's missing. Apologies if this duplicates what was said before or sounds obvious.

The feature is 100% client-side. The metrics we've been measuring are client-side as well. Ycsb is the workload generator; [~eshcar] provided the source. The network and server hardware are pretty much standard.

In order for the optimization results to be observable, the scans should be meaty, with large prefetches (we used 100-1000 records), and the per-record processing at the client side should be non-negligible. In this context, it makes sense to mask the network delay by prefetching in the background.

We are not familiar with the PerformanceEvaluation tool. Does it measure server-side metrics? If so, it can definitely happen that the server side is more congested (and consequently, a bit slower) because many clients move faster. Still, eliminating the stop-and-wait pattern significantly boosts the client throughput metrics, as our results suggest. We did not measure network congestion, but it's hard to believe that the 1G backbone gets congested in this context.

Re/ auto-tuning, I believe this is a bit premature. Let's keep the code simple, and let the client control. The optimization does not necessarily need to be a default. Thanks.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482738#comment-14482738 ]

Eshcar Hillel commented on HBASE-13071:

Thanks [~stack] for running these rig tests. I believe the right way to see the benefit of this feature is to measure the scan.next() latency at the client side; there you should see the latency going down as you increase the delays.

Obviously, an async scanner puts more pressure on the server since it asks for records at a higher rate. Since you are already stress testing the server with 50 clients (heavy scanners), it could be that the extra pressure the async clients put on the server pushes it beyond its peak point.

Other than that, what is the prefetch size you are using? I assume it is less than 100. The async scanner has maximum gain when the client-side processing time (i.e., the delays) equals the server-side I/O time plus the network delays. If the prefetch size is too small, the network delays are more pronounced, and therefore the delays should be longer.

Finally, [~stack] could you please share the client code you use for your tests, either via this Jira or by sending it directly to me, so I can take a closer look and try it out myself.
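The claim in the comment above, that the gain peaks when client-side processing time equals server I/O plus network time, follows from a simple overlap argument. The sketch below is an idealised model and not a measurement from this thread: it assumes the fetch of the next batch overlaps perfectly with processing of the current one, that both times are positive, and it ignores the extra server load that async clients generate. The class and method names are hypothetical.

```java
// Idealised model only; OverlapModel and its methods are made up for this example.
final class OverlapModel {
    /** Time per batch when the application waits for each fetch (stop-and-wait). */
    static double syncMillis(double fetchMs, double processMs) {
        return fetchMs + processMs;
    }

    /** Time per batch when the next fetch fully overlaps with processing of the current batch. */
    static double asyncMillis(double fetchMs, double processMs) {
        return Math.max(fetchMs, processMs);
    }

    /** Ratio of the two; assumes fetchMs and processMs are both positive. */
    static double speedup(double fetchMs, double processMs) {
        return syncMillis(fetchMs, processMs) / asyncMillis(fetchMs, processMs);
    }

    public static void main(String[] args) {
        double fetchMs = 10.0; // assumed server I/O + network time per batch
        for (double processMs : new double[] {1.0, 5.0, 10.0, 20.0, 100.0}) {
            System.out.printf("process=%.0f ms -> speedup %.2fx%n",
                    processMs, speedup(fetchMs, processMs));
        }
    }
}
```

Under these assumptions the ratio (fetch + process) / max(fetch, process) is largest, 2x, when the two times are equal, and approaches 1x when either one dominates. That is consistent with the observation that scans with negligible per-record processing show little benefit.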
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483600#comment-14483600 ]

stack commented on HBASE-13071:

bq. Thanks stack for running this rig tests.

It is my pleasure.

bq. I believe the right way to see the benefit of this feature is to measure the scan.next() latency at the client side, there you should see the latency going down as you increase the delays.

Let me do this. I will do it with a single process of ten clients only so the server is not near capacity. I am using the default of 30. I will up it. I am using the PerformanceEvaluation tool with the scan1000 option. Above I describe the dataset I am scanning.

So, [~eshcar], it would seem that this feature would need to be self-tuning to add general benefit, given that the size of the prefetch, client processing time, and other factors all hinder its ability to shine?
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381587#comment-14381587 ]

Edward Bortnikov commented on HBASE-13071:

+1 on this feature
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368689#comment-14368689 ]

Edward Bortnikov commented on HBASE-13071:

I second [~eshcar]. This is not a huge feature, and everybody seems to benefit. If there is anything else we should do about the code review - let's do it, and race to commit :)

Thanks,
Edward
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367141#comment-14367141 ]

Hadoop QA commented on HBASE-13071:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12705332/HBASE-13071_trunk_10.patch
against master branch at commit f9a17edc252a88c5a1a2c7764e3f9f65623e0ced.
ATTACHMENT ID: 12705332

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13294//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367166#comment-14367166 ]

Eshcar Hillel commented on HBASE-13071:

Hi everyone,
What would be the next thing to do to get this patch in (now that all the lights are green ;) )?

Thanks,
Eshcar
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366767#comment-14366767 ]

Eshcar Hillel commented on HBASE-13071:

Yes, it's all about setting the delays, but I don't want to change them to make the results look better. They are there just to make the point.

From: Edward Bortnikov (JIRA) j...@apache.org
To: esh...@yahoo-inc.com
Sent: Monday, March 16, 2015 7:52 AM
Subject: [jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature

[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362777#comment-14362777 ]

Edward Bortnikov commented on HBASE-13071:

Eshcar,
Do you have an idea why there are still steps in the async graph? This probably means that our delays are not long enough.
Eddie

On Monday, March 16, 2015 1:14 AM, Eshcar Hillel (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eshcar Hillel updated HBASE-13071:
Attachment: HBASE-13071_trunk_10.patch
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362424#comment-14362424 ]

Hadoop QA commented on HBASE-13071:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12704663/HBASE-13071_trunk_9.patch
against master branch at commit 01bc979ea29e9282786de13c1cb8cbc107e92e9f.
ATTACHMENT ID: 12704663

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1918 checkstyle errors (more than the master's current 1917 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13254//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362606#comment-14362606 ]

Eshcar Hillel commented on HBASE-13071:

A new patch is attached. I also attached the evaluation results for multiple parallel scanners. Bottom line: on the client side, the results show latency improvement trends for multiple async scanners similar to those for a single scanner thread.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362659#comment-14362659 ]

Ted Yu commented on HBASE-13071:

Results shown in the pdf are impressive.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362649#comment-14362649 ]

Hadoop QA commented on HBASE-13071:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12704698/HBASE-13071_trunk_10.patch
against master branch at commit 0505b7941e175d86004daf9a31ef5ce240d4570f.
ATTACHMENT ID: 12704698

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at org.apache.hadoop.hbase.client.TestHTableMultiplexerFlushCache.testOnRegionChange(TestHTableMultiplexerFlushCache.java:114)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13256//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362777#comment-14362777 ]

Edward Bortnikov commented on HBASE-13071:

Eshcar,
Do you have an idea why there are still steps in the async graph? This probably means that our delays are not long enough.
Eddie

On Monday, March 16, 2015 1:14 AM, Eshcar Hillel (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eshcar Hillel updated HBASE-13071:
Attachment: HBASE-13071_trunk_10.patch
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357070#comment-14357070 ]

Hadoop QA commented on HBASE-13071:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12703903/HBASE-13071_trunk_8.patch
against master branch at commit e66dca6cd1fd91bfa65a7cd4c68acb7a7f6a6c4e.
ATTACHMENT ID: 12703903

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 checkstyle. The applied patch generated 1926 checkstyle errors (more than the master's current 1924 errors).
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn site goal to fail.
-1 core tests.
The patch failed these unit tests: Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13185//console This message is automatically generated. 
Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, HBASE-13071_trunk_8.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356691#comment-14356691 ]

Hadoop QA commented on HBASE-13071:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12703862/HBASE-13071_trunk_7.patch
against master branch at commit b436db7d70c8a90b0167dc0e0120f503efb37e3c.
ATTACHMENT ID: 12703862

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 checkstyle. The applied patch generated 1926 checkstyle errors (more than the master's current 1924 errors).
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn site goal to fail.
-1 core tests.
The patch failed these unit tests: org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.mapreduce.TestImportTSVWithTTLs org.apache.hadoop.hbase.client.TestHCM org.apache.hadoop.hbase.mapreduce.TestImportTSVWithOperationAttributes org.apache.hadoop.hbase.security.access.TestScanEarlyTermination org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas org.apache.hadoop.hbase.regionserver.TestFSErrorsExposed org.apache.hadoop.hbase.client.TestAdmin1 Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13177//console This message is automatically generated. Hbase Streaming Scan Feature Key: HBASE-13071 URL: https://issues.apache.org/jira/browse/HBASE-13071 Project: HBase Issue Type: New Feature Reporter: Eshcar Hillel Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBASE-13071_trunk_7.patch, HBaseStreamingScanDesign.pdf,
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355648#comment-14355648 ]

Eshcar Hillel commented on HBASE-13071:

I think the best way to test this patch is to use the extended version of YCSB, which supports measuring multi-step operations like scans (see the link to the code - I added the code in a separate branch). The attached evaluation file describes the settings I used in my test.

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355815#comment-14355815 ]

Hadoop QA commented on HBASE-13071:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12703740/HBASE-13071_trunk_6.patch
against master branch at commit ed93ddd94f6264ca246477bece4bf2c895706a22.
ATTACHMENT ID: 12703740

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 checkstyle. The applied patch generated 1926 checkstyle errors (more than the master's current 1924 errors).
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn site goal to fail.
-1 core tests.
The patch failed these unit tests: org.apache.hadoop.hbase.procedure.TestProcedureManager Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13165//console This message is 
automatically generated.

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBASE-13071_trunk_6.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353865#comment-14353865 ]

Eshcar Hillel commented on HBASE-13071:

A new patch is attached following the comments by [~jonathan.lawlor] and [~stack].

Some notes on implementation and design:
* The default value is now set to async. (By the way, this means the async scanner is used in multiple tests that used to have a sync scan.)
* The responsibility to invoke super.close() is now shifted to the pending prefetch thread, so it is not missed.
* In the case of the sync scanner, the caching parameter indicates both the size of the buffer and the chunk size (#rows fetched). In the case of the async scanner, the parameter only indicates the latter, while the buffer size is doubled. This should now be clear from the documentation, as well as from the new methods getCacheCapacity() and getThresholdSize().
* cache and caching were members of ClientScanner even before this patch. I only added the abstract initCache() method. I agree that having two abstract classes is not the cleanest solution, but neither is having initCache() in a class where not all subclasses have a cache. As I said before, this hierarchy can benefit from some refactoring (the right design might use composition, as in the strategy pattern, instead of inheritance, but these decisions should not be in the scope of the current Jira).

Some notes on performance:
* This feature is a client-side feature and therefore should be tested in terms of client-side latency.
* This feature should reduce the latency, and in the worst-case scenario should not increase it (at least not significantly).
* On the server side I would expect the same behavior as with the sync scanner, since the same RPC calls are invoked; they only shift earlier in time to have the data ready at the client side before the user needs it.
* I cannot explain the behavior of the low humps in your test. Do you see this consistently? What is the exact setting? Is it a fixed number of scans or a fixed time?

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
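The buffer sizing described in the implementation notes above (the caching parameter as chunk size, with a buffer of twice that size) can be illustrated with a small stand-alone sketch. This is not the code from the patch: the class PrefetchingBuffer, the fetcher callback that stands in for the scan RPC, the buffered() helper, and the rule "start a prefetch when at most getThresholdSize() rows remain" are all assumptions made for illustration. Only the method names getCacheCapacity() and getThresholdSize() come from the comment, and their exact semantics in the patch may differ.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.IntFunction;

// Hypothetical sketch of a prefetching scan buffer; NOT the HBase implementation.
final class PrefetchingBuffer<T> {
    private final int caching;                  // chunk size: rows requested per fetch
    private final IntFunction<List<T>> fetcher; // stand-in for the scan RPC; empty list = end of scan
    private final LinkedBlockingQueue<T> cache = new LinkedBlockingQueue<>();
    private final ExecutorService prefetcher = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "prefetch");
        t.setDaemon(true);                      // do not keep the JVM alive
        return t;
    });
    private volatile boolean exhausted = false; // set by the prefetch thread
    private Future<?> pending;                  // at most one fetch in flight

    PrefetchingBuffer(int caching, IntFunction<List<T>> fetcher) {
        this.caching = caching;
        this.fetcher = fetcher;
    }

    int getCacheCapacity() { return 2 * caching; } // buffer is double the chunk size
    int getThresholdSize() { return caching; }     // assumed trigger point for a prefetch
    int buffered() { return cache.size(); }

    /** Returns the next row, or null once the scan is exhausted. */
    T next() {
        try {
            if (cache.isEmpty() && !exhausted) {
                // Nothing buffered: block on a fetch, exactly like a sync scan.
                if (pending == null) schedulePrefetch();
                pending.get();
                pending = null;
            }
            T row = cache.poll();
            if (pending != null && pending.isDone()) {
                pending.get();                  // surfaces a failed fetch
                pending = null;
            }
            // With at most `caching` rows buffered, one more chunk cannot exceed the capacity.
            if (pending == null && !exhausted && cache.size() <= getThresholdSize()) {
                schedulePrefetch();
            }
            return row;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("prefetch failed", e);
        }
    }

    private void schedulePrefetch() {
        pending = prefetcher.submit(() -> {
            List<T> chunk = fetcher.apply(caching);
            if (chunk.isEmpty()) exhausted = true; else cache.addAll(chunk);
        });
    }

    void close() { prefetcher.shutdownNow(); }
}
```

In this sketch the application thread only blocks when the buffer is completely empty; otherwise the next chunk is fetched in the background while buffered rows are consumed.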
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354312#comment-14354312 ]

stack commented on HBASE-13071:

[~eshcar] How should I test so your patch shines?

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBASE-13071_trunk_5.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354065#comment-14354065 ]

Hadoop QA commented on HBASE-13071:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12703536/HBASE-13071_trunk_5.patch
against master branch at commit 5025d3aa91d18310fc4d738114ee2b58e48c46c2.
ATTACHMENT ID: 12703536

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop versions. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 checkstyle. The applied patch generated 1929 checkstyle errors (more than the master's current 1927 errors).
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 lineLengths. The patch does not introduce lines longer than 100
-1 site. The patch appears to cause mvn site goal to fail.
-1 core tests.
The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestImportTSVWithTTLs org.apache.hadoop.hbase.regionserver.TestFSErrorsExposed org.apache.hadoop.hbase.mapreduce.TestImportTsv org.apache.hadoop.hbase.client.TestHCM org.apache.hadoop.hbase.security.access.TestScanEarlyTermination org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient org.apache.hadoop.hbase.client.TestAdmin1 org.apache.hadoop.hbase.client.TestFromClientSide org.apache.hadoop.hbase.mapreduce.TestImportTSVWithOperationAttributes {color:red}-1 core zombie tests{color}. There are 1 zombie test(s): at org.apache.hadoop.hbase.client.TestHTableMultiplexerFlushCache.testOnRegionChange(TestHTableMultiplexerFlushCache.java:114) Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13148//console This message is automatically generated. Hbase Streaming Scan Feature Key: HBASE-13071 URL: https://issues.apache.org/jira/browse/HBASE-13071 Project: HBase Issue Type: New Feature Affects Versions: 0.98.11
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348984#comment-14348984 ]

stack commented on HBASE-13071:

> Makes sense?

Yes. Thank you. The test 'reads' the result but does no processing, true. Yes, 30 per batch. What test would you like me to run that makes this feature shine? Thanks [~eshcar]

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348304#comment-14348304 ]

Eshcar Hillel commented on HBASE-13071:

I will work on a new version following the comments above (this will take a few days).

[~stack] I will get back with a full answer to your questions; first I want to do some additional perf tests on my side. The tall humps may be rooted in the way you performed the tests. What is the size of the prefetch? 30? If the tests simply call next() in a loop without actually processing the data (which is simulated with delays in my tests), then the user exhausts the cache very quickly even though the prefetch is done in the background, and therefore the behavior is equivalent to a sync scan, where the app needs to wait for the current prefetch to complete. The app doesn't need to wait for the prefetch thread to finish loading the cache at the client side, but this is minor compared to the round-trip time at the server side. As I mentioned before, the assumption underlying this new feature is that the processing time at the client side can be balanced by the network and IO at the server side. If the processing is short then the network+IO is still a bottleneck. Makes sense?

Hbase Streaming Scan Feature
Key: HBASE-13071
URL: https://issues.apache.org/jira/browse/HBASE-13071
Project: HBase
Issue Type: New Feature
Affects Versions: 0.98.11
Reporter: Eshcar Hillel
Attachments: 99.eshcar.png, HBASE-13071_98_1.patch, HBASE-13071_trunk_1.patch, HBASE-13071_trunk_2.patch, HBASE-13071_trunk_3.patch, HBASE-13071_trunk_4.patch, HBaseStreamingScanDesign.pdf, HbaseStreamingScanEvaluation.pdf, gc.eshcar.png, hits.eshcar.png, network.png
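The argument in the comment above, that prefetching only helps when client-side processing time is comparable to the fetch time, can be made concrete with a back-of-the-envelope model. The sketch below is an idealization under stated assumptions, not a measurement: ScanTimingModel and its parameters are hypothetical, every batch is assumed to take the same fetch time and the same processing time, and the prefetch is assumed to overlap perfectly with processing.

```java
// Idealized timing model; all names and numbers are illustrative assumptions.
final class ScanTimingModel {

    /** Sync scan: each batch is fetched only after the previous one was fully consumed. */
    static long syncMillis(int batches, long fetchMs, long processMs) {
        return (long) batches * (fetchMs + processMs);
    }

    /**
     * Prefetching scan: the first fetch cannot be hidden; after that, each step costs
     * whichever is slower, fetching the next batch or processing the current one.
     * The last batch still has to be processed after its fetch completes.
     */
    static long asyncMillis(int batches, long fetchMs, long processMs) {
        if (batches <= 0) {
            return 0;
        }
        return fetchMs + (batches - 1) * Math.max(fetchMs, processMs) + processMs;
    }
}
```

For 100 batches with a 10 ms fetch and no processing, both functions give 1000 ms, which matches the observation that a loop calling next() without processing behaves like a sync scan. With 10 ms of processing per batch, the sync model gives 2000 ms and the prefetching model gives 1010 ms.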
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345030#comment-14345030 ] Hadoop QA commented on HBASE-13071:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12702139/HBASE-13071_trunk_3.patch against master branch at commit daed00fc98167870463e77b620e9adb6ce9b204d.
ATTACHMENT ID: 12702139

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1939 checkstyle errors (more than the master's current 1936 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
{color:red}-1 core zombie tests{color}.
There are 1 zombie test(s): Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13061//console This message is automatically generated. 
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345329#comment-14345329 ] stack commented on HBASE-13071:
---
[~eshcar] Any reason why your formatting is unorthodox compared to the rest of the code base? There are some notes on formatting here, if that helps: http://hbase.apache.org/book.html#_ides

Please add class comments describing what each class does.

isPrefetchRunning is the name of the method you would call to find the value of the boolean prefetchRunning; data members shouldn't have an 'is' prefix (javabean idiom).

Do we have to add a new executor pool? Could we take one in on construction, at least optionally (with perhaps the default being that we pass in the table's executor)? This could be done in a follow-up patch. In general we create too many threads in the client and have been trying to go on a diet (but you know how diets go)... In fact you already take in a pool on construction. Can you exploit this passed-in pool rather than make one of your own?

On close, if a prefetch is outstanding, we let it continue rather than interrupt it?

We already have AbstractClientScanner. Rather than make ClientScanner abstract as well, could we not push what ClientScanner has down into ACS? Or add a 'cache' or 'prefetch' interface that subclasses of ACS could implement?

Your formatting is a little irregular (smile).

IMO this should be ON by default.

I'm trying to get you some pretty pictures to show the speedup. Will be back. Thanks for the patch.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346017#comment-14346017 ] stack commented on HBASE-13071:
---
Wondering why we have a pool data member when we are already passing the pool to the super class and the super class has a getPool accessor.

On your feedback: I understand that you add caching to ClientScanner, but why not to AbstractClientScanner? A hierarchy in which AbstractClientScanner is subclassed to make a ClientScanner (which is itself abstract), which is in turn subclassed by ClientAsyncPrefetchScanner, is a little ugly; can we cut out the ClientScanner tier?

bq. How would you suggest to get a hold of the thread executing the prefetch, so as to interrupt it on close?

You will only ever have a single prefetcher? If so, an executor pool is probably overkill. Just start a single thread that you control?

Formatting irregularities are still in there...

Pictures coming... they are provoking interesting questions (smile)
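The suggestion to start a single thread that the scanner controls, and to interrupt it on close, might look roughly like the following. This is a minimal sketch and not the code in the patch: the class name, the use of String rows, and the batch iterator standing in for the RPC to the regionserver are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: one dedicated prefetch thread feeding a bounded cache. */
public final class SinglePrefetcher implements AutoCloseable {
  private final BlockingQueue<String> cache;
  private final Thread prefetcher;

  /** The batches iterator stands in for successive scan RPCs (an assumption). */
  public SinglePrefetcher(Iterator<List<String>> batches, int capacity) {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
    this.cache = queue;
    this.prefetcher = new Thread(() -> {
      try {
        while (batches.hasNext()) {
          for (String row : batches.next()) {
            queue.put(row); // blocks while the cache is full
          }
        }
      } catch (InterruptedException e) {
        // close() asked the prefetcher to stop; restore the flag and exit.
        Thread.currentThread().interrupt();
      }
    }, "scan-prefetcher");
    this.prefetcher.setDaemon(true);
    this.prefetcher.start();
  }

  /** Returns the next row, or null if none arrives within the timeout. */
  public String next(long timeoutMs) {
    try {
      return cache.poll(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
  }

  /** Accessor named with the 'is' prefix; the field itself carries no such prefix. */
  public boolean isPrefetchRunning() {
    return prefetcher.isAlive();
  }

  /** Interrupts an outstanding prefetch and waits for the thread to exit. */
  @Override
  public void close() {
    prefetcher.interrupt();
    try {
      prefetcher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Because the scanner owns the thread, close() has a direct handle to interrupt, which is harder to arrange when the prefetch task is submitted to a shared executor pool. A real scanner would also need to release server-side resources after the thread exits; that part is omitted here.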
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346012#comment-14346012 ] Jonathan Lawlor commented on HBASE-13071:
-
[~eshcar] I have added some review comments below to follow up on stack's comments:

When defining the capacity for the concurrent queue as below:
{code:title=ClientAsyncPrefetchScanner.java}
...
protected void initCache() {
  // concurrent cache
  // double buffer - double cache size
  cache = new LinkedBlockingQueue<Result>(this.caching*2 + 1);
}
...
{code}
we need to check the size of caching first to make sure that overflow does not occur. For example, in the case that this.caching > Integer.MAX_VALUE / 2, this will throw an IllegalArgumentException. This is important in the case that the user has configured Scan#caching=Integer.MAX_VALUE and Scan#maxResultSize to be a nice chunk size (this configuration is used when the user wants to receive responses of a certain heap size from the server rather than responses with a certain number of rows).

When close() is called and the prefetch is running, we still need to end up calling super.close() at some point. In ClientScanner, the call to close() ensures that the RegionScanner is closed on the server side, so it is important that we do not miss this call.

The javadoc on the async ClientScanner seems to indicate that the prefetch will be issued when the cache is half full, but it looks like the cache size check is using caching rather than caching / 2. My guess is that the first two calls to ClientScanner#next() would both kick off RPC calls. The first would fetch the initial chunk containing caching number of rows, and the second call to next would kick off a prefetch (since one Result was consumed by the first call and thus the cache size will be caching - 1).

Some javadoc on the async parameter inside Scan.java may be helpful, just to clarify how the parameter is used. For example, the parameter currently won't have any effect in the case that the user has set Scan#setSmall or Scan#setReversed.

Looks like there may be some minor formatting issues still hanging around in the latest patch (e.g. tabs should be 2 spaces instead of 4). You may have already seen it, but in the link [~stack] pointed out, there is mention of a plugin that can be used with IntelliJ to let Eclipse formatters work with it; any luck with that? (Having the formatter in the IDE avoids headaches :))

Looking forward to getting this one in!
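The overflow concern raised in the review can be made concrete with a small sketch. It is not the fix that went into the patch: the helper names are hypothetical, clamping the capacity to Integer.MAX_VALUE is one possible policy among several, and the half-of-caching threshold is one reading of the javadoc mentioned in the comment.

```java
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical helpers for sizing a double-buffered prefetch cache. */
public final class PrefetchCacheSizing {
  private PrefetchCacheSizing() {}

  /**
   * Double-buffer capacity (2 * caching + 1), computed in long arithmetic and
   * clamped so that a large caching value cannot wrap around to a negative int.
   */
  public static int queueCapacity(int caching) {
    if (caching <= 0) {
      throw new IllegalArgumentException("caching must be positive: " + caching);
    }
    long wanted = 2L * caching + 1;
    return (int) Math.min(wanted, Integer.MAX_VALUE);
  }

  /** One reading of the javadoc: prefetch once the cache holds at most caching / 2 rows. */
  public static boolean shouldPrefetch(int cachedRows, int caching) {
    return cachedRows <= caching / 2;
  }

  /** Builds the bounded queue; LinkedBlockingQueue rejects a capacity <= 0. */
  public static <T> LinkedBlockingQueue<T> newCache(int caching) {
    return new LinkedBlockingQueue<>(queueCapacity(caching));
  }

  public static void main(String[] args) {
    int caching = Integer.MAX_VALUE / 2 + 1;
    // The naive int expression wraps around to a negative number.
    System.out.println("naive   = " + (caching * 2 + 1));
    System.out.println("clamped = " + queueCapacity(caching));
  }
}
```

LinkedBlockingQueue allocates nodes lazily, so a clamped capacity of Integer.MAX_VALUE does not reserve memory up front; in that configuration the effective bound on the cache would have to come from the result size limit instead.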
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345929#comment-14345929 ] Eshcar Hillel commented on HBASE-13071:
---
Thanks [~stack] for your comments; I applied most of them.
** The cache is defined in the context of ClientScanner, so its initialization and the prefetch methods are defined there. IMHO, the entire hierarchy requires major refactoring (e.g., because of code duplication), but this should be done in the scope of a different jira :).
** How would you suggest to get a hold of the thread executing the prefetch, so as to interrupt it on close?
** Apologies for the formatting irregularities. I use IntelliJ, which fails to import the Eclipse formatting as suggested in the help page you referred me to.
** Waiting (patiently) for the pictures...
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346119#comment-14346119 ] Hadoop QA commented on HBASE-13071:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12702279/HBASE-13071_trunk_4.patch against master branch at commit 524791bcf5d41202b5da9293896078b45067699a.
ATTACHMENT ID: 12702279

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1940 checkstyle errors (more than the master's current 1935 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/checkstyle-aggregate.html Javadoc warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13070//artifact/patchprocess/patchJavadocWarnings.txt Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13070//console This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14346239#comment-14346239 ] Hadoop QA commented on HBASE-13071:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12702279/HBASE-13071_trunk_4.patch against master branch at commit 883d6fd8e512b14c967d2f7acf78d2b1d40e40fe.
ATTACHMENT ID: 12702279

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1940 checkstyle errors (more than the master's current 1935 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:green}+1 core tests{color}. The patch passed unit tests in .
{color:red}-1 core zombie tests{color}.
There are 1 zombie test(s): at org.apache.tajo.engine.query.TestJoinQuery.testLeftOuterJoinWithEmptySubquery1(TestJoinQuery.java:473) Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/checkstyle-aggregate.html Javadoc warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13072//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13072//console This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342129#comment-14342129 ] Hadoop QA commented on HBASE-13071:
---
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12701690/HBASE-13071_trunk_1.patch against master branch at commit dad2474f08d201d09989e36f5cf1c25d3fa4acee.
ATTACHMENT ID: 12701690

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.
{color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions (2.4.1 2.5.2 2.6.0)
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 checkstyle{color}. The applied patch generated 1946 checkstyle errors (more than the master's current 1937 errors).
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 lineLengths{color}.
The patch introduces the following lines longer than 100: +ClusterConnection connection, RpcRetryingCallerFactory rpcCallerFactory, + super(configuration,scan,name,connection,rpcCallerFactory,rpcControllerFactory,pool,replicaCallTimeoutMicroSecondScan); + public ClientSimpleScanner(Configuration configuration, Scan scan, TableName name, ClusterConnection connection, + RpcRetryingCallerFactory rpcCallerFactory, RpcControllerFactory rpcControllerFactory, + ExecutorService pool, int replicaCallTimeoutMicroSecondScan) throws IOException { + super(configuration,scan,name,connection,rpcCallerFactory,rpcControllerFactory,pool,replicaCallTimeoutMicroSecondScan); + public static final String HBASE_CLIENT_SCANNER_ASYNC_PREFETCH = hbase.client.scanner.async.prefetch; {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.TestInterfaceAudienceAnnotations org.apache.hadoop.hbase.client.TestOperation Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//artifact/patchprocess/checkstyle-aggregate.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/13021//console This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342109#comment-14342109 ]

Eshcar Hillel commented on HBASE-13071:

New patches for 0.98 and trunk are available. Link to review board.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342842#comment-14342842 ]

stack commented on HBASE-13071:

I tried this and saw improvement. Let me come back with some pretty pictures.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342280#comment-14342280 ]

Ted Yu commented on HBASE-13071:

ClientAsyncPrefetchScanner.java and ClientSimpleScanner.java need audience annotation.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341359#comment-14341359 ]

Ted Yu commented on HBASE-13071:

There has been an effort to stabilize the trunk build. Please prepare a patch for trunk.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338439#comment-14338439 ]

Hadoop QA commented on HBASE-13071:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701068/HBASE-13071-v1.patch
against master branch at commit 1c957b65b16a8706caee140c18b84ea48a0dc0aa.
ATTACHMENT ID: 12701068

{color:red}-1 @author{color}. The patch appears to contain 2 @author tags which the Hadoop community has agreed to not allow in code contributions.

{color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/12978//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338596#comment-14338596 ]

Eshcar Hillel commented on HBASE-13071:

I tried running the code from master, but encountered problems (even without the patch). Specifically, after running commands from the hbase shell I wasn't able to exit the shell properly, so I created the patch based on 0.98.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338479#comment-14338479 ]

Hadoop QA commented on HBASE-13071:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12701075/HBASE-13071-v2.patch
against master branch at commit 1c957b65b16a8706caee140c18b84ea48a0dc0aa.
ATTACHMENT ID: 12701075

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/12979//console

This message is automatically generated.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338571#comment-14338571 ]

Ted Yu commented on HBASE-13071:

For ClientAsyncPrefetchScanner.java and ClientSimpleScanner.java:
* please add the license header
* add the annotation for audience

Mind putting the patch on reviewboard?

Thanks
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338581#comment-14338581 ]

Eshcar Hillel commented on HBASE-13071:

I use the following commands to create the patches:
git format-patch 0.98 --minimal --stdout > HBASE-13071-v1.patch
git diff --no-prefix 0.98 > HBASE-13071-v2.patch

Any idea why these can't be applied?
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338590#comment-14338590 ]

Ted Yu commented on HBASE-13071:

Applying patch v2 resulted in the following rejects:
{code}
-rw-r--r-- 1 tyu staff 1803 Feb 26 07:48 ./hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java.rej
-rw-r--r-- 1 tyu staff 730 Feb 26 07:48 ./hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClientSmallScanner.java.rej
-rw-r--r-- 1 tyu staff 689 Feb 26 07:48 ./hbase-client/src/main/java/org/apache/hadoop/hbase/client/HTable.java.rej
-rw-r--r-- 1 tyu staff 696 Feb 26 07:48 ./hbase-client/src/main/java/org/apache/hadoop/hbase/client/ReversedClientScanner.java.rej
-rw-r--r-- 1 tyu staff 1374 Feb 26 07:48 ./hbase-client/src/main/java/org/apache/hadoop/hbase/client/Scan.java.rej
-rw-r--r-- 1 tyu staff 1504 Feb 26 07:48 ./hbase-client/src/main/java/org/apache/hadoop/hbase/client/TableConfiguration.java.rej
{code}
Please update your workspace to latest master.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338599#comment-14338599 ]

Eshcar Hillel commented on HBASE-13071:

I tried running the code from master, but encountered problems (even without the patch). Specifically, after running commands from the hbase shell I wasn't able to exit the shell properly, so I created the patch based on 0.98.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338438#comment-14338438 ]

Eshcar Hillel commented on HBASE-13071:

I've just attached the patch. The default value for scanners is sync. This can be easily changed.

Addressing the issue raised by [~jonathan.lawlor]: The prefetch logic is the same for sync and async scanners. Therefore, the async scanner stops issuing RPCs if the max result size is exceeded. However, since the prefetch is executed in the background, it is possible that the size of the data inside the cache exceeds the max size set by the user (which cannot happen with the sync scanner). There are ways to handle this, but they require knowing the size of the data in the cache at any point and limiting the size of the data retrieved from the server with respect to this size. This may reduce the performance gain.

I plan to attach a patch of the YCSB extension if anyone wants to re-run the experiments.
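The handling Eshcar describes (knowing the size of the cached data at any point and limiting what is requested from the server accordingly) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in the attached patch: `SizeTrackedCache`, `prefetchBudget` and the use of `byte[]` as a stand-in for a row are all hypothetical names introduced here.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical client-side cache that tracks how many bytes it holds. */
final class SizeTrackedCache {
    private final Queue<byte[]> rows = new ArrayDeque<>();
    private final long maxBytes;   // stand-in for the user's max result size
    private long bytesInCache;

    SizeTrackedCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Bytes the next prefetch may request without exceeding the limit. */
    synchronized long prefetchBudget() {
        return Math.max(0L, maxBytes - bytesInCache);
    }

    /** Called by the prefetching thread when a row arrives. */
    synchronized void add(byte[] row) {
        rows.add(row);
        bytesInCache += row.length;
    }

    /** Called by the application; returns null when the cache is empty. */
    synchronized byte[] poll() {
        byte[] row = rows.poll();
        if (row != null) {
            bytesInCache -= row.length;
        }
        return row;
    }

    public static void main(String[] args) {
        SizeTrackedCache cache = new SizeTrackedCache(100);
        cache.add(new byte[60]);
        System.out.println("budget after one row: " + cache.prefetchBudget());
    }
}
```

A prefetcher would ask for at most `prefetchBudget()` bytes per RPC and skip the RPC when the budget is zero. The extra bookkeeping and the smaller requests are where the reduced performance gain mentioned above would come from.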
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334019#comment-14334019 ]

Jonathan Lawlor commented on HBASE-13071:

Those performance improvements look great [~eshcar]! I agree with [~stack] that this configuration should be on by default.

I have been thinking about this asynchronous scanner a bit and was wondering how it deals with the case where the user has specified a maximum result size via Scan#setMaxResultSize():
* Does the prefetch mechanism attempt to prefetch a certain number of rows and loop until that number of rows is retrieved (i.e. does the prefetch only stop making prefetch RPCs once it has accumulated the expected number of rows)?
* If the prefetch does stop when the maxResultSize limit has been exceeded, when would the next prefetch be initiated? When next() is called and the cache is observed to have less than half the specified caching value?
** If that's the case, I was thinking that a table with very large rows could potentially create memory issues on the client (and this may be a good reason to make this async behavior configurable rather than go all-in on async scanners).
** The scenario that I have in mind is the case where the client is not able to hold X rows in memory, where X is half of the specified scanner caching (i.e. the case where (half_caching_limit * size_of_row) exceeds the available memory on the client). The synchronous scanners currently guard the client from out of memory exceptions by considering the remaining result size after each next RPC -- if the maxResultSize limit is exceeded the scanner stops making RPCs, regardless of how many rows the scanner has received from RPCs. However, in the case of async scanners (if they work how I am thinking they do) each call to next() may trigger a prefetch, and each prefetch would receive only Y rows, where Y < half_of_caching. The prefetches would be triggered on each call to next() until we have accumulated half_of_caching rows in our scanner cache, and the client may OOM before that limit can be reached.
*** In such a case, it may be best to instead use a sync scanner. Alternatively, we could keep track of the size of results in our cache and allow the user to specify maxSizeOfCache so that prefetches aren't continuously fired off on calls to next() when scanning a table with large rows.
*** This also raises the question of whether it would be worthwhile to have an asynchronous scanner that used the size of the Results (in memory) as the deciding factor for prefetching -- i.e. if the size of the results in the cache is less than half of maxResultSize then perform a prefetch for another chunk of maxResultSize worth of values... just a thought.

My concern above is definitely a corner case issue, but I thought I'd raise it for discussion.

I like the idea that you pointed out towards the end about ramping the prefetch size up from a small initial size to the actual prefetch size. Looks like it would definitely help ease the initial latency jump seen for large prefetches. Really nice results :)
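To make the two prefetch triggers discussed in Jonathan's comment concrete, here is a small sketch that contrasts a row-count trigger with a size-based one, along with the (half_caching_limit * size_of_row) arithmetic behind the memory concern. The class and method names are hypothetical and are not taken from the patch; the thresholds simply follow the "less than half" wording in the comment.

```java
/** Hypothetical prefetch-trigger policies; not the logic in the patch. */
final class PrefetchPolicy {
    private PrefetchPolicy() {}

    /** Row-count trigger: prefetch when fewer than half of `caching` rows remain. */
    static boolean shouldPrefetchByRows(int rowsInCache, int caching) {
        return rowsInCache < caching / 2;
    }

    /** Size trigger: prefetch when cached bytes fall below half of maxResultSize. */
    static boolean shouldPrefetchBySize(long bytesInCache, long maxResultSize) {
        return bytesInCache < maxResultSize / 2;
    }

    /** Bytes the cache must hold before the row-count trigger stops firing. */
    static long bytesNeededToSatisfyRowTrigger(int caching, long rowSizeBytes) {
        return (long) (caching / 2) * rowSizeBytes;
    }

    public static void main(String[] args) {
        long tenMiB = 10L * 1024 * 1024;
        // With caching = 100 and 10 MiB rows, the row-count trigger keeps
        // firing until 50 rows are cached.
        System.out.println(bytesNeededToSatisfyRowTrigger(100, tenMiB) + " bytes");
    }
}
```

With a caching value of 100 and rows of 10 MiB each, the row-count trigger only stops firing once roughly 500 MiB is cached, whereas the size trigger is bounded by maxResultSize regardless of row size.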
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334477#comment-14334477 ]
Lars Hofhansl commented on HBASE-13071:

Interesting. I had experimented with a blocking queue and background threads a year or two ago and found that the synchronization overhead ate up most of the benefits (I did not build in any artificial delay, though). Eager to see a patch :)
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332166#comment-14332166 ]
Eshcar Hillel commented on HBASE-13071:

Thanks all for your comments.

@stack, I wasn't aware of the discussions in the other Jiras; thanks for posting the links. I am now up to date.

@Lars, the concurrent queue in the suggested modification is implemented as a LinkedBlockingQueue (which, in addition to efficient put and get operations, provides an efficient count operation). But we can discuss alternatives, including devising a dedicated data structure if it looks like this can improve performance.

The suggested modification focuses on managing the concurrent queue at the client side, but still applies the pull model, where the client pulls the data from the server. To support true streaming, a push model, where the server pushes the data to the client, might be better. In both cases a concurrent queue is part of the solution.

I am attaching some evaluation results. The next step is to provide a patch for 0.98.
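To make the producer-consumer arrangement described above concrete, here is a minimal sketch of a pull-model prefetcher built on a LinkedBlockingQueue. It is not the code from the attached patch: the class `PrefetchingScanner` is hypothetical, the region server is stood in for by an iterator of row batches, rows are assumed to be non-null, and the bounded capacity is an assumption added to keep the cache from growing without limit.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical client-side scanner that prefetches rows on a background thread. */
final class PrefetchingScanner<T> {
    private final BlockingQueue<Optional<T>> queue;
    private boolean exhausted = false;

    /** batches stands in for successive scan RPC responses (an assumption). */
    PrefetchingScanner(Iterator<List<T>> batches, int capacity) {
        BlockingQueue<Optional<T>> q = new LinkedBlockingQueue<>(capacity);
        this.queue = q;
        Thread prefetcher = new Thread(() -> {
            try {
                while (batches.hasNext()) {
                    for (T row : batches.next()) {
                        q.put(Optional.of(row)); // blocks while the cache is full
                    }
                }
                q.put(Optional.empty()); // end-of-scan marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "scan-prefetcher");
        prefetcher.setDaemon(true);
        prefetcher.start();
    }

    /** Returns the next row, or null once the scan is exhausted. */
    T next() {
        if (exhausted) {
            return null;
        }
        try {
            Optional<T> item = queue.take(); // blocks only if the cache is empty
            if (!item.isPresent()) {
                exhausted = true;
                return null;
            }
            return item.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for rows", e);
        }
    }

    /** Number of queued entries; may include the end-of-scan marker. */
    int buffered() {
        return queue.size();
    }
}
```

The application thread blocks in `next()` only when the background thread has fallen behind, which is the case the description expects to be rare when processing time and fetch time are balanced.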
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332170#comment-14332170 ]
Eshcar Hillel commented on HBASE-13071:

This config setting can easily be removed. Our main concern was to preserve backward compatibility for users, and specifically to maintain the existing scan behavior unless the caller explicitly asks for the asynchronous scanner. Since the asynchronous scanner uses a concurrent data structure, which entails some overhead, in some cases (short scans, for example) the caller might prefer to use a synchronous scan.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330005#comment-14330005 ]
stack commented on HBASE-13071:

[~eshcar] Design doc looks straightforward... no continental shifts going on here.

One question for you is whether you are 'up' on all that has gone on in this space previously. If not, let me dig it up for you (in particular, an experiment that added faux streaming via a hacked-up new port on the server that allowed the client, via a new channel, to get a KV/Cell at a time, with nice improvements in throughput).

On 'hbase.client.scanner.async.prefetch=true', let me suggest that this feature be on, not off, by default. Why would you want the current behavior if this is available? In fact, IMO, don't bother offering this config, presuming the throughput is better after this change, as I expect it will be.
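The default-on versus default-off question for 'hbase.client.scanner.async.prefetch' comes down to what the client does when the key is absent. A small sketch, under stated assumptions: `java.util.Properties` stands in here for HBase's own configuration class, and the helper `ScannerMode.useAsyncPrefetch` is hypothetical; only the key name comes from the discussion.

```java
import java.util.Properties;

/** Hypothetical helper that decides between the sync and async scanner. */
final class ScannerMode {
    /** Key name as quoted in the discussion. */
    static final String ASYNC_PREFETCH_KEY = "hbase.client.scanner.async.prefetch";

    private ScannerMode() {
    }

    /**
     * Returns the configured value, or defaultValue when the key is unset.
     * Passing defaultValue = true models "on by default"; false models opt-in.
     */
    static boolean useAsyncPrefetch(Properties conf, boolean defaultValue) {
        String raw = conf.getProperty(ASYNC_PREFETCH_KEY);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }
}
```

With an opt-in default, existing applications keep the synchronous behavior unless they set the key; with a default of true, callers running short scans would set it to false to avoid the queue overhead.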
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329342#comment-14329342 ]
Andrew Purtell commented on HBASE-13071:

This might also be the next incarnation of HBASE-8691.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328514#comment-14328514 ]
Lars Hofhansl commented on HBASE-13071:

Let's close one of these issues.
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328607#comment-14328607 ]
Lars Hofhansl commented on HBASE-13071:

There are many ways to do this:
# Managing two buffers: one is filled by a background thread while the other is used by the client thread, and then they are switched.
# Managing a queue on the client: the user thread polls from it, and a background thread pushes data in as it gets it from the server. A blocking queue makes this simple, but comes with synchronization overhead.

In any event, unless we rewrite client and server to support true streaming, this means extra buffering of some form regardless of the implementation.
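The two-buffer option mentioned above can be sketched as follows. This is an illustration under assumptions, not an implementation from any HBase branch: the class `DoubleBufferedScanner` is hypothetical, the server is modeled as an iterator of batches, and an empty batch is taken to mean the scan has ended.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical scanner that reads one buffer while a second is filled. */
final class DoubleBufferedScanner<T> implements AutoCloseable {
    private final Iterator<List<T>> source; // stand-in for scan RPCs (assumption)
    private final ExecutorService fetcher = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "scan-fetcher");
        t.setDaemon(true);
        return t;
    });
    private final Callable<List<T>> fetchTask;
    private Deque<T> current = new ArrayDeque<>(); // buffer read by the caller
    private Future<List<T>> pending;               // buffer being filled
    private boolean exhausted = false;

    DoubleBufferedScanner(Iterator<List<T>> source) {
        this.source = source;
        this.fetchTask = this::fetch;
        this.pending = fetcher.submit(fetchTask);
    }

    /** Runs on the background thread; an empty list signals end of scan. */
    private List<T> fetch() {
        return source.hasNext() ? source.next() : Collections.<T>emptyList();
    }

    /** Returns the next row, or null once the scan is exhausted. */
    T next() {
        while (current.isEmpty() && !exhausted) {
            List<T> filled = await(pending);
            if (filled.isEmpty()) {
                exhausted = true;
            } else {
                current = new ArrayDeque<>(filled);   // switch buffers
                pending = fetcher.submit(fetchTask);  // refill in the background
            }
        }
        return current.poll();
    }

    private static <R> R await(Future<R> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        }
    }

    @Override
    public void close() {
        fetcher.shutdownNow();
    }
}
```

Compared with a shared blocking queue, synchronization here happens once per batch (at the `Future.get()` hand-off) rather than once per row, at the cost of holding up to two batches in memory.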
[jira] [Commented] (HBASE-13071) Hbase Streaming Scan Feature
[ https://issues.apache.org/jira/browse/HBASE-13071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327808#comment-14327808 ]
Jonathan Lawlor commented on HBASE-13071:

This sounds like a great feature. There is some discussion over in HBASE-11544 about the inefficiency of the way that the current (synchronous) scanners use the network (which led to HBASE-12994), as well as discussion about how to move Scan RPCs into the realm of streaming. This seems like it would address both of those issues and should provide some nice performance gains. Looking forward to this.