[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502624#comment-17502624 ]

Lars Hofhansl commented on PHOENIX-6501:

As discussed in PHOENIX-6458, there was an issue with synchronously creating the global index. With that out of the way this seems to work fine.
In my test env I didn't see a perf improvement, but that's because everything is local, and so the network is negligible.

> Use batching when joining data table rows with uncovered index rows
> ---
>
> Key: PHOENIX-6501
> URL: https://issues.apache.org/jira/browse/PHOENIX-6501
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 5.1.2
> Reporter: Kadir Ozdemir
> Assignee: Kadir OZDEMIR
> Priority: Major
> Attachments: PHOENIX-6501.master.001.patch
>
>
> PHOENIX-6458 extends the existing uncovered local index support for global
> indexes. The current solution uses HBase get operations to join data table
> rows with uncovered index rows on the server side. Doing a separate RPC call
> for every data table row can be expensive. Instead, we can buffer lots of
> data row keys in memory, use a skip scan filter and even multiple threads to
> issue a separate scan for each data table region in parallel. This will
> reduce the cost of join and also improve the performance.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
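The batching idea in the issue description can be illustrated with a small, self-contained Java sketch. This is not Phoenix code: the class and method names (`RowKeyBatchingSketch`, `toBatches`, `requestCount`) are hypothetical, the batch size of 1,000 is an arbitrary assumption, and the request count deliberately ignores the fan-out of each batch to several data table regions. It only shows how buffering row keys turns one request per row into one request per batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: models buffering of data row keys, not the actual patch.
final class RowKeyBatchingSketch {

    /** Splits the buffered row keys into consecutive batches of at most batchSize keys. */
    static <T> List<List<T>> toBatches(List<T> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rowKeys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, rowKeys.size());
            // Copy the sub-list so each batch is independent of the buffer.
            batches.add(new ArrayList<>(rowKeys.subList(i, end)));
        }
        return batches;
    }

    /** Number of requests needed when each request carries up to batchSize row keys. */
    static long requestCount(long rowCount, int batchSize) {
        return (rowCount + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        long matchingRows = 2_000_000L; // assumed number of index rows to join
        System.out.println("one get per row:  " + requestCount(matchingRows, 1));
        System.out.println("batches of 1000:  " + requestCount(matchingRows, 1000));
        System.out.println(toBatches(Arrays.asList("k1", "k2", "k3", "k4", "k5"), 2));
    }
}
```

Under these assumptions, 2,000,000 matching rows need 2,000,000 single-row requests but only 2,000 batched ones. In a real cluster each batch would still be split across the regions it touches, so the actual request count would be higher.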
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502556#comment-17502556 ]

Lars Hofhansl commented on PHOENIX-6501:

That might be a bit tricky. I loaded the TPCH lineitem table (scale factor 3) into Phoenix via the Trino connector.
{code}
CREATE TABLE phoenix.default.lineitem (
   orderkey bigint NOT NULL,
   partkey bigint,
   suppkey bigint,
   linenumber integer NOT NULL,
   quantity double,
   extendedprice double,
   discount double,
   tax double,
   returnflag varchar(1),
   linestatus varchar(1),
   shipdate date,
   commitdate date,
   receiptdate date,
   shipinstruct varchar(25),
   shipmode varchar(10),
   comment varchar(44)
)
WITH (
   compression = 'ZSTD',
   data_block_encoding = 'ROW_INDEX_V1',
   disable_wal = true,
   immutable_rows = true,
   rowkeys = 'ORDERKEY,LINENUMBER'
)
{code}
(I disable the WAL everywhere, because that's not what I am testing and it speeds up loading/creating.)

Then I created the global index on the tax column:
{{create index g_l_tax on lineitem(tax) DISABLE_WAL=true;}}

Then I ran:
{{select /*+ INDEX(lineitem g_l_tax) */ count(suppkey) from lineitem where tax = 0.08}}

Let me connect with you offline and see if I can send you a CSV with the lineitem data.
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502543#comment-17502543 ]

Kadir OZDEMIR commented on PHOENIX-6501:

[~larsh], thank you for checking it. Would you please post the steps to run the test? I want to run it and see if I can find the root cause.
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502541#comment-17502541 ]

Lars Hofhansl commented on PHOENIX-6501:

I am testing the attached patch by running a query on a table with 18m rows that selects (counts) 2m of them.
The query on the uncovered global index *does not finish* (I stopped it after 10 minutes). :(
With no index it takes about 7s; with an uncovered local index it takes about 10s (due to the merging cost and low selectivity of the query).
So there's some bug somewhere.
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501638#comment-17501638 ]

ASF GitHub Bot commented on PHOENIX-6501:

kadirozde commented on pull request #1399:
URL: https://github.com/apache/phoenix/pull/1399#issuecomment-1059669953

> Can we do this for local indexes as well? There is also a significant cost to seeking (even when done locally)
>
> In fact the only difference might be how we get a table reference.

@lhofhansl, we can definitely do this for local indexes, but there are some differences.

The first, as you pointed out, is table vs. region. For local indexes, both data and index are accessed via the local region, while the global index is accessed remotely, so we need to get the table for the global index and then get connections to access the table. The concern of which connection pool to use does not apply to the local index.

The second difference is that, for the local index, there is only one table region to retrieve data rows from. For the global index, however, there can be many. So for global indexes we discover the table region boundaries and access the regions in parallel using a thread pool, which is not necessary for the local index.

The last difference is handling the row key offset for local indexes, which is not necessary for global indexes.

So, instead of lumping local and global index batching together, I thought we should handle them separately. I suggest having a separate Jira and PR for the local index.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@phoenix.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
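The second difference described in the comment above (discovering region boundaries and reading the regions in parallel with a thread pool) can be sketched in plain Java. This is a hedged illustration, not the code in the pull request: `RegionParallelLookupSketch`, `groupByRegion`, and `fetchInParallel` are hypothetical names, row keys are modelled as `String` rather than HBase `byte[]`, the region start keys are passed in instead of being discovered from the cluster, and an in-memory map stands in for the per-region scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: an in-memory map plays the role of the data table.
final class RegionParallelLookupSketch {

    /** Groups row keys by region; a region is identified by the greatest start key <= the row key. */
    static TreeMap<String, List<String>> groupByRegion(List<String> regionStartKeys, List<String> rowKeys) {
        TreeSet<String> starts = new TreeSet<>(regionStartKeys);
        TreeMap<String, List<String>> byRegion = new TreeMap<>();
        for (String rowKey : rowKeys) {
            String regionStart = starts.floor(rowKey);
            if (regionStart == null) {
                throw new IllegalArgumentException("no region covers row key: " + rowKey);
            }
            byRegion.computeIfAbsent(regionStart, k -> new ArrayList<>()).add(rowKey);
        }
        return byRegion;
    }

    /** Runs one lookup task per region on a fixed thread pool and merges the rows that were found. */
    static Map<String, String> fetchInParallel(List<String> regionStartKeys, List<String> rowKeys,
                                               Map<String, String> dataTable, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, String>>> pending = new ArrayList<>();
            for (List<String> regionKeys : groupByRegion(regionStartKeys, rowKeys).values()) {
                // Stand-in for a single scan that covers all buffered keys of one region.
                Callable<Map<String, String>> task = () -> {
                    Map<String, String> found = new TreeMap<>();
                    for (String key : regionKeys) {
                        String row = dataTable.get(key); // read-only access from worker threads
                        if (row != null) {
                            found.put(key, row);
                        }
                    }
                    return found;
                };
                pending.add(pool.submit(task));
            }
            Map<String, String> result = new TreeMap<>();
            for (Future<Map<String, String>> future : pending) {
                result.putAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> dataTable = new HashMap<>();
        dataTable.put("a", "row-a");
        dataTable.put("n", "row-n");
        dataTable.put("z", "row-z");
        // Two assumed regions: ["", "m") and ["m", end).
        List<String> regionStarts = Arrays.asList("", "m");
        System.out.println(groupByRegion(regionStarts, Arrays.asList("a", "n", "z")));
        System.out.println(fetchInParallel(regionStarts, Arrays.asList("a", "n", "x"), dataTable, 2));
    }
}
```

For a local index the grouping step would collapse to a single region, which is consistent with the comment's point that the thread pool is unnecessary in that case.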
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501617#comment-17501617 ]

ASF GitHub Bot commented on PHOENIX-6501:

lhofhansl edited a comment on pull request #1399:
URL: https://github.com/apache/phoenix/pull/1399#issuecomment-1059635660

Can we do this for local indexes as well? There is also a significant cost to seeking (even when done locally).

In fact, the only difference might be how we get a table reference.
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501615#comment-17501615 ]

ASF GitHub Bot commented on PHOENIX-6501:

lhofhansl commented on pull request #1399:
URL: https://github.com/apache/phoenix/pull/1399#issuecomment-1059635660

Can we do this for local indexes as well? There is also a significant cost to seeking (even when done locally)
[jira] [Commented] (PHOENIX-6501) Use batching when joining data table rows with uncovered index rows
[ https://issues.apache.org/jira/browse/PHOENIX-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499817#comment-17499817 ]

ASF GitHub Bot commented on PHOENIX-6501:

kadirozde opened a new pull request #1399:
URL: https://github.com/apache/phoenix/pull/1399

… index rows