[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218313#comment-17218313 ] liwei commented on HUDI-481: [~vinoth] I agree with you. 1. At present we cannot avoid getting the dataset first. I also agree that the log can contain just the updated column values and we will be able to merge them. If we had column statistics or a clustering index such as Z-ordering, this scenario could be optimized. 2. I see that Hudi support for Spark 3.0 will land soon. We can build the SQL API (HUDI-1297) on the Spark DataSource V2 API, under HUDI-1297's scope.

> Support SQL-like method
> ---
>
> Key: HUDI-481
> URL: https://issues.apache.org/jira/browse/HUDI-481
> Project: Apache Hudi
> Issue Type: Improvement
> Components: CLI
> Reporter: cdmikechen
> Priority: Minor
>
> As we know, Hudi uses the Spark datasource API to upsert data. For example, if we
> want to update a record, we need to get the old row's data first and then use the
> upsert method to update that row.
> But there is another situation where someone just wants to update one column
> of data. Expressed as SQL, it is {{update table set col1 = X where col2 = Y}}.
> This is something Hudi cannot handle directly at present; we can only get all
> the data involved as a dataset first and then merge it.
> So I think maybe we can create a new subproject to process batch data in an
> SQL-like way. For example:
> {code}
> val hudiTable = new HudiTable(path)
> hudiTable.update.set("col1 = X").where("col2 = Y")
> hudiTable.delete.where("col3 = Z")
> hudiTable.commit
> {code}
> It could also extend the functionality to support JDBC-like RFC schemes:
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]
> I hope everyone can provide some suggestions to see if this plan is feasible.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
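The fluent API proposed in the issue description could be sketched in plain Scala. This is only a hypothetical illustration of the call shape, with no Spark or Hudi dependencies; the `Op` hierarchy, the builder classes, and the in-memory batch are invented names, and a real `commit` would hand the batch to the Hudi write client instead of returning it:

```scala
// Hypothetical sketch of the proposed fluent API: each call records an
// operation; commit returns (and clears) the recorded batch.
sealed trait Op
final case class Update(setExpr: String, whereExpr: String) extends Op
final case class Delete(whereExpr: String) extends Op

class HudiTable(val path: String) {
  private var pending = Vector.empty[Op]

  def update: UpdateBuilder = new UpdateBuilder(this)
  def delete: DeleteBuilder = new DeleteBuilder(this)
  def enqueue(op: Op): Unit = pending :+= op

  // In a real implementation this would translate the pending ops into
  // Hudi upsert/delete calls; here it just returns the recorded batch.
  def commit: Vector[Op] = { val batch = pending; pending = Vector.empty; batch }
}

class UpdateBuilder(table: HudiTable) {
  private var setExpr = ""
  def set(expr: String): UpdateBuilder = { setExpr = expr; this }
  def where(expr: String): HudiTable = { table.enqueue(Update(setExpr, expr)); table }
}

class DeleteBuilder(table: HudiTable) {
  def where(expr: String): HudiTable = { table.enqueue(Delete(expr)); table }
}

val hudiTable = new HudiTable("/tmp/hudi/table")
hudiTable.update.set("col1 = X").where("col2 = Y")
hudiTable.delete.where("col3 = Z")
val batch = hudiTable.commit
```

Each builder returns the table so calls chain exactly as in the proposed example.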
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217972#comment-17217972 ] Vinoth Chandar commented on HUDI-481: - >a. If we use a sql to describe, it is {{update table set col1 = X where col2 = Y}}. This is something hudi cannot deal with directly at present, we can only get all the data involved as a dataset first and then merge it. I don't think we can avoid getting the dataset first, i.e. reading the older parquet file to merge the record. In fact, I would argue that Hudi uniquely lets you deal with the single-column-update scenario today, by allowing custom payloads to specify the merging: the base file can contain the entire record and the log can contain just the updated column value, and we will be able to merge them. What we are missing is SQL support for merges, which we should build out under HUDI-1297's scope. wdyt?
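The merge behaviour described in the comment above (base file holds the full record, log holds only the updated columns) can be sketched in plain Scala. The record-as-Map representation and the `merge` helper are simplifications for illustration only, not Hudi's actual `HoodieRecordPayload` API:

```scala
// Simplified partial-update merge: the base record carries every column,
// the log record carries only the columns touched by the update. A custom
// payload's merge would overlay the log values onto the base values.
type Record = Map[String, Any]

def merge(base: Record, logUpdate: Record): Record =
  base ++ logUpdate // log values win for overlapping columns

val base      = Map[String, Any]("key" -> "r1", "col1" -> "old", "col2" -> "Y", "col3" -> 10)
val logRecord = Map[String, Any]("key" -> "r1", "col1" -> "X") // only the updated column

val merged = merge(base, logRecord)
```

Columns absent from the log record keep their base-file values, which is what makes a single-column update expressible without rewriting the whole row.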
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213618#comment-17213618 ] cdmikechen commented on HUDI-481: - Added a *relates to* link to HUDI-1341.
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213573#comment-17213573 ] liwei commented on HUDI-481: [~chenxiang] [~x1q1j1] Hi, I also have a plan for this: https://issues.apache.org/jira/browse/HUDI-1341. We can discuss it :D
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213522#comment-17213522 ] Forward Xu commented on HUDI-481: - Hi [~chenxiang], there is a part of the syntax that needs to be extended; we can refine it further. Compare https://docs.delta.io/latest/delta-update.html#update-a-table.
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213502#comment-17213502 ] cdmikechen commented on HUDI-481: - I have created a GitHub project at https://github.com/shangyuantech/hudi-sql. Later, I can show some design ideas and usage scenarios in that project.
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013665#comment-17013665 ] Vinoth Chandar commented on HUDI-481: - I see.. so if the future 3.x versions will have it, it's fine, right? We can just build based on that?
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012328#comment-17012328 ] cdmikechen commented on HUDI-481: - [~vinoth] I opened Spark's GitHub again this morning and suddenly realized that yesterday I was looking at the master branch (Spark 3.0). When I switched to version 2.4, there were no *UPDATE* or *MERGE* keywords. This shows that Spark does not support these keywords in version 2.4, which may be a problem.
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011282#comment-17011282 ] cdmikechen commented on HUDI-481: - [~vinoth] Oh~ I've seen in [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4] that Spark can recognize these keywords. I have built a project to try to see if it is feasible first. I don't know if it is related to V1 or V2. If I have a result, I will let you know as soon as possible.
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010275#comment-17010275 ] Vinoth Chandar commented on HUDI-481: - Hi [~chenxiang], [https://github.com/apache/spark/blob/master/docs/sql-keywords.md] does list the DELETE and UPDATE keywords in the language itself. I think it's up to the datasource to implement them. Can we consider this once we move to the DataSource V2 API first? Isn't that a prerequisite for this?
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009320#comment-17009320 ] cdmikechen commented on HUDI-481: - [~vinoth] I checked the Spark project. It seems that the Spark SQL syntax tree only supports the *DELETE* keyword at present; *UPDATE* and *MERGE* are not supported yet. I think this may be because Spark's design deals with relationships between datasets: existing operators can solve similar problems, but not in an SQL-like way. My current idea is to build a layer of SQL syntax on top of *hudi-core* and use antlr4 to process the semantics. For example, an update statement can be parsed into first filtering the data according to the where conditions and then upserting that data into Hudi.
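The rewrite described above (an UPDATE becomes "filter by the where condition, apply the assignment, upsert the result") can be sketched without antlr4 or Spark. The in-memory table keyed by record key, the predicate, and the assignment function are illustrative assumptions only:

```scala
type Row = Map[String, String]

// Toy version of "update t set col1 = X where col2 = Y": select the rows
// matching the predicate, apply the assignment, then upsert them back by key.
def updateAsUpsert(table: Map[String, Row],
                   pred: Row => Boolean,
                   assign: Row => Row): Map[String, Row] = {
  val changed = table.collect { case (k, row) if pred(row) => k -> assign(row) }
  table ++ changed // upsert: changed rows replace the originals
}

val table = Map(
  "r1" -> Map("col1" -> "old", "col2" -> "Y"),
  "r2" -> Map("col1" -> "old", "col2" -> "N"))

val result = updateAsUpsert(table, _("col2") == "Y", _.updated("col1", "X"))
```

Rows that fail the predicate pass through untouched, which matches the "filter then upsert" reading of the update statement.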
[jira] [Commented] (HUDI-481) Support SQL-like method
[ https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005507#comment-17005507 ] Vinoth Chandar commented on HUDI-481: - I am not sure if `CLI` is the right component for this. A few questions before I can triage this:
* Is this intended to be a Spark API? We have thought about adding support in Spark SQL to specify the merge logic vs the HoodieRecordPayload interface; this sounds similar.
* I think we need to move towards the Spark DataSource V2 API first, and then rethink how this will fit in (HUDI-30).