[jira] [Commented] (HUDI-481) Support SQL-like method

2020-10-21 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218313#comment-17218313
 ] 

liwei commented on HUDI-481:


[~vinoth] Agree with you.

1. At present we cannot avoid getting the dataset first. I agree that the log 
can just contain the updated column value and we will be able to merge it. If 
we have column statistics or a clustering index like z-ordering, this scenario 
can be optimized.

2. I see that Hudi's Spark 3.0 support will land soon. We can build the SQL 
API on top of the Spark DataSource V2 API, under HUDI-1297's scope.
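The column-statistics optimization mentioned above can be sketched roughly: if each data file tracks min/max values per column, files whose range cannot contain the predicate value can be skipped without being read. `FileStats` and its fields are illustrative stand-ins, not Hudi's actual metadata format:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of min/max-based file pruning for an equality predicate like
// `col2 = Y`. FileStats is a hypothetical stand-in for the per-file column
// statistics that a column-stats index or z-ordering layout would provide;
// only files whose [min, max] range can contain the value need to be read
// and merged.
class FilePruningSketch {
    static class FileStats {
        final String fileName;
        final long min, max; // min/max of the predicate column in this file

        FileStats(String fileName, long min, long max) {
            this.fileName = fileName;
            this.min = min;
            this.max = max;
        }
    }

    // Returns only the files that might contain rows where the column == value.
    static List<String> prune(List<FileStats> files, long value) {
        List<String> candidates = new ArrayList<>();
        for (FileStats f : files) {
            if (value >= f.min && value <= f.max) {
                candidates.add(f.fileName);
            }
        }
        return candidates;
    }
}
```

With statistics like these, an `update ... where col2 = Y` only needs to rewrite the pruned candidate files instead of scanning the whole table.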

> Support SQL-like method
> ---
>
> Key: HUDI-481
> URL: https://issues.apache.org/jira/browse/HUDI-481
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: CLI
>Reporter: cdmikechen
>Priority: Minor
>
> As we know, Hudi uses the Spark datasource API to upsert data. For example, if 
> we want to update a row, we need to get the old row's data first and then use 
> the upsert method to update it.
> But there is another situation where someone just wants to update one column. 
> Described in SQL, it is {{update table set col1 = X where col2 = Y}}. This is 
> something Hudi cannot deal with directly at present; we can only get all the 
> data involved as a dataset first and then merge it.
> So I think maybe we can create a new subproject to process batch data in an 
> SQL-like way. For example:
>  {code}
> val hudiTable = new HudiTable(path)
> hudiTable.update.set("col1 = X").where("col2 = Y")
> hudiTable.delete.where("col3 = Z")
> hudiTable.commit
> {code}
> It could also extend the functionality and support JDBC-like RFC schemes: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]
> Hope everyone can provide some suggestions to see whether this plan is feasible.
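The fluent API proposed in the snippet above could be structured along these lines; `HudiTableSketch`, `UpdateOp`, and `DeleteOp` are hypothetical names, and `commit()` here only returns the recorded plan rather than writing to a real table:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed fluent update/delete API. Each call
// records a pending operation against the table at `path`; a real commit()
// would translate the recorded plan into Hudi upsert/delete writes instead
// of just returning it.
class HudiTableSketch {
    private final String path;
    private final List<String> pending = new ArrayList<>();

    HudiTableSketch(String path) {
        this.path = path;
    }

    UpdateOp update() { return new UpdateOp(); }
    DeleteOp delete() { return new DeleteOp(); }

    // Returns the operations recorded so far, in order.
    List<String> commit() { return new ArrayList<>(pending); }

    class UpdateOp {
        private String setClause = "";

        UpdateOp set(String clause) {
            this.setClause = clause;
            return this; // allow chaining .set(...).where(...)
        }

        void where(String predicate) {
            pending.add("UPDATE " + path + " SET " + setClause
                    + " WHERE " + predicate);
        }
    }

    class DeleteOp {
        void where(String predicate) {
            pending.add("DELETE FROM " + path + " WHERE " + predicate);
        }
    }
}
```

Deferring execution to `commit()` matches the proposal's shape: multiple updates and deletes can be batched into a single table commit.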



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-481) Support SQL-like method

2020-10-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217972#comment-17217972
 ] 

Vinoth Chandar commented on HUDI-481:
-

>a. If we use a sql to describe, it is {{update table set col1 = X where col2 = 
>Y}}. This is something hudi cannot deal with directly at present, we can only 
>get all the data involved as a dataset first and then merge it.

I don't think we can avoid getting the dataset first, i.e. reading the older 
parquet file to merge the record. In fact, I would argue that Hudi uniquely 
lets you deal with the single-column-update scenario today, by allowing custom 
payloads to specify the merge logic: the base file can contain the entire 
record while the log contains just the updated column value, and we will be 
able to merge the two.
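A minimal sketch of that merge, assuming the base record holds all columns and the delta holds only the changed ones. This mirrors the idea behind Hudi's custom payload merging, but the map-based representation here is purely illustrative (real payloads operate on Avro records):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a partial-update merge: the base file record contains every
// column, the log record contains only the updated columns, and merging
// overlays the delta on top of the base. Plain maps are used here for
// illustration; an actual Hudi payload would do this on Avro records.
class PartialUpdateMergeSketch {
    static Map<String, Object> merge(Map<String, Object> base,
                                     Map<String, Object> delta) {
        Map<String, Object> merged = new HashMap<>(base); // full record
        merged.putAll(delta);                             // overlay changed columns
        return merged;
    }
}
```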

 

What we are missing is SQL support for merges, which we should build out under 
HUDI-1297's scope. Wdyt?



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-10-13 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213618#comment-17213618
 ] 

cdmikechen commented on HUDI-481:
-

Added a *relates to* link to HUDI-1341.



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-10-13 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213573#comment-17213573
 ] 

liwei commented on HUDI-481:


[~chenxiang] [~x1q1j1] Hi, I also have some plans around this in 
https://issues.apache.org/jira/browse/HUDI-1341. We can discuss it often :D



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-10-13 Thread Forward Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213522#comment-17213522
 ] 

Forward Xu commented on HUDI-481:
-

Hi [~chenxiang], there is a part of the syntax that needs to be extended; we 
can refine it further. Compare 
https://docs.delta.io/latest/delta-update.html#update-a-table.



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-10-13 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213502#comment-17213502
 ] 

cdmikechen commented on HUDI-481:
-

I have created a GitHub project at https://github.com/shangyuantech/hudi-sql. 
Later I can show some design ideas and usage scenarios in this project.





[jira] [Commented] (HUDI-481) Support SQL-like method

2020-01-11 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013665#comment-17013665
 ] 

Vinoth Chandar commented on HUDI-481:
-

I see.. so if the future 3.x versions will have it, it's fine, right? We can 
just build based off that?



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-01-09 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012328#comment-17012328
 ] 

cdmikechen commented on HUDI-481:
-

[~vinoth]

I opened Spark's GitHub again this morning and realized that yesterday I was 
looking at the master branch (Spark 3.0). When I switched to version 2.4, 
there were no *UPDATE* or *MERGE* keywords. This shows that Spark does not 
support these keywords in version 2.4, which may be a problem.



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-01-08 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011282#comment-17011282
 ] 

cdmikechen commented on HUDI-481:
-

[~vinoth]

Oh~ I've seen in 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4]
 that Spark can recognize these keywords.

I have built a project to try it out and see whether this is feasible first. I 
don't know yet whether it relates to DataSource V1 or V2. If I have a result, 
I will let you know as soon as possible.



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-01-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010275#comment-17010275
 ] 

Vinoth Chandar commented on HUDI-481:
-

Hi [~chenxiang], 
[https://github.com/apache/spark/blob/master/docs/sql-keywords.md] does list 
the DELETE and UPDATE keywords in the language itself.. I think it's up to the 
datasource to implement them. Can we consider this once we move to DataSource 
V2 first? Isn't that a pre-requisite for this?



[jira] [Commented] (HUDI-481) Support SQL-like method

2020-01-06 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009320#comment-17009320
 ] 

cdmikechen commented on HUDI-481:
-

[~vinoth]
I checked the Spark project. It seems the Spark SQL syntax tree only supports 
the *DELETE* keyword at present; *UPDATE* and *MERGE* are not supported yet. I 
think this may be because Spark is designed around relationships between 
datasets, so existing operators can solve similar problems, but not in a 
SQL-like way.
My current idea is to build a layer of SQL syntax on top of *hudi-core* and 
use antlr4 to parse the semantics. For example, an update statement can be 
parsed into first filtering the data according to the where conditions and 
then upserting the result into Hudi.
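That translation (update → filter by the where condition → rewrite the column → upsert) can be sketched on plain rows. The row and predicate types here are illustrative; a real implementation would parse the statement with antlr4 and run the write through Hudi's upsert path:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of executing `update table set col = value where <predicate>` as:
// filter the matching rows, apply the set clause to each, and collect the
// rewritten rows to hand to an upsert. Rows are plain maps for illustration;
// in a real flow the predicate would come from the parsed where clause.
class UpdateAsUpsertSketch {
    static List<Map<String, Object>> planUpsert(
            List<Map<String, Object>> rows,
            Predicate<Map<String, Object>> where,
            String setColumn, Object setValue) {
        List<Map<String, Object>> toUpsert = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (where.test(row)) {                    // where-condition filter
                Map<String, Object> updated = new HashMap<>(row);
                updated.put(setColumn, setValue);     // apply the set clause
                toUpsert.add(updated);                // row to upsert back
            }
        }
        return toUpsert;
    }
}
```

Rows that do not match the predicate are left untouched, which is what makes the upsert-based rewrite equivalent to the SQL update.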



[jira] [Commented] (HUDI-481) Support SQL-like method

2019-12-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005507#comment-17005507
 ] 

Vinoth Chandar commented on HUDI-481:
-

I am not sure if `CLI` is the right component for this. A few questions first, 
before I can triage this.. 

 * Is this intended to be a Spark API? We have thought about adding support in 
Spark SQL for specifying the merge logic, versus the HoodieRecordPayload 
interface.. This sounds similar. 
 * I think we need to move towards the Spark DataSource V2 API first.. and 
then rethink how this will fit in with HUDI-30.
