[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-05-10 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-835134388


   Hi @vinothchandar @umehrot2 , the PR has been updated, mainly with the following changes:
   - Make CTAS atomic.
   - Support using a timestamp-type column as the partition field (see the sketch after this list).
   - Fix the exception thrown when the partition column is not at the rightmost position of the select list for CTAS.
   - Add `TestSqlStatement`, which runs a sequence of statements; see [sql-statements.sql](https://github.com/apache/hudi/blob/171d607b1adc3972aa2c9e3efce5362368599d00/hudi-spark-datasource/hudi-spark/src/test/resources/sql-statements.sql).
   - Add more test cases for CTAS & partitioned tables.
   - Change `SparkSqlAdpater` to `SparkAdapter`.
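
   For illustration, a minimal sketch of the CTAS items above (table and column names are hypothetical), assuming a Spark session with the Hudi extension enabled:

   ```scala
   // Hypothetical example: `dt` is a timestamp-typed partition column, and after
   // this change it no longer has to be the last column of the select list.
   // If the query fails, the table is not left behind (atomic CTAS).
   spark.sql(
     """create table h0 using hudi
       |partitioned by (dt)
       |location 'file:///tmp/h0'
       |as select cast('2021-05-10 00:00:00' as timestamp) as dt, 1 as id, 'a1' as name
       |""".stripMargin)
   ```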
   
   For the other issues you mentioned above, I have filed a JIRA for each:
   - Support Truncate Command For Hoodie 
[1883](https://issues.apache.org/jira/browse/HUDI-1883)
   - Support Partial Update For MergeInto 
[1884](https://issues.apache.org/jira/browse/HUDI-1884)
   - Support Delete/Update Non-pk table 
[1885](https://issues.apache.org/jira/browse/HUDI-1885)
   
   After this first PR is merged, we can continue to address these JIRAs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-05-05 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-833192518


   Hi @vinothchandar , thanks for your work on the tests.
   
   - CREATE TABLE
   > Even if it fails, it ends up creating the table (i.e its not atomic per se)
   
   Yes, CTAS is not atomic currently. I can fix this later in this PR.
   
   > When selecting all columns (probably need more tests across data types)
   
   Yes, I will add more tests across data types.
   
   
   > Truncate table
   
   For `Truncate table`, we need to do some work on the Hudi side that is not covered in this PR. I will file another PR to address this.
   
   
   
   - MergeInto
   >1. Fails due to assignment field/schema mismatch
   
   Currently, merge into does not support partial updates; we must specify all the fields of the target table in the update set assignments.
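
   A hedged sketch of what this looks like in practice (table names are hypothetical; the `id/name/price/ts` columns reuse the schema from the earlier `create table t1` example): every target field appears in the update assignments.

   ```scala
   // Partial updates are not supported yet (tracked as HUDI-1884 elsewhere in
   // this thread), so the update clause lists every field of the target table.
   spark.sql(
     """merge into h0
       |using s0
       |on h0.id = s0.id
       |when matched then update set id = s0.id, name = s0.name, price = s0.price, ts = s0.ts
       |when not matched then insert (id, name, price, ts) values (s0.id, s0.name, s0.price, s0.ts)
       |""".stripMargin)
   ```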
   
   >2. Merges only allowed by PK
   
   Yes, this is currently a limitation of this PR's `merge into`, as we discussed in RFC-25. I think we can solve this in another PR.
   
   > 3. Merge not updating to new value
   
   This is the same issue as 1: currently we do not support partial updates.
   
   
   - Delete Table
   
   > Non PK based deletes are not working atm
   
   Currently we cannot delete from or update a non-pk Hudi table. For this case, we could use the `_hoodie_record_key` to identify a record and perform the delete & update. We can file a PR to support this.
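
   For context, a hedged sketch of what a `_hoodie_record_key`-based delete could look like once that follow-up lands (not supported in this PR; the table name is hypothetical):

   ```scala
   // The record key is encoded as 'columnName:value' (e.g. 'id:1'), so the
   // predicate targets the Hudi meta column instead of a primary key.
   spark.sql("delete from h0 where _hoodie_record_key = 'id:1'")
   ```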
   
   > Why do we have to encode column name into each record key? i.e. 
_hoodie_record_key = '1' vs being _hoodie_record_key = 'id:1'
   
   This is hoodie's original behavior for `_hoodie_record_key`.
   
   
   - Create or Replace table
   This is not supported in this PR. I will file a PR for it.
   
   
   - Create table, partitioned by
   > create table hudi_gh_ext using hudi partitioned by (type) location 
'file:///tmp/hudi-gh-ext' as select type, public, payload, repo, actor, org, 
id, other  from gh_raw]
   java.lang.AssertionError: assertion failed
   
   The partition column must be at the rightmost position of the SELECT list. This is a requirement of Spark SQL, so we should move `type` to the last select column, like this:
   > create table hudi_gh_ext using hudi partitioned by (type) location 
'file:///tmp/hudi-gh-ext' as select public, payload, repo, actor, org, id, 
other , type from gh_raw.
   
   I will add a translator that moves the partition columns to the rightmost position of the select list.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-05-05 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-833195752


   > @pengzhiwei2018 few suggestions/questions . could you please clarify?
   > 
   > * How does one pass any `HoodieWriteConfig` for a 
INSERT/UPDATE/DELETE/MERGE statement
   > * Setting user specified metadata with each commit. This is very important 
for interplay with deltastreamer etc.
   > * Do we support setting table properties? SET command.
   > * Can we add a functional test that does a sequence of statements and 
parameterize it for both COW/MOR? Want to ensure we can document the entire 
support matrix.
   > * Should we add more rigorous tests for  partitioned tables/data types?
   > 
   > Next steps for me is to run it at a larger scale on a cluster.
   
   
   Hi @vinothchandar 
   > How does one pass any `HoodieWriteConfig` for a INSERT/UPDATE/DELETE/MERGE 
statement
   
   Using the SET options, e.g. `set hoodie.insert.shuffle.parallelism = 4`.
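
   A short hedged sketch of the flow (the table name and values are hypothetical): the config is set first and picked up by the subsequent write statement.

   ```scala
   // The write config is passed via SET before the DML statement that triggers the write.
   spark.sql("set hoodie.insert.shuffle.parallelism = 4")
   spark.sql("insert into h0 values (1, 'a1', 10.0, 1000)")
   ```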
   
   > Do we support setting table properties? SET command.
   
   Yes, we support the SET command to set the write config and other runtime properties. For table properties such as the table type and table name, we can use the ALTER TABLE command.
   
   > Can we add a functional test that does a sequence of statements and 
parameterize it for both COW/MOR? Want to ensure we can document the entire 
support matrix.
   
   Yes, that is great. I will do this.
   
   > Should we add more rigorous tests for partitioned tables/data types?
   
   Of course, I will add more test cases to cover more data types.
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-28 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-828893245


   > @pengzhiwei2018 could we make the spark-shell experience better? I think 
we need the extensions added by default when the jar is pulled in?
   > 
   > ```scala
   > $ spark-shell --jars $HUDI_SPARK_BUNDLE --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   > 
   > scala> spark.sql("create table t1 (id int, name string, price double, ts 
long) using hudi options(primaryKey= 'id', preCombineField = 'ts')").show 
   > t, returning NoSuchObjectException
   > org.apache.hudi.exception.HoodieException: 'path' or 
'hoodie.datasource.read.paths' or both must be specified.
   >   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:77)
   >   at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
   >   at 
org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
   >   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   >   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   >   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
   >   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
   >   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
   >   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
   >   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
   >   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
   >   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
   >   at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   >   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
   >   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
   >   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
   >   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
   >   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
   >   at 
org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:606)
   >   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
   >   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
   > ```
   
   Hi @vinothchandar , you can test this with the following commands:
   
   - Using spark-sql
   
   > spark-sql --jars $HUDI_SPARK_BUNDLE \\
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \\
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   
   - Using spark-shell
   
   > spark-shell --jars $HUDI_SPARK_BUNDLE \\
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \\
   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   
   
   Just set `spark.sql.extensions` to `org.apache.spark.sql.hudi.HoodieSparkSessionExtension`.
   IMO this conf is just like `spark.serializer`: it has to be specified when the `SparkSession` is created, so it is hard to set it automatically when the Hudi jar is installed.
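
   For completeness, a hedged sketch of supplying the same configs programmatically when the session is built (the app name is hypothetical):

   ```scala
   import org.apache.spark.sql.SparkSession

   // The extension must be registered when the SparkSession is created, which is
   // why it cannot be injected automatically just by adding the Hudi bundle jar.
   val spark = SparkSession.builder()
     .appName("hudi-sql-demo")
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
     .getOrCreate()
   ```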
   Thanks~
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-22 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-824759341


   > @pengzhiwei2018 can we file followups from this review as sub tasks under 
the same umbrella JIRA?
   > 
   > I spent sometime looking at snowflake and bigquery and what kind of 
experience users have there writing data out.
   > Here are my recommendations (mostly borrowing from ANSI SQL)
   > 
   > * [x]  We can support `PRIMARY KEY(col1, col2,..)` definition, if no PK is 
specified we will generate a synthetic key or have it be null.
   > * [ ]  Multi table inserts. `INSERT ALL WHEN condition1 INTO t1 WHEN 
condition2 into t2`
   > * [x]  Update statement `UPDATE t1 SET t1.a = t2.b + 1  FROM t2 WHERE 
condition`
   > * [x]  Merge into statement with matched and not matched clauses.
   > * [x]  Delete from statement
   > * [ ]  Copy INTO statement that integrates with Hudi bootstrap 
functionality
   > * [ ]  CREATE table with support for unique constraint check.
   > * [ ]  ALTER table statement to alter schema constraints.
   > * [ ]  CREATE table with `CLUSTER BY(col1, col2)`
   > * [ ]  CREATE INDEX for adding indexes (future, as we complete RFC-08,27)
   > * [ ]  CREATE table with `FOREIGN KEY`, `DATABASE, SCHEMA` (future plans, 
needs multi table txns + our metaserver)
   > * [ ]  Expose all Hudi table services (cleaning, compaction, clustering, 
.. ) using a `CALL cleaner ` kind of syntax. Over time we can 
expose more standard functions there.  For e.g more advanced compaction and 
clustering strategies call be specified there. We may need a `SHOW services t1` 
to show information for these scheduled calls.
   > 
   > Checked off items I think are already covered in this PR. If not, please 
raise JIRA subtasks for these as well.
   
   That is great! I will file a JIRA for each item that is not covered in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-04-14 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-815583131


   Hi @vinothchandar @kwondw , thanks for the review on this feature. The code has been updated.
   Main changes:
   - SQL support for Spark 3, based on the same `HoodieAnalysis` and `Commands` as Spark 2. The Spark 3 test cases can be run with the following commands:
   `mvn clean install -DskipTests -Pspark3`
   `mvn test -Punit-tests -Pspark3 -pl hudi-spark-datasource/hudi-spark`
   - Fix the bug where MergeInto did not work when the source column name differs from the target table column name, e.g. `merge into ... on t0.id = s0.s_id` (see the sketch after this list).
   - Support expressions on the source columns in the merge-on condition, e.g. `merge into ... on t0.id = s0.id + 1 ...` (also noted in the sketch below).
   - Add more test cases in `TestMergeInto` & `TestUpdate` & `TestDelete` & `TestCreateTable`.
   - Remove the `tableSchema`; use `writeSchema` instead.
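
   A hedged sketch of the first merge fix above (table and column names are hypothetical):

   ```scala
   // The source key column (s_id) is named differently from the target key (id);
   // this case previously failed and now resolves correctly.
   spark.sql(
     """merge into t0
       |using s0
       |on t0.id = s0.s_id
       |when matched then update set id = s0.s_id, name = s0.name
       |when not matched then insert (id, name) values (s0.s_id, s0.name)
       |""".stripMargin)

   // The second item additionally allows an expression on the source columns in
   // the merge-on condition, e.g. `on t0.id = s0.id + 1`.
   ```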
   
   Please take another look when you have time, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806465222


   > Let me see how/if we can simplify the inputSchema vs writeSchema thing.
   > 
   > I went over the PR now. LGTM at a high level.
   > Few questions though
   > 
   > * I see we are introducing some antlr parsing and inject a custom parse 
for spark 2.x. Is this done for backwards compat with Spark 2 and will be 
eventually removed?
   > * Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 
2 syntax different. Can you comment on how we are approaching all this.
   > * Have you done any production testing of this PR?
   > 
   > cc @kwondw could you also please chime in. We would like to land something 
basic and iterate and get this out for 0.9.0 next month.
   
   Thanks for your review @vinothchandar !
   
   > I see we are introducing some antlr parsing and inject a custom parse for 
spark 2.x. Is this done for backwards compat with Spark 2 and will be 
eventually removed?
   
   Yes, it is for backward compatibility with Spark 2 and will eventually be removed, provided there are no other syntax extensions needed for Spark 3.
   
   > Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 2 
syntax different. Can you comment on how we are approaching all this.
   
   Yes, I reused the extended syntax (MERGE) from Spark 3, so the syntax is the same between Spark 2 and Spark 3.
   For Spark 3, Spark can recognize the MERGE/DELETE syntax and parse it to a LogicalPlan. For Spark 2, our extended SQL parser will also parse it to the same LogicalPlan. After parsing, the LogicalPlan goes through the same rules (in `HoodieAnalysis`) to be resolved and rewritten into a Hoodie Command, which translates the logical plan into the Hudi API calls. The Hoodie Commands are shared between Spark 2 & Spark 3.
   So except for the SQL parser for Spark 2, all other parts are shared between Spark 2 & Spark 3.
   
   > Have you done any production testing of this PR?
   
   Yes, I have tested it on Aliyun's EMR cluster, and more test cases will be run this week.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-24 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-803355840


   > Getting started on this. Sorry for the delay.
   > 
   > How important are the changes around writeSchema vs inputSchema and such 
changes to the SQL implementation?
   
   Hi @vinothchandar , thanks for your review.
   It is necessary to introduce `inputSchema` & `tableSchema` to replace the original `writeSchema` for MergeInto.
   For example:
   For example:
   
   ```
   Merge Into h0 using (
     select id, name, flag from s) as s0
   on s0.id = h0.id
   when matched and flag = 'u' then update set id = s0.id, name = s0.name
   when not matched then insert (id, name) values (s0.id, s0.name)
   ```
   
   The input is `"select id, name, flag from s"`, whose schema is `(id, name, flag)`. But the record written to the table is `(id, name)` after the update assignments are applied, so the input schema is not equal to the write schema and the original `writeSchema` cannot cover this scenario.
   I introduce `inputSchema` & `tableSchema` to solve this problem: the `inputSchema` is used to parse the incoming records, and the `tableSchema` is used to write & read records from the table.
   In most cases except MergeInto, the `inputSchema` is the same as the `tableSchema`, so it should not affect the original logic, IMO.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org