[GitHub] [hudi] pengzhiwei2018 commented on pull request #3393: [HUDI-1842] Spark Sql Support For The Exists Hoodie Table

2021-08-06 Thread GitBox


pengzhiwei2018 commented on pull request #3393:
URL: https://github.com/apache/hudi/pull/3393#issuecomment-894607773


   @hudi-bot run azure
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on pull request #3393: [HUDI-1842] Spark Sql Support For The Exists Hoodie Table

2021-08-06 Thread GitBox


pengzhiwei2018 commented on pull request #3393:
URL: https://github.com/apache/hudi/pull/3393#issuecomment-894603173


   @hudi-bot run azure






[GitHub] [hudi] pengzhiwei2018 commented on pull request #3393: [HUDI-1842] Spark Sql Support For The Exists Hoodie Table

2021-08-04 Thread GitBox


pengzhiwei2018 commented on pull request #3393:
URL: https://github.com/apache/hudi/pull/3393#issuecomment-893180101


   > > > > Hey peng. I did a round of testing on this patch. Here are my findings.
   > > > > Insert into is still prefixing the col name to meta fields (3rd col and 4th col).
   > > > > ```
   > > > > select * from hudi_ny where tpep_pickup_datetime like '%00:04:03%';
   > > > > 20210802105420   20210802105420_2_23 2019-01-01 00:04:03 2019-01-01  c5e6a617-dfc5-4051-8c1a-8daead3847af-0_2-37-62_20210802105420.parquet   2   2019-01-01 00:04:03 2019-01-01 00:11:48 1   3.011   N   137 262 1   10.0 0.5 0.5 2.26 0.0 0.3 13.56   NULL 2019-01-01
   > > > > 20210803162030   20210803162030_0_1  tpep_pickup_datetime:2021-01-01 00:04:03 date_col=2021-01-01 c5c72f9e-9a63-48ca-a981-4302890f5210-0_0-27-1635_20210803162030.parquet 2   2021-01-01 00:04:03 2021-01-01 00:11:48 1   3.011   N   137 262 10.0 0.5 0.5 2.26 0.0 0.3 13.56   NULL 2021-01-01
   > > > > Time taken: 0.524 seconds, Fetched 2 row(s)
   > > > > ```
   > > > > 1st row was part of the table before onboarding to spark-sql.
   > > > > 2nd row was inserted using insert into.
   > > > > Hi @nsivabalan , I see the difference now. Spark SQL uses `SqlKeyGenerator`, a subclass of `ComplexKeyGenerator`, to generate the record key, and it prefixes the column name to the key value, while `SimpleKeyGenerator` does not. So we should keep the behavior consistent between `ComplexKeyGenerator` and `SimpleKeyGenerator`.
   > > > 
   > > > 
   > > > sorry, I don't follow. I understand SqlKeyGenerator extends ComplexKeyGen, but why do we need to keep the same behavior for SimpleKeyGen? We should not add any field prefix for SimpleKeyGen; otherwise no updates will work against an existing table.
   > > 
   > > 
   > > Hi @nsivabalan , I have fixed the record-key mismatch issue. Please test it again~
   > 
   > @pengzhiwei2018 I tested the patch. I can see the column names are no longer being prefixed. Updates and deletes by record key are working fine now. However, the URI encoding of the partition path is still an issue. For example, I did an insert into an existing partition. The insert was successful, but it created a new partition as below:
   > 
   > ```
   > insert into hudi_trips_cow values(1.0, 2.0, "driver_2", 3.0, 4.0, 100.0, "rider_2", 12345, "765544i-e89b-12d3-a456-42665544", "americas/united_states/san_francisco/");
   > 
   > % ls -l /private/tmp/hudi_trips_cow
   > total 0
   > drwxr-xr-x  4 sagars  wheel  128 Aug  4 16:49 americas
   > drwxr-xr-x  6 sagars  wheel  192 Aug  4 16:50 americas%2Funited_states%2Fsan_francisco%2F
   > drwxr-xr-x  3 sagars  wheel   96 Aug  4 16:49 asia
   > ```
   
   Hi @codope , can you drop the table and create it again with the latest code of this patch? I am afraid this happened because you created the table with the old code of the patch, which URL-encoded the whole partition value (turning each `/` into `%2F`). I have fixed this issue in the latest code; a small sketch of the encoding behavior is below.
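
   A minimal sketch of the suspected encoding behavior, using plain `java.net.URLEncoder` semantics rather than Hudi's actual key-generator code (the object and variable names are illustrative):

   ```scala
   import java.net.URLEncoder
   import java.nio.charset.StandardCharsets

   object PartitionPathEncodingSketch {
     def main(args: Array[String]): Unit = {
       val partitionPath = "americas/united_states/san_francisco/"

       // Encoding the whole value at once also encodes the '/' separators,
       // which yields one flat directory name instead of a nested path.
       val wholeValue = URLEncoder.encode(partitionPath, StandardCharsets.UTF_8.name())
       println(wholeValue) // americas%2Funited_states%2Fsan_francisco%2F

       // Encoding each segment separately keeps the partition hierarchy intact.
       val perSegment = partitionPath.split("/")
         .map(seg => URLEncoder.encode(seg, StandardCharsets.UTF_8.name()))
         .mkString("/")
       println(perSegment) // americas/united_states/san_francisco
     }
   }
   ```

   The `americas%2Funited_states%2Fsan_francisco%2F` entry in the `ls` output above matches the whole-value case.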






[GitHub] [hudi] pengzhiwei2018 commented on pull request #3393: [HUDI-1842] Spark Sql Support For The Exists Hoodie Table

2021-08-04 Thread GitBox


pengzhiwei2018 commented on pull request #3393:
URL: https://github.com/apache/hudi/pull/3393#issuecomment-892526491


   > > Hey peng. I did a round of testing on this patch. Here are my findings.
   > > Insert into is still prefixing the col name to meta fields (3rd col and 4th col).
   > > ```
   > > select * from hudi_ny where tpep_pickup_datetime like '%00:04:03%';
   > > 20210802105420   20210802105420_2_23 2019-01-01 00:04:03 2019-01-01  c5e6a617-dfc5-4051-8c1a-8daead3847af-0_2-37-62_20210802105420.parquet   2   2019-01-01 00:04:03 2019-01-01 00:11:48 1   3.011   N   137 262 1   10.0 0.5 0.5 2.26 0.0 0.3 13.56   NULL 2019-01-01
   > > 20210803162030   20210803162030_0_1  tpep_pickup_datetime:2021-01-01 00:04:03 date_col=2021-01-01 c5c72f9e-9a63-48ca-a981-4302890f5210-0_0-27-1635_20210803162030.parquet 2   2021-01-01 00:04:03 2021-01-01 00:11:48 1   3.011   N   137 262 10.0 0.5 0.5 2.26 0.0 0.3 13.56   NULL 2021-01-01
   > > Time taken: 0.524 seconds, Fetched 2 row(s)
   > > ```
   > > 1st row was part of the table before onboarding to spark-sql.
   > > 2nd row was inserted using insert into.
   > > Hi @nsivabalan , I see the difference now. Spark SQL uses `SqlKeyGenerator`, a subclass of `ComplexKeyGenerator`, to generate the record key, and it prefixes the column name to the key value, while `SimpleKeyGenerator` does not. So we should keep the behavior consistent between `ComplexKeyGenerator` and `SimpleKeyGenerator`.
   > 
   > sorry, I don't follow. I understand SqlKeyGenerator extends ComplexKeyGen, but why do we need to keep the same behavior for SimpleKeyGen? We should not add any field prefix for SimpleKeyGen; otherwise no updates will work against an existing table.
   
   Hi @nsivabalan , I have fixed the record-key mismatch issue. Please test it again~ A minimal sketch of the two key styles is below.
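
   For context, a sketch of the two key styles under discussion; it mimics the formats shown in the query output above and is illustrative, not Hudi's actual `KeyGenerator` code:

   ```scala
   // Illustrative only: reproduces the key formats from the discussion above.
   object RecordKeyStyleSketch {
     // Complex-style keys prefix every value with its field name.
     def complexStyleKey(fields: Seq[(String, String)]): String =
       fields.map { case (name, value) => s"$name:$value" }.mkString(",")

     // Simple-style keys are just the raw field value.
     def simpleStyleKey(value: String): String = value

     def main(args: Array[String]): Unit = {
       val pickup = "2021-01-01 00:04:03"

       // Key the SQL write path generated before the fix:
       println(complexStyleKey(Seq("tpep_pickup_datetime" -> pickup)))
       // -> tpep_pickup_datetime:2021-01-01 00:04:03

       // Key the pre-existing table was written with:
       println(simpleStyleKey(pickup))
       // -> 2021-01-01 00:04:03
     }
   }
   ```

   Since upserts match on `_hoodie_record_key`, the two styles never compare equal, so an update against a table written with unprefixed keys lands as a new record instead.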

