[GitHub] [incubator-hudi] lamber-ken commented on issue #1334: [HUDI-612] Fix return no data when using incremental query

2020-02-13 Thread GitBox
lamber-ken commented on issue #1334: [HUDI-612] Fix return no data when using 
incremental query
URL: https://github.com/apache/incubator-hudi/pull/1334#issuecomment-586136740
 
 
   hi @vinothchandar, we hit this problem in our staging env; we want to fetch 
the "incremental" data when specifying a specific time instant.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (HUDI-612) Fix return no data when using incremental query

2020-02-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-612.
---
Resolution: Not A Bug

Sorry, I understood "incremental" in the wrong way.

BTW, how does a user query data for one specific time instant? For example:
{code:java}
spark.
  read.
  format("org.apache.hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20200214083611").
  option("hoodie.datasource.read.end.instanttime", "20200214083611").
  load(basePath).
  createOrReplaceTempView("hudi_trips_incremental")

spark.sql("select * from hudi_trips_incremental").show(50)
{code}
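One hedged workaround sketch, assuming the semantics implied above (the begin instant bound is exclusive, the end instant bound is inclusive, per the GREATER / LESSER_OR_EQUAL comparison discussed in the PR): pass a begin instant strictly earlier than the commit of interest.
{code:java}
// Point-in-time style query (illustrative sketch; the begin value below is
// hypothetical, any timestamp strictly before the target commit would do).
spark.
  read.
  format("org.apache.hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20200214083610").
  option("hoodie.datasource.read.end.instanttime", "20200214083611").
  load(basePath).
  createOrReplaceTempView("hudi_trips_point_in_time")

spark.sql("select * from hudi_trips_point_in_time").show(50)
{code}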

> Fix return no data when using incremental query 
> 
>
> Key: HUDI-612
> URL: https://issues.apache.org/jira/browse/HUDI-612
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If the value of "hoodie.datasource.read.begin.instanttime" equals 
> "hoodie.datasource.read.end.instanttime", the query will return no data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1334: [HUDI-612] Fix return no data when using incremental query

2020-02-13 Thread GitBox
lamber-ken commented on issue #1334: [HUDI-612] Fix return no data when using 
incremental query
URL: https://github.com/apache/incubator-hudi/pull/1334#issuecomment-586135378
 
 
   Right, I understood "incremental" in the wrong way. 
   BTW, how does a user query data for one specific time instant? For example: 
   ```scala
   spark.
     read.
     format("org.apache.hudi").
     option("hoodie.datasource.query.type", "incremental").
     option("hoodie.datasource.read.begin.instanttime", "20200214083611").
     option("hoodie.datasource.read.end.instanttime", "20200214083611").
     load(basePath).
     createOrReplaceTempView("hudi_trips_incremental")
   
   spark.sql("select * from hudi_trips_incremental").show(50)
   ```
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1334: [HUDI-612] Fix return no data when using incremental query

2020-02-13 Thread GitBox
vinothchandar commented on issue #1334: [HUDI-612] Fix return no data when 
using incremental query
URL: https://github.com/apache/incubator-hudi/pull/1334#issuecomment-586133592
 
 
   can we also close the JIRA? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken closed pull request #1334: [HUDI-612] Fix return no data when using incremental query

2020-02-13 Thread GitBox
lamber-ken closed pull request #1334: [HUDI-612] Fix return no data when using 
incremental query
URL: https://github.com/apache/incubator-hudi/pull/1334
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-407) Implement a join-based index

2020-02-13 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036719#comment-17036719
 ] 

Vinoth Chandar commented on HUDI-407:
-

okay sounds good 

> Implement a join-based index
> 
>
> Key: HUDI-407
> URL: https://issues.apache.org/jira/browse/HUDI-407
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie, Performance
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bwu2 commented on issue #1328: Hudi upsert hangs

2020-02-13 Thread GitBox
bwu2 commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-586108720
 
 
   Ok, after reading https://github.com/apache/incubator-hudi/issues/800 (which 
is similar), I think this is a memory issue (even though GC seems to be a small 
proportion of the total time according to the Spark history server, and we're 
not getting any OOME).
   
   When I double `spark.executor.memory`, the job runs much faster. I'm sure 
there are other optimizations and/or tuning options I could also consider.
   
   Is it normal/expected that an upsert with mostly updates would require more 
memory than an upsert with mostly inserts?
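   
   For reference, a minimal tuning sketch under the assumptions in this thread 
(update-heavy upserts go through Hudi's merge path, which buffers incoming 
records in a spillable map). The option keys are standard Hudi/Spark configs; 
the values, field names, and paths are illustrative only:
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   // Hedged sketch: memory-related knobs for an update-heavy upsert.
   // spark.executor.memory normally must be set at submit time, before
   // executors launch; it appears here only to mirror the doubling above.
   val spark = SparkSession.builder()
     .appName("hudi-upsert-memory-tuning")
     .config("spark.executor.memory", "8g")
     .getOrCreate()
   
   val inputDF = spark.read.json("/path/to/input")  // hypothetical input
   
   inputDF.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "json_data")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "uuid")  // hypothetical field
     .option("hoodie.datasource.write.precombine.field", "ts")   // hypothetical field
     .option("hoodie.upsert.shuffle.parallelism", "200")
     // budget (in bytes) for the spillable map Hudi uses while merging updates
     .option("hoodie.memory.merge.max.size", (1024L * 1024 * 1024).toString)
     .mode(SaveMode.Append)
     .save("/path/to/hudi_table")  // hypothetical path
   ```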


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services




[incubator-hudi] branch hudi_test_suite_refactor updated (67fdda3 -> 6fbd285)

2020-02-13 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 67fdda3  [MINOR] Fix compile error after rebasing the branch
 add 6fbd285  [MINOR] Fix compile error after rebasing the branch

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (67fdda3)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (6fbd285)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 hudi-test-suite/README.md                | 2 +-
 packaging/hudi-test-suite-bundle/pom.xml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)



[jira] [Updated] (HUDI-612) Fix return no data when using incremental query

2020-02-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-612:

Labels: pull-request-available  (was: )

> Fix return no data when using incremental query 
> 
>
> Key: HUDI-612
> URL: https://issues.apache.org/jira/browse/HUDI-612
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>
> If the value of "hoodie.datasource.read.begin.instanttime" equals 
> "hoodie.datasource.read.end.instanttime", the query will return no data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1334: [HUDI-612] Fix return no data when using incremental query

2020-02-13 Thread GitBox
lamber-ken opened a new pull request #1334: [HUDI-612] Fix return no data when 
using incremental query
URL: https://github.com/apache/incubator-hudi/pull/1334
 
 
   ## What is the purpose of the pull request
   
   When using an incremental query, specifying a single time instant (begin 
instant equal to end instant) returns no data.
   
   ## Brief change log
   
 - Change `GREATER` to `GREATER_OR_EQUAL` like `LESSER_OR_EQUAL` does.
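   
   For context, a minimal sketch of the range predicate under discussion 
(illustrative only, not the actual Hudi source; instant timestamps compare 
lexicographically as strings):
   ```scala
   // Current behaviour: begin bound exclusive (GREATER), end bound inclusive
   // (LESSER_OR_EQUAL), so beginTs == endTs selects no instants.
   def isInRange(instantTs: String, beginTs: String, endTs: String): Boolean =
     instantTs > beginTs && instantTs <= endTs
   
   // Behaviour this PR proposed (begin bound inclusive as well):
   def isInRangeProposed(instantTs: String, beginTs: String, endTs: String): Boolean =
     instantTs >= beginTs && instantTs <= endTs
   
   // With the values from the report, the current predicate matches nothing:
   assert(!isInRange("20200214083611", "20200214083611", "20200214083611"))
   assert(isInRangeProposed("20200214083611", "20200214083611", "20200214083611"))
   ```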
   
   ## Verify this pull request
   
   This pull request is already covered by 
`TestHoodieActiveTimeline#testTimelineOperations`.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-612) Fix return no data when using incremental query

2020-02-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-612:

Status: Open  (was: New)

> Fix return no data when using incremental query 
> 
>
> Key: HUDI-612
> URL: https://issues.apache.org/jira/browse/HUDI-612
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> If the value of "hoodie.datasource.read.begin.instanttime" equals 
> "hoodie.datasource.read.end.instanttime", the query will return no data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-612) Fix return no data when using incremental query

2020-02-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-612:
---

Assignee: lamber-ken

> Fix return no data when using incremental query 
> 
>
> Key: HUDI-612
> URL: https://issues.apache.org/jira/browse/HUDI-612
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> If the value of "hoodie.datasource.read.begin.instanttime" equals 
> "hoodie.datasource.read.end.instanttime", the query will return no data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-612) Fix return no data when using incremental query

2020-02-13 Thread lamber-ken (Jira)
lamber-ken created HUDI-612:
---

 Summary: Fix return no data when using incremental query 
 Key: HUDI-612
 URL: https://issues.apache.org/jira/browse/HUDI-612
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Common Core
Reporter: lamber-ken


If the value of "hoodie.datasource.read.begin.instanttime" equals 
"hoodie.datasource.read.end.instanttime", the query will return no data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #188

2020-02-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.29 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 'HUDI_home= 0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an 

[GitHub] [incubator-hudi] bwu2 edited a comment on issue #1328: Hudi upsert hangs

2020-02-13 Thread GitBox
bwu2 edited a comment on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-586071719
 
 
   Ok, thanks for this. 
   
   I have run the jobs again. First, insert 4m records, then upsert 3m of them, 
then upsert 4m, then upsert 3m. The two jobs upserting 3m records work fine and 
quickly, but the one upserting 4m takes >200 times as long. There is no 
partitioning and only one (small) output file. 
   
   My results (from a synthetic dataset) are:
   ```bash
   hudi:json_data->commits show --limit 4
   
   ╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
   ║ CommitTime     │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
   ╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
   ║ 20200214013937 │ 25.5 MB             │ 0                 │ 1                   │ 1                        │ 400                   │ 300                          │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200213224532 │ 25.5 MB             │ 0                 │ 1                   │ 1                        │ 400                   │ 400                          │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200213224325 │ 25.6 MB             │ 0                 │ 1                   │ 1                        │ 400                   │ 300                          │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200213224218 │ 25.5 MB             │ 1                 │ 0                   │ 1                        │ 400                   │ 0                            │ 0            ║
   ╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝
   ```
   
   and the times:
   ```bash
   grep -n -e totalCreateTime -e totalUpsertTime  *.commit
   20200213224218.commit:36:  "totalCreateTime" : 30012,
   20200213224218.commit:37:  "totalUpsertTime" : 0,
   20200213224325.commit:36:  "totalCreateTime" : 0,
   20200213224325.commit:37:  "totalUpsertTime" : 46879,
   20200213224532.commit:36:  "totalCreateTime" : 0,
   20200213224532.commit:37:  "totalUpsertTime" : 10347280,
   20200214013937.commit:36:  "totalCreateTime" : 0,
   20200214013937.commit:37:  "totalUpsertTime" : 44598,
   ```
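   
   Assuming those times are milliseconds (a hedged reading of the commit 
metadata), 10,347,280 ms is roughly 2.9 hours, versus 46,879 ms (~47 s) and 
44,598 ms (~45 s) for the fast upserts, i.e. about 220x, in line with the >200x 
above.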


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-407) Implement a join-based index

2020-02-13 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036652#comment-17036652
 ] 

sivabalan narayanan edited comment on HUDI-407 at 2/14/20 2:49 AM:
---

[~vinoth] actually yes. Will put up something by this weekend. I forgot about 
this patch. 


was (Author: shivnarayan):
actually yes. Will put up something by this weekend. I forgot about this patch. 

> Implement a join-based index
> 
>
> Key: HUDI-407
> URL: https://issues.apache.org/jira/browse/HUDI-407
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie, Performance
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)






[jira] [Commented] (HUDI-407) Implement a join-based index

2020-02-13 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036652#comment-17036652
 ] 

sivabalan narayanan commented on HUDI-407:
--

actually yes. Will put up something by this weekend. I forgot about this patch. 

> Implement a join-based index
> 
>
> Key: HUDI-407
> URL: https://issues.apache.org/jira/browse/HUDI-407
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, newbie, Performance
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bwu2 commented on issue #1328: Hudi upsert hangs

2020-02-13 Thread GitBox
bwu2 commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-586071719
 
 
   Ok, thanks for this. 
   
   I have run the jobs again. First, insert 4m records, then upsert 3m of them, 
then upsert 4m, then upsert 3m. The two jobs upserting 3m records work fine and 
quickly, but the one upserting 4m takes >200 times as long. 
   
   My results (from a synthetic dataset) are:
   ```bash
   hudi:json_data->commits show --limit 4
   
   ╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
   ║ CommitTime     │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
   ╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
   ║ 20200214013937 │ 25.5 MB             │ 0                 │ 1                   │ 1                        │ 400                   │ 300                          │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200213224532 │ 25.5 MB             │ 0                 │ 1                   │ 1                        │ 400                   │ 400                          │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200213224325 │ 25.6 MB             │ 0                 │ 1                   │ 1                        │ 400                   │ 300                          │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200213224218 │ 25.5 MB             │ 1                 │ 0                   │ 1                        │ 400                   │ 0                            │ 0            ║
   ╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝
   ```
   
   and the times:
   ```bash
   grep -n -e totalCreateTime -e totalUpsertTime  *.commit
   20200213224218.commit:36:  "totalCreateTime" : 30012,
   20200213224218.commit:37:  "totalUpsertTime" : 0,
   20200213224325.commit:36:  "totalCreateTime" : 0,
   20200213224325.commit:37:  "totalUpsertTime" : 46879,
   20200213224532.commit:36:  "totalCreateTime" : 0,
   20200213224532.commit:37:  "totalUpsertTime" : 10347280,
   20200214013937.commit:36:  "totalCreateTime" : 0,
   20200214013937.commit:37:  "totalUpsertTime" : 44598,
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-13 Thread GitBox
vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to 
support long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-586071634
 
 
   > About real tests, do you mean how to validate all the functions of hudi 
via this test suite?
   
   Yes, not all, but a good amount. The real value is in the actual tests, 
right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226110
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
 
 Review comment:
   own parquet reader


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226403
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
+However, for MERGE_ON_READ tables which has both parquet and avro data, this 
default setting needs to be turned off using set 
`spark.sql.hive.convertMetastoreParquet=false`. 
+This will force Spark to fallback to using the Hive Serde to read the data 
(planning/executions is still Spark). 
 
 ```java
-Dataset hoodieROViewDF = spark.read().format("org.apache.hudi")
-// pass any path glob, can include hudi & non-hudi tables
-.load("/glob/path/pattern");
+$ spark-shell --driver-class-path /etc/hive/conf  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 
--driver-memory 7g --executor-memory 2g  --master yarn-client
+
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = 
'2016-10-02'").show()
+scala> sqlContext.sql("select count(*) from hudi_trips_mor_rt where datestr = 
'2016-10-02'").show()
 ```
- 
-### Snapshot query {#spark-snapshot-query}
-Currently, near-real time data can only be queried as a Hive table in Spark 
using snapshot query mode. In order to do this, set 
`spark.sql.hive.convertMetastoreParquet=false`, forcing Spark to fallback 
-to using the Hive Serde to read the data (planning/executions is still Spark). 
 
-```java
-$ spark-shell --jars hudi-spark-bundle_2.11-x.y.z-SNAPSHOT.jar 
--driver-class-path /etc/hive/conf  --packages 
org.apache.spark:spark-avro_2.11:2.4.4 --conf 
spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 
7g --executor-memory 2g  --master yarn-client
+For COPY_ON_WRITE tables, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
 
 Review comment:
   `turning off convertMetastoreParquet` please use the exact and full config 
here within ``


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225601
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
 
 Review comment:
   use `--jars` `--packages`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226038
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
 
 Review comment:
   lets also spend some time setting context and explaining how this uses 
Spark/Hive integration


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225202
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
 
 Review comment:
   please point to quick start or some example for this


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226205
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
+When using spark shell instead of --jars, --packages can also be used to fetch 
the hudi-spark-bundle like this: `--packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating`
+For sample setup, refer to [Setup spark-shell in 
quickstart](/docs/quick-start-guide.html#setup-spark-shell).
 
- - **Hudi DataSource** : Supports Read Optimized, Incremental Pulls similar to 
how standard datasources (e.g: `spark.read.parquet`) work.
- - **Read as Hive tables** : Supports all three query types, including the 
snapshot queries, relying on the custom Hudi input formats again like Hive.
- 
- In general, your spark job needs a dependency to `hudi-spark` or 
`hudi-spark-bundle_2.*-x.y.z.jar` needs to be on the class path of driver & 
executors (hint: use `--jars` argument)
+## Spark SQL
+Supports all query types across both Hudi table types, relying on the custom 
Hudi input formats again like Hive. 
+Typically notebook users and spark-shell users leverage spark sql for querying 
Hudi tables. Please add hudi-spark-bundle 
+as described above via --jars or --packages.
  
-### Read optimized query
-
-Pushing a path filter into sparkContext as follows allows for read optimized 
querying of a Hudi hive table using SparkSQL. 
-This method retains Spark built-in optimizations for reading Parquet files 
like vectorized reading on Hudi tables.
-
-```scala
-spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
 classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], 
classOf[org.apache.hadoop.fs.PathFilter]);
-```
-
-If you prefer to glob paths on DFS via the datasource, you can simply do 
something like below to get a Spark dataframe to work with. 
+### Snapshot query {#spark-snapshot-query}
+By default, Spark SQL will try to use its own parquet support instead of Hive 
SerDe when reading from Hive metastore parquet tables. 
 
 Review comment:
   `By default` : are you talking about copy_on_write tables? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226624
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -145,8 +161,13 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
+### Read optimized query
+
+For read optimized queries, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
+If using spark's built in support, additionally a path filter needs to be 
pushed into sparkContext as described earlier.
 
 ## Presto
 
-Presto is a popular query engine, providing interactive query performance. 
Presto currently supports only read optimized queries on Hudi tables. 
-This requires the `hudi-presto-bundle` jar to be placed into 
`/plugin/hive-hadoop2/`, across the installation.
+Presto is a popular query engine, providing interactive query performance. 
Presto currently supports snapshot queries on
+COPY_On_WRITE and read optimized queries on MERGE_ON_READ Hudi tables. This 
requires the `hudi-presto-bundle` jar 
 
 Review comment:
   COPY_ON_WRITE: typo


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379226552
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -145,8 +161,13 @@ Additionally, `HoodieReadClient` offers the following 
functionality using Hudi's
 | filterExists() | Filter out already existing records from the provided 
RDD[HoodieRecord]. Useful for de-duplication |
 | checkExists(keys) | Check if the provided keys exist in a Hudi table |
 
+### Read optimized query
+
+For read optimized queries, either Hive SerDe can be used by turning off 
convertMetastoreParquet as described above or Spark's built in support can be 
leveraged. 
 
 Review comment:
   Remove this section ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225899
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
 Review comment:
   For this section, can you take another stab? It feels short and curt. Maybe 
set some context on things like: Spark datasources directly query underlying 
DFS data without Hive (and for this reason I think we should move `SparkSQL` up 
and place it before this section). 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379225479
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
-Spark provides much easier deployment & management of Hudi jars and bundles 
into jobs/notebooks. At a high level, there are two ways to access Hudi tables 
in Spark.
+Hudi COPY_ON_WRITE tables can be queried via Spark datasource similar to how 
standard datasources work (e.g: `spark.read.parquet`). 
+Both snapshot querying and incremental querying are supported here. Typically 
spark jobs require adding `--jars /hudi-spark-bundle_2.11:0.5.1-incubating`
+to classpath of drivers and executors. Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. 
 
 Review comment:
   can we remove this line ` Refer [building 
Hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)
 for build instructions. ` .. you don't have to build it yourself per se.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (3063cd7 -> 67fdda3)

2020-02-13 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 3063cd7  [MINOR] Fix compile error after rebasing the branch
 discard e064262  [HUDI-592] Remove duplicated dependencies in the pom file of 
test suite module
 discard 1f64ca0  [HUDI-503] Add hudi test suite documentation into the README 
file of the test suite module (#1191)
 discard 4eafca9  [MINOR] Fix TestHoodieTestSuiteJob#testComplexDag failure
 discard e175677  [HUDI-441] Rename WorkflowDagGenerator and some class names 
in test package
 discard aeac933  Fixed resource leak in HiveTestService about hive meta store
 discard 1136002  [HUDI-591] Support Spark version upgrade
 discard 99b2ae0  [HUDI-442] Fix 
TestComplexKeyGenerator#testSingleValueKeyGenerator and 
testMultipleValueKeyGenerator NPE
 discard 3471112  [MINOR] Fix compile error about the deletion of 
HoodieActiveTimeline#createNewCommitTime
 discard 43fbc0c  [HUDI-391] Rename module name from hudi-bench to 
hudi-test-suite and fix some checkstyle issues (#1102)
 discard 285c691  [HUDI-394] Provide a basic implementation of test suite
 add 01c868a  [HUDI-574] Fix CLI counts small file inserts as updates 
(#1321)
 add 175de0d  [MINOR] Fix typo (#1331)
 add dfbee67  [HUDI-514] A schema provider to get metadata through Jdbc 
(#1200)
 add 1f5051d  [HUDI-394] Provide a basic implementation of test suite
 add c3caeb9  [HUDI-391] Rename module name from hudi-bench to 
hudi-test-suite and fix some checkstyle issues (#1102)
 add 89788a5  [MINOR] Fix compile error about the deletion of 
HoodieActiveTimeline#createNewCommitTime
 add c9c3319  [HUDI-442] Fix 
TestComplexKeyGenerator#testSingleValueKeyGenerator and 
testMultipleValueKeyGenerator NPE
 add 188199b  [HUDI-591] Support Spark version upgrade
 add ea6389d  Fixed resource leak in HiveTestService about hive meta store
 add a823b93  [HUDI-441] Rename WorkflowDagGenerator and some class names 
in test package
 add 64f9924  [MINOR] Fix TestHoodieTestSuiteJob#testComplexDag failure
 add 5b2  [HUDI-503] Add hudi test suite documentation into the README 
file of the test suite module (#1191)
 add c1035d0  [HUDI-592] Remove duplicated dependencies in the pom file of 
test suite module
 add 67fdda3  [MINOR] Fix compile error after rebasing the branch

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (3063cd7)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (67fdda3)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/cli/commands/CommitsCommand.java   |   2 +-
 hudi-utilities/pom.xml |   7 ++
 .../org/apache/hudi/utilities/UtilHelpers.java | 101 +
 .../AbstractDeltaStreamerService.java  |   2 +-
 .../utilities/schema/JdbcbasedSchemaProvider.java  |  84 +
 .../hudi/utilities/TestHoodieDeltaStreamer.java|  11 ++-
 .../utilities/TestJdbcbasedSchemaProvider.java |  88 ++
 .../apache/hudi/utilities/UtilitiesTestBase.java   |  18 +++-
 .../delta-streamer-config/source-jdbc.avsc |  43 -
 .../resources/delta-streamer-config/triprec.sql|  21 +++--
 pom.xml|   1 +
 11 files changed, 351 insertions(+), 27 deletions(-)
 create mode 100644 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/JdbcbasedSchemaProvider.java
 create mode 100644 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestJdbcbasedSchemaProvider.java
 copy hudi-common/src/test/resources/timestamp-test-evolved.avsc => 
hudi-utilities/src/test/resources/delta-streamer-config/source-jdbc.avsc (60%)
 copy hudi-common/src/test/resources/simple-test.avsc => 
hudi-utilities/src/test/resources/delta-streamer-config/triprec.sql (75%)



[GitHub] [incubator-hudi] liujianhuiouc commented on issue #1216: [HUDI-525] lack of insert info in delta_commit inflight

2020-02-13 Thread GitBox
liujianhuiouc commented on issue #1216: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1216#issuecomment-586070652
 
 
   @n3nash I don't have a definite case. In that case, the fields related to 
updates are already in the metadata, so I think the insert info should also be in 
that metadata


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (b04b037 -> 3063cd7)

2020-02-13 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


omit b04b037  [MINOR] Fix compile error after rebasing the branch
 add 3063cd7  [MINOR] Fix compile error after rebasing the branch

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (b04b037)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (3063cd7)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../main/java/org/apache/hudi/testsuite/generator/DeltaGenerator.java | 2 +-
 .../main/java/org/apache/hudi/testsuite/job/HoodieTestSuiteJob.java   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)



[GitHub] [incubator-hudi] vinothchandar commented on issue #1308: [Hudi-561] partition path config

2020-02-13 Thread GitBox
vinothchandar commented on issue #1308: [Hudi-561] partition path config
URL: https://github.com/apache/incubator-hudi/pull/1308#issuecomment-586067261
 
 
   hi @UZi5136225 you can actually define your own key generator class to do 
this.. This PR should not be necessary.. 
   
   https://hudi.apache.org/docs/configurations.html#KEYGENERATOR_CLASS_OPT_KEY 
   
   we even have some built-in ones.. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

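For reference, a minimal sketch of such a custom key generator, under stated assumptions: the field names `uuid` and `event_date` are illustrative, not taken from the linked docs, and the base-class shape follows the `KeyGenerator` contract in hudi-spark of this era.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.KeyGenerator;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.util.TypedProperties;

// Sketch of a custom key generator: the record key and the partition path are
// each derived from a single record field. Both field names are assumptions.
public class MyKeyGenerator extends KeyGenerator {

  public MyKeyGenerator(TypedProperties props) {
    super(props);
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    String recordKey = record.get("uuid").toString();           // assumed key field
    String partitionPath = record.get("event_date").toString(); // assumed partition field
    return new HoodieKey(recordKey, partitionPath);
  }
}
```

It would then be wired in through the `hoodie.datasource.write.keygenerator.class` option referenced by the configurations page above.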

[GitHub] [incubator-hudi] satishkotha commented on issue #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-13 Thread GitBox
satishkotha commented on issue #1312: [HUDI-571] Add "compactions show 
archived" command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312#issuecomment-586066884
 
 
   @n3nash I removed the comment. please take a look. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-13 Thread GitBox
satishkotha commented on a change in pull request #1312: [HUDI-571] Add 
"compactions show archived" command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312#discussion_r379220952
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
 ##
 @@ -249,6 +262,128 @@ public String compact(
 return "Compaction successfully completed for " + compactionInstantTime;
   }
 
+  /**
+   * Prints all compaction details.
+   * Note values in compactionsPlans and instants lists are expected to be in 
cohesion
 
 Review comment:
   it's no longer relevant, so I removed it. I initially had two parameters: 
List<HoodieCompactionPlan> and List<HoodieInstant>. But that didn't seem good, so I changed it 
to one parameter. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1330: [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned by Date type columns

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1330: [HUDI-607] Fix to 
allow creation/syncing of Hive tables partitioned by Date type columns
URL: https://github.com/apache/incubator-hudi/pull/1330#discussion_r379219794
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -77,6 +80,11 @@ public static Object getNestedFieldVal(GenericRecord 
record, String fieldName, b
 
   // return, if last part of name
   if (i == parts.length - 1) {
+
+if (isLogicalTypeDate(valueNode, part)) {
 
 Review comment:
   Should we do this conversion out in the caller of this method instead? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1330: [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned by Date type columns

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1330: [HUDI-607] Fix to 
allow creation/syncing of Hive tables partitioned by Date type columns
URL: https://github.com/apache/incubator-hudi/pull/1330#discussion_r379219969
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -77,6 +80,11 @@ public static Object getNestedFieldVal(GenericRecord 
record, String fieldName, b
 
   // return, if last part of name
   if (i == parts.length - 1) {
+
+if (isLogicalTypeDate(valueNode, part)) {
+  return LocalDate.ofEpochDay(Long.parseLong(val.toString()));
 
 Review comment:
   May not be related to this PR. but is time zone information lost when we 
convert from Date to Long (row -> avro)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-607) Hive sync fails to register tables partitioned by Date Type column

2020-02-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-607:

Fix Version/s: 0.6.0

> Hive sync fails to register tables partitioned by Date Type column
> --
>
> Key: HUDI-607
> URL: https://issues.apache.org/jira/browse/HUDI-607
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Issue Description
> As part of the spark-to-avro conversion, Spark's *Date* type is represented as the 
> corresponding *Date Logical Type* in Avro, which is underneath represented in 
> Avro by the physical *Integer* type. For this reason, when forming the Avro 
> records from Spark rows, the date is converted to the corresponding *epoch day* and 
> stored as an *Integer* value in the parquet files.
> However, this manifests as a problem when a *Date Type* column is 
> chosen as the partition column. In this case, Hudi's partition column 
> *_hoodie_partition_path* also gets the corresponding *epoch day integer* 
> value when reading the partition field from the avro record, and as a result 
> syncing partitions in the hudi table issues a command like the following, where 
> the date is an integer:
> {noformat}
> ALTER TABLE uditme_hudi.uditme_hudi_events_cow_feb05_00 ADD IF NOT EXISTS   
> PARTITION (event_date='17897') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17897'
>PARTITION (event_date='17898') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17898'
>PARTITION (event_date='17899') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17899'
>PARTITION (event_date='17900') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17900'{noformat}
> Hive is not able to make sense of partition field values like *17897*, as 
> it cannot convert such a string to the corresponding date. It 
> expects the actual date to be represented in string form.
> So, we need to make sure that Hudi's partition field gets the actual date 
> value in string form, instead of the integer. This change makes sure that 
> when a field's value is retrieved from the Avro record, we check whether it is of the 
> *Date Logical Type* and, if so, return the actual date value instead of the epoch day. 
> After this change, the partition sync command looks like:
> {noformat}
> ALTER TABLE `uditme_hudi`.`uditme_hudi_events_cow_feb05_01` ADD IF NOT EXISTS 
>   PARTITION (`event_date`='2019-01-01') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-01'
>PARTITION (`event_date`='2019-01-02') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-02'
>PARTITION (`event_date`='2019-01-03') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-03'
>PARTITION (`event_date`='2019-01-04') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-04'{noformat}
> h2. Stack Trace
> {noformat}
> 20/01/13 23:28:04 INFO HoodieHiveClient: Last commit time synced is not 
> known, listing all partitions in 
> s3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar,FS
>  :com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@1f0c8e1f
> 20/01/13 23:28:08 INFO HiveSyncTool: Storage partitions scan complete. Found 
> 31
> 20/01/13 23:28:08 INFO HiveSyncTool: New Partitions [18206, 18207, 18208, 
> 18209, 18210, 18211, 18212, 18213, 18214, 18215, 18216, 18217, 18218, 18219, 
> 18220, 18221, 18222, 18223, 18224, 18225, 18226, 18227, 18228, 18229, 18230, 
> 18231, 18232, 18233, 18234, 18235, 18236]
> 20/01/13 23:28:08 INFO HoodieHiveClient: Adding partitions 31 to table 
> fact_hourly_search_term_conversions_hudi_mor_hudi_jar
> 20/01/13 23:28:08 INFO HoodieHiveClient: Executing SQL ALTER TABLE 
> default.fact_hourly_search_term_conversions_hudi_mor_hudi_jar ADD IF NOT 
> EXISTS   PARTITION (dim_date='18206') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18206'
>PARTITION (dim_date='18207') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18207'
>PARTITION (dim_date='18208') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18208'
>PARTITION (dim_date='18209') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18209'   PARTITION

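To illustrate the fix described in HUDI-607, here is a minimal sketch of returning the real date instead of the epoch-day integer. The class and method names are made up for illustration; the logical-type check and the `LocalDate.ofEpochDay(Long.parseLong(val.toString()))` conversion mirror the change reviewed in PR #1330, assuming Avro 1.8+ logical types.

```java
import java.time.LocalDate;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

// Sketch: when a field carries Avro's "date" logical type, its int value is an
// epoch day; convert it back to a date so the partition path becomes e.g.
// "2019-01-01" instead of "17897" (epoch day 17897 == 2019-01-01).
final class DateFieldValue {
  static Object resolve(Schema fieldSchema, Object val) {
    if (fieldSchema.getType() == Schema.Type.INT
        && fieldSchema.getLogicalType() instanceof LogicalTypes.Date) {
      return LocalDate.ofEpochDay(Long.parseLong(val.toString()));
    }
    return val;
  }
}
```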
[GitHub] [incubator-hudi] vinothchandar commented on issue #1200: [HUDI-514] A schema provider to get metadata through Jdbc

2020-02-13 Thread GitBox
vinothchandar commented on issue #1200: [HUDI-514] A schema provider to get 
metadata through Jdbc
URL: https://github.com/apache/incubator-hudi/pull/1200#issuecomment-586062515
 
 
   @OpenOpened Thanks for working through this issue with us patiently! merged!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (175de0d -> dfbee67)

2020-02-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 175de0d  [MINOR] Fix typo (#1331)
 add dfbee67  [HUDI-514] A schema provider to get metadata through Jdbc 
(#1200)

No new revisions were added by this update.

Summary of changes:
 hudi-utilities/pom.xml |   7 ++
 .../org/apache/hudi/utilities/UtilHelpers.java | 100 +
 .../utilities/schema/JdbcbasedSchemaProvider.java  |  84 +
 .../hudi/utilities/TestHoodieDeltaStreamer.java|   3 +-
 .../utilities/TestJdbcbasedSchemaProvider.java |  88 ++
 .../apache/hudi/utilities/UtilitiesTestBase.java   |  12 ++-
 .../delta-streamer-config/source-jdbc.avsc |  43 +++--
 .../resources/delta-streamer-config/triprec.sql|  21 ++---
 pom.xml|   1 +
 9 files changed, 340 insertions(+), 19 deletions(-)
 create mode 100644 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/JdbcbasedSchemaProvider.java
 create mode 100644 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestJdbcbasedSchemaProvider.java
 copy hudi-common/src/test/resources/timestamp-test-evolved.avsc => 
hudi-utilities/src/test/resources/delta-streamer-config/source-jdbc.avsc (60%)
 copy hudi-common/src/test/resources/simple-test.avsc => 
hudi-utilities/src/test/resources/delta-streamer-config/triprec.sql (75%)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1200: [HUDI-514] A schema provider to get metadata through Jdbc

2020-02-13 Thread GitBox
vinothchandar merged pull request #1200: [HUDI-514] A schema provider to get 
metadata through Jdbc
URL: https://github.com/apache/incubator-hudi/pull/1200
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-13 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-586062275
 
 
   > > How have we validated the framework? are there real tests now?
   > 
   > Can you also please answer these?
   
   We validate the test suite by running the unit tests and e2e 
tests (`TestHoodieTestSuiteJob`) locally, on Travis, and in docker. There is one mock 
test 
here[1](https://github.com/apache/incubator-hudi/blob/hudi_test_suite_refactor/hudi-test-suite/src/test/resources/hudi-test-suite-config/complex-workflow-dag-cow.yaml).
 About real tests, do you mean how to validate all the functions of hudi via 
this test suite?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-02-13 Thread GitBox
vinothchandar commented on issue #1165: [HUDI-76] Add CSV Source support for 
Hudi Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-586061084
 
 
   cc @nsivabalan  could you take over in case @pratyakshsharma is busy.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [WIP] [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-02-13 Thread GitBox
vinothchandar commented on issue #1176: [WIP] [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-586061205
 
 
   @nsivabalan are you still blocked by the test failures for this one? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-13 Thread GitBox
vinothchandar commented on issue #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-586060460
 
 
   @pratyakshsharma  let's revive this? :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r379216693
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/util/DFSTablePropertiesConfiguration.java
 ##
 @@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.util;
+
+import org.apache.hudi.common.model.TableConfig;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Used for parsing custom files having TableConfig objects.
+ */
+public class DFSTablePropertiesConfiguration {
 
 Review comment:
   I think your understanding is correct.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r379215926
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -80,16 +79,27 @@
   + "{\"name\": \"begin_lat\", \"type\": \"double\"},{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
   + "{\"name\": \"end_lat\", \"type\": \"double\"},{\"name\": \"end_lon\", 
\"type\": \"double\"},"
   + "{\"name\":\"fare\",\"type\": \"double\"}]}";
+  public static String GROCERY_PURCHASE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"purchaserec\",\"fields\":["
 
 Review comment:
   That last sentence can happen on a separate JIRA.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r379215764
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -80,16 +79,27 @@
   + "{\"name\": \"begin_lat\", \"type\": \"double\"},{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
   + "{\"name\": \"end_lat\", \"type\": \"double\"},{\"name\": \"end_lon\", 
\"type\": \"double\"},"
   + "{\"name\":\"fare\",\"type\": \"double\"}]}";
+  public static String GROCERY_PURCHASE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"purchaserec\",\"fields\":["
 
 Review comment:
   I think what @bvaradar is trying to get at is that the current payload is 
built around the trip schema.. Overall, can we just have multiple topics with 
the same trips schema? I am not sure what real benefits we get from introducing a 
new schema here.. if we do, I would like to spend more time structuring this 
class well, since it will make this class even more unclean than what it is 
today.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-13 Thread GitBox
vinothchandar commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r379215764
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -80,16 +79,27 @@
   + "{\"name\": \"begin_lat\", \"type\": \"double\"},{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
   + "{\"name\": \"end_lat\", \"type\": \"double\"},{\"name\": \"end_lon\", 
\"type\": \"double\"},"
   + "{\"name\":\"fare\",\"type\": \"double\"}]}";
+  public static String GROCERY_PURCHASE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"purchaserec\",\"fields\":["
 
 Review comment:
   I think what @bvaradar is trying to get at is that the current payload is 
built around the trip schema.. Overall, can we just have multiple topics with 
the same trips schema? I am not sure what real benefits we get from introducing a 
new schema here.. if you feel strongly still, I would like to spend more time 
structuring this class well, since it will make this class even more unclean 
than what it is today.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-13 Thread GitBox
n3nash commented on a change in pull request #1332: [HUDI -409] Match header 
and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r379212556
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -239,6 +239,15 @@ private boolean isBlockCorrupt(int blocksize) throws 
IOException {
   return true;
 }
 
+// check if the blocksize mentioned in the footer is the same as the 
header; by seeking back the length of a long
+// the backward seek does not incur additional IO as {@link 
org.apache.hadoop.hdfs.DFSInputStream#seek()}
+// only moves the index. actual IO happens on the next read operation
+inputStream.seek(inputStream.getPos() - Long.BYTES);
+long blockSizeFooter = inputStream.readLong() - MAGIC_BUFFER.length;
 
 Review comment:
   nit : s/blockSizeFooter/blockSizeFromFooter


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-13 Thread GitBox
n3nash commented on a change in pull request #1332: [HUDI -409] Match header 
and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r379212556
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -239,6 +239,15 @@ private boolean isBlockCorrupt(int blocksize) throws 
IOException {
   return true;
 }
 
+// check if the blocksize mentioned in the footer is the same as the 
header; by seeking back the length of a long
+// the backward seek does not incur additional IO as {@link 
org.apache.hadoop.hdfs.DFSInputStream#seek()}
+// only moves the index. actual IO happens on the next read operation
+inputStream.seek(inputStream.getPos() - Long.BYTES);
+long blockSizeFooter = inputStream.readLong() - MAGIC_BUFFER.length;
 
 Review comment:
   nit : s/blockSizeFooter/blockSizeInFooter


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-13 Thread GitBox
n3nash commented on a change in pull request #1332: [HUDI -409] Match header 
and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r379212070
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -239,6 +239,15 @@ private boolean isBlockCorrupt(int blocksize) throws 
IOException {
   return true;
 }
 
+// check if the blocksize mentioned in the footer is the same as the 
header; by seeking back the length of a long
+// the backward seek does not incur additional IO as {@link 
org.apache.hadoop.hdfs.DFSInputStream#seek()}
+// only moves the index. actual IO happens on the next read operation
+inputStream.seek(inputStream.getPos() - Long.BYTES);
 
 Review comment:
   This assumes that the block length will always be the last thing written 
in the footer; can we add comments in the footer writer part to denote this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

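A minimal sketch of the header/footer length cross-check discussed in the review above. The helper shape and names are illustrative; the seek-then-readLong pattern and the subtraction of the magic buffer length mirror the diff in PR #1332.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Sketch: a log block records its size both after the #HUDI# magic header and
// as the trailing long of the block; if the two copies disagree, the block is
// reported corrupt rather than trusting the header length alone.
final class BlockLengthCheck {
  static boolean footerMatchesHeader(FSDataInputStream inputStream,
                                     long blockSizeFromHeader,
                                     int magicLength) throws IOException {
    // The backward seek does not incur additional IO; actual IO happens on the next read.
    inputStream.seek(inputStream.getPos() - Long.BYTES);
    long blockSizeFromFooter = inputStream.readLong() - magicLength;
    return blockSizeFromFooter == blockSizeFromHeader;
  }
}
```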

[GitHub] [incubator-hudi] bhasudha commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-13 Thread GitBox
bhasudha commented on issue #1329: [SUPPORT] Presto cannot query 
non-partitioned table
URL: https://github.com/apache/incubator-hudi/issues/1329#issuecomment-586038416
 
 
   @popart  The stack trace you showed looks like somehow a partition metafile 
(".hoodie_partition_metadata") was created in the table path. If this file is 
present, Hudi tries to read the partition depth from it by searching for 
the key "partitionDepth" in that metafile, and the exception "Could not find 
partitionDepth in partition metafile" is thrown in this case. Can you quickly 
check if this partition metafile is present in your table base path? If that 
is the case, we need to dig into why it was created even though you chose a 
non-partitioned table. 
   
   
   Also, just wanted to check: did you mean presto version 0.227? Also, from 
the stack trace (that you posted specific to Presto), it looks like this is 
coming from the earlier Hudi version 0.5.0-incubating. Can you also confirm if this 
is true, for further debugging? Do you mind creating a Jira issue with these 
details?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

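A quick way to perform that check from the table base path, as a sketch using plain Hadoop FS calls; the path argument is a placeholder, and only the metafile's presence is checked, not its contents.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: check whether ".hoodie_partition_metadata" sits directly under the
// table base path, which would explain the partitionDepth error above.
public class CheckPartitionMetafile {
  public static void main(String[] args) throws Exception {
    Path basePath = new Path(args[0]); // e.g. s3://bucket/path/to/table (placeholder)
    FileSystem fs = basePath.getFileSystem(new Configuration());
    boolean present = fs.exists(new Path(basePath, ".hoodie_partition_metadata"));
    System.out.println(".hoodie_partition_metadata present at base path: " + present);
  }
}
```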

[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
ramachandranms commented on a change in pull request #1333: [HUDI-589][DOCS] 
Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379190434
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -24,31 +24,49 @@ If `table name = hudi_trips` and `table type = 
MERGE_ON_READ`, then we get:
  
 
 As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a table). Hudi 
tables can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
-since a specified instant time. This, together with upserts, are particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally pulled (streams/facts),
-joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is 
realized by querying one of the tables above, 
+is obtaining a change stream/log from a table. Hudi tables can be queried 
incrementally, which means you can get ALL and ONLY the updated & new rows 
+since a specified instant time. This, together with upserts, is particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally queried (streams/facts),
+joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental queries 
are realized by querying one of the tables above, 
 with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the table. 
 
-In sections, below we will discuss how to access these query types from 
different query engines.
+
+## SUPPORT MATRIX
+
+### COPY_ON_WRITE tables
+  
+||Snapshot|Incremental|Read Optimized|
+|||---|--|
+|**Hive**|Y|Y|N/A|
+|**Spark datasource**|Y|Y|N/A|
+|**Spark SQL**|Y|Y|N/A|
+|**Presto**|Y|N|N/A|
+
+### MERGE_ON_READ tables
+
+||Snapshot|Incremental|Read Optimized|
+|||---|--|
+|**Hive**|Y|Y|Y|
+|**Spark datasource**|N|N|Y|
+|**Spark SQL**|Y|Y|Y|
+|**Presto**|N|N|Y|
 
 Review comment:
   `Presto` section below says snapshot queries are supported for presto. 
Support matrix says `N`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on issue #1316: [HUDI-604] Update docker page

2020-02-13 Thread GitBox
bhasudha commented on issue #1316: [HUDI-604] Update docker page
URL: https://github.com/apache/incubator-hudi/pull/1316#issuecomment-586033608
 
 
   Will merge this with other doc changes later next week after cutting 0.5.1 
doc version


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on issue #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
bhasudha commented on issue #1333: [HUDI-589][DOCS] Fix querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#issuecomment-586032324
 
 
   @vinothchandar @leesf Attaching screenshot pdf here for look and feel 
[Querying Hudi Tables - Apache 
Hudi.pdf](https://github.com/apache/incubator-hudi/files/4202018/Querying.Hudi.Tables.-.Apache.Hudi.pdf)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-589) Fix references to Views in some of the pages. Replace with Query instead

2020-02-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-589:

Labels: pull-request-available  (was: )

> Fix references to Views in some of the pages. Replace with Query instead
> 
>
> Key: HUDI-589
> URL: https://issues.apache.org/jira/browse/HUDI-589
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
>
> querying_data.html still has some references to 'views'. This needs to be 
> replaced with 'queries'/'query types' appropriately.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bhasudha opened a new pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-13 Thread GitBox
bhasudha opened a new pull request #1333: [HUDI-589][DOCS] Fix querying_data 
page
URL: https://github.com/apache/incubator-hudi/pull/1333
 
 
   - Added support matrix for COW and MOR tables
   - Change reference from (`views`|`pulls`) to `queries`
   - And minor restructuring
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [x] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ramachandranms opened a new pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-13 Thread GitBox
ramachandranms opened a new pull request #1332: [HUDI -409] Match header and 
footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332
 
 
   ## What is the purpose of the pull request
   
   - This PR is to address the JIRA ticket - 
[HUDI-409](https://issues.apache.org/jira/browse/HUDI-409)
   - Current logic to determine a corrupted block (by just using the \#HUDI\# marker 
alone) fails to address the following issues:
     1. The reader crashes if the block after the corrupted block has the exact 
same size as the amount of data missing from the corrupted block
     2. The start of the next block is determined solely from the marker, which 
can create multiple corrupted blocks if the partial data of the corrupted block 
contains the marker as part of the written data
   
   ## Brief change log
   
   - Made changes to HoodieLogFileReader's `next()` and `prev()` methods. Added 
additional check to validate if block sizes stored at start and end of the 
blocks are the same. Report block as corrupted if they don't match
   - Modified existing unit tests 
`TestHoodieLogFormat#testAppendAndReadOnCorruptedLog()` & 
`TestHoodieLogFormat#testAppendAndReadOnCorruptedLogInReverse()` to add 
additional blocks to validate the reader changes
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
   - Modified  `TestHoodieLogFormat#testAppendAndReadOnCorruptedLog()` & 
`TestHoodieLogFormat#testAppendAndReadOnCorruptedLogInReverse()` tests to 
account for the change
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-13 Thread GitBox
n3nash commented on a change in pull request #1312: [HUDI-571] Add "compactions 
show archived" command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312#discussion_r379164513
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
 ##
 @@ -249,6 +262,128 @@ public String compact(
 return "Compaction successfully completed for " + compactionInstantTime;
   }
 
+  /**
+   * Prints all compaction details.
+   * Note values in compactionsPlans and instants lists are expected to be in 
cohesion
 
 Review comment:
   what does this comment mean?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
vinothchandar commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585993053
 
 
   @adamjoneill Hmmm, this is surprising.. 
   
   - Does a `select * from table` on the hudi table work using spark sql? Could 
you give this a shot (trying to narrow this down)?
   - Are you able to provide a simple reproducible use-case using 
spark.parallelize(), as you were trying before?
   
   I can then take a deeper look.. @bhasudha please chime in as well


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs

2020-02-13 Thread GitBox
vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585991857
 
 
   There must be something else going on.. just used my own benchmark jobs to 
generate a pattern where the records are fully overwritten in a second (and a 
third) batch and it actually finishes fine.. 
   
   ```
   hudi:hoodie_benchmark->connect --path file:///tmp/hudi-benchmark/output/org.apache.hudi
   35394 [Spring Shell] INFO  org.apache.hudi.common.table.HoodieTableMetaClient  - Loading HoodieTableMetaClient from file:///tmp/hudi-benchmark/output/org.apache.hudi
   35415 [Spring Shell] INFO  org.apache.hudi.common.util.FSUtils  - Hadoop Configuration: fs.defaultFS: [file:///], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [org.apache.hadoop.fs.LocalFileSystem@6851d345]
   35416 [Spring Shell] INFO  org.apache.hudi.common.table.HoodieTableConfig  - Loading table properties from file:/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/hoodie.properties
   35416 [Spring Shell] INFO  org.apache.hudi.common.table.HoodieTableMetaClient  - Finished Loading Table of type COPY_ON_WRITE(version=1) from file:///tmp/hudi-benchmark/output/org.apache.hudi
   Metadata for table hoodie_benchmark loaded
   hudi:hoodie_benchmark->commits show
   36774 [Spring Shell] INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline  - Loaded instants [[20200213134159__clean__COMPLETED], [20200213134159__commit__COMPLETED], [20200213134410__clean__COMPLETED], [20200213134410__commit__COMPLETED], [20200213134548__clean__COMPLETED], [20200213134548__commit__COMPLETED]]
   
   | CommitTime     | Total Bytes Written | Total Files Added | Total Files Updated | Total Partitions Written | Total Records Written | Total Update Records Written | Total Errors |
   | 20200213134548 | 384.8 MB            | 0                 | 34                  | 3                        | 4080024               | 1211376                      | 0            |
   | 20200213134410 | 379.9 MB            | 0                 | 34                  | 3                        | 4040016               | 1199234                      | 0            |
   | 20200213134159 | 374.8 MB            | 34                | 0                   | 3                        | 408                   | 0                            | 0            |
   
   hudi:hoodie_benchmark->
   ```
   
   and the times below in ms
   
   ```
grep -n -e totalCreateTime -e totalUpsertTime /tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/*.commit

/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134159.commit:697:  "totalCreateTime" : 195060,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134159.commit:698:  "totalUpsertTime" : 0,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134410.commit:697:  "totalCreateTime" : 0,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134410.commit:698:  "totalUpsertTime" : 193693,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134548.commit:697:  "totalCreateTime" : 0,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134548.commit:698:  "totalUpsertTime" : 182277,
   ```
   
   
   Can we drill into your dataset? Are you generating tons of files due to 
granular partitioning? Can you share the spark UI and the hudi cli output 
like the above?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nbalajee commented on a change in pull request #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-13 Thread GitBox
nbalajee commented on a change in pull request #1312: [HUDI-571] Add 
"compactions show archived" command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312#discussion_r379089359
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
 ##
 @@ -95,51 +101,9 @@ public String compactionsAll(
   throws IOException {
 HoodieTableMetaClient client = checkAndGetMetaClient();
 HoodieActiveTimeline activeTimeline = client.getActiveTimeline();
-HoodieTimeline timeline = activeTimeline.getCommitsAndCompactionTimeline();
-HoodieTimeline commitTimeline = 
activeTimeline.getCommitTimeline().filterCompletedInstants();
-Set committed = 
commitTimeline.getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toSet());
-
-List instants = 
timeline.getReverseOrderedInstants().collect(Collectors.toList());
-List rows = new ArrayList<>();
-for (HoodieInstant instant : instants) {
-  HoodieCompactionPlan compactionPlan = null;
-  if (!HoodieTimeline.COMPACTION_ACTION.equals(instant.getAction())) {
-try {
-  // This could be a completed compaction. Assume a compaction request 
file is present but skip if fails
-  compactionPlan = AvroUtils.deserializeCompactionPlan(
-  activeTimeline.readCompactionPlanAsBytes(
-  
HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
-} catch (HoodieIOException ioe) {
-  // SKIP
-}
-  } else {
-compactionPlan = 
AvroUtils.deserializeCompactionPlan(activeTimeline.readCompactionPlanAsBytes(
-
HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
-  }
-
-  if (null != compactionPlan) {
-State state = instant.getState();
-if (committed.contains(instant.getTimestamp())) {
-  state = State.COMPLETED;
-}
-if (includeExtraMetadata) {
-  rows.add(new Comparable[] {instant.getTimestamp(), state.toString(),
-  compactionPlan.getOperations() == null ? 0 : 
compactionPlan.getOperations().size(),
-  compactionPlan.getExtraMetadata().toString()});
-} else {
-  rows.add(new Comparable[] {instant.getTimestamp(), state.toString(),
-  compactionPlan.getOperations() == null ? 0 : 
compactionPlan.getOperations().size()});
-}
-  }
-}
-
-Map> fieldNameToConverterMap = new 
HashMap<>();
-TableHeader header = new TableHeader().addTableHeaderField("Compaction 
Instant Time").addTableHeaderField("State")
-.addTableHeaderField("Total FileIds to be Compacted");
-if (includeExtraMetadata) {
-  header = header.addTableHeaderField("Extra Metadata");
-}
-return HoodiePrintHelper.print(header, fieldNameToConverterMap, 
sortByField, descending, limit, headerOnly, rows);
+return printAllCompactions(activeTimeline,
+compactionPlanReader(this::readCompactionPlanForActiveTimeline, 
activeTimeline),
 
 Review comment:
   Got it.  LGTM.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585932919
 
 
   @vinothchandar my investigation above suggests it is down to how 
hudi writes the parquet data. 
   
   Whilst limited in scope, and with many moving parts, my investigation 
involved: 
   
   1. taking a record that includes an array of complex objects (no primitive 
or "simple" types belong to the array item object) off the kinesis stream
   2. saving it to parquet in S3 using the dataFrame API
   3. then, using the same record, saving it to S3 using hudi
   4. AWS Glue crawls over these files and creates the database and tables
   5. presto query `select * from table` against the hudi parquet file fails
   6. presto query `select * from table` against the dataFrame API parquet file succeeds
   
   I agree it does seem strange, and the stack trace does point to a presto 
issue with reading the array. Unfortunately I'm not across the project enough to 
know where to begin debugging the issue. What can I do to investigate further?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

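A hedged sketch of a minimal repro along the lines of steps 1-6 above, in Java Spark. The schema, the paths, and the abbreviated Hudi write options are illustrative assumptions; a real run needs the full set of Hudi write options (precombine field, partition path, and so on).

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: one record whose array column holds only complex objects, written
// once as plain parquet and once through the Hudi datasource, so the same
// Presto query can be run against both outputs.
public class NestedArrayRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("nested-array-repro").getOrCreate();
    String json = "{\"id\":\"1\",\"items\":[{\"sku\":\"a\",\"qty\":2},{\"sku\":\"b\",\"qty\":1}]}";
    Dataset<Row> df = spark.read().json(
        spark.createDataset(Arrays.asList(json), Encoders.STRING()));

    // Plain parquet output (placeholder path).
    df.write().mode("overwrite").parquet("s3://bucket/plain-parquet/");

    // Hudi output (placeholder path; options abbreviated for illustration).
    df.write().format("org.apache.hudi")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.table.name", "nested_repro")
        .mode("overwrite")
        .save("s3://bucket/hudi-table/");
  }
}
```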

[GitHub] [incubator-hudi] vinothchandar commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
vinothchandar commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585920479
 
 
   @adamjoneill I am wondering if this is simply a presto issue.. Hudi/Presto 
integration leaves all of the query planning and parquet reading to Presto 
itself.. Do you suspect it is due to the way Hudi writes the parquet data? 
I have seen tons of nested data being queried successfully using hudi and 
presto, so trying to understand.. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1314: [HUDI-542] Introduce a new pom module named hudi-writer-common

2020-02-13 Thread GitBox
vinothchandar commented on issue #1314: [HUDI-542] Introduce a new pom module 
named hudi-writer-common
URL: https://github.com/apache/incubator-hudi/pull/1314#issuecomment-585917397
 
 
   @yanghua as I also mentioned on the JIRA, what you suggest makes sense when 
we are in the phase of separating hudi-spark and hudi-writer-common.. This 
model also makes it possible for us to take the hit of rebasing without 
thrashing all existing PRs. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-581) NOTICE need more work as it missing content form included 3rd party ALv2 licensed NOTICE files

2020-02-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-581:

Fix Version/s: 0.6.0

> NOTICE need more work as it missing content form included 3rd party ALv2 
> licensed NOTICE files
> --
>
> Key: HUDI-581
> URL: https://issues.apache.org/jira/browse/HUDI-581
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.6.0
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-611) Impala sync tool

2020-02-13 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-611:
---

 Summary: Impala sync tool
 Key: HUDI-611
 URL: https://issues.apache.org/jira/browse/HUDI-611
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


Like the existing sync to Hive, we need a tool to sync Hudi tables with Impala. 
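
For illustration, a minimal sketch of such a sync step, assuming an impalad 
reachable over JDBC; the endpoint, table name, and paths are placeholders, 
and partition-column handling is elided:
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ImpalaSyncSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical impalad endpoint, reached via the Hive JDBC driver (assumption).
    String jdbcUrl = "jdbc:hive2://impalad-host:21050/;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement stmt = conn.createStatement()) {
      // Register the Hudi table's storage location as an external Impala
      // table, inferring the schema from one existing parquet file.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS hudi_trips "
          + "LIKE PARQUET 'hdfs:///data/hudi_trips/2020/02/13/some-file.parquet' "
          + "STORED AS PARQUET LOCATION 'hdfs:///data/hudi_trips'");
      // Pick up partitions and files written by the latest Hudi commit.
      stmt.execute("ALTER TABLE hudi_trips RECOVER PARTITIONS");
      stmt.execute("REFRESH hudi_trips");
    }
  }
}
{code}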



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-610) Impala near real time table support

2020-02-13 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-610:
---

 Summary: Impala near real time table support
 Key: HUDI-610
 URL: https://issues.apache.org/jira/browse/HUDI-610
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


Impala uses the Java-based module called "frontend" to list all the files to 
scan and lets the C++ based "backend" do all the file scanning. 

Merging Avro and Parquet could be difficult because it might need custom 
merging logic like RealtimeCompactedRecordReader to be implemented in the 
backend using C++, but I think it will be doable to have something like 
RealtimeUnmergedRecordReader, which only needs some changes in the frontend. 
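
As a rough illustration of the unmerged read path, a sketch that simply 
concatenates base file records with log file records, with no merging; every 
class and method name here is hypothetical, not an existing Hudi or Impala API:
{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical "unmerged" file slice reader: emits all base file (parquet)
// records first, then all log file (avro) records, without deduplication.
public class UnmergedFileSliceIterator<R> implements Iterator<R> {
  private final Iterator<R> baseFileRecords; // scanned by the C++ backend today
  private final Iterator<R> logFileRecords;  // listing these needs frontend changes

  public UnmergedFileSliceIterator(Iterator<R> baseFileRecords,
                                   Iterator<R> logFileRecords) {
    this.baseFileRecords = baseFileRecords;
    this.logFileRecords = logFileRecords;
  }

  @Override
  public boolean hasNext() {
    return baseFileRecords.hasNext() || logFileRecords.hasNext();
  }

  @Override
  public R next() {
    if (baseFileRecords.hasNext()) {
      return baseFileRecords.next();
    }
    if (logFileRecords.hasNext()) {
      return logFileRecords.next();
    }
    throw new NoSuchElementException();
  }
}
{code}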



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [MINOR] Fix typo (#1331)

2020-02-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 175de0d  [MINOR] Fix typo (#1331)
175de0d is described below

commit 175de0db7b5687e5f8602ccaf97dadd7758287e0
Author: Mathieu <49835526+wangxian...@users.noreply.github.com>
AuthorDate: Fri Feb 14 02:46:10 2020 +0800

[MINOR] Fix typo (#1331)
---
 .../hudi/utilities/deltastreamer/AbstractDeltaStreamerService.java  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/AbstractDeltaStreamerService.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/AbstractDeltaStreamerService.java
index 5d36e8d..8fe1a71 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/AbstractDeltaStreamerService.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/AbstractDeltaStreamerService.java
@@ -32,7 +32,7 @@ import java.util.concurrent.TimeUnit;
 import java.util.function.Function;
 
 /**
- * Base Class for running delta-sync/compaction in separate thread and controlling their life-cyle.
+ * Base Class for running delta-sync/compaction in separate thread and controlling their life-cycle.
  */
 public abstract class AbstractDeltaStreamerService implements Serializable {
 



[GitHub] [incubator-hudi] vinothchandar merged pull request #1331: [MINOR] Fix typo

2020-02-13 Thread GitBox
vinothchandar merged pull request #1331: [MINOR] Fix typo
URL: https://github.com/apache/incubator-hudi/pull/1331
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-13 Thread GitBox
vinothchandar commented on issue #1100: [HUDI-289] Implement a test suite to 
support long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-585910315
 
 
   >How have we validated the framework? are there real tests now? 
   
   Can you also please answer these?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585777129
 
 
   Had some success: if I added a property of simple type (i.e. not a nested 
object) to the object, hudi saves the parquet and the presto queries (select * 
and nested) appear to work.
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           dummyId: 2, // <-- added this property
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```
   
   
   Just to be sure, I saved the JSON to parquet using
   
   ```
   dataFrame.withColumn("year", year(col("eventTimestamp")))
     .withColumn("month", month(col("eventTimestamp")))
     .withColumn("day", dayofmonth(col("eventTimestamp")))
     .write
     .mode(SaveMode.Append)
     .partitionBy(typeOfEventColumnName, "year", "month", "day")
     .parquet(rawOutPath)
   ```
   
   I had AWS Glue crawl over it to create a table, then ran a query in presto 
against the table; `select * from table` worked fine, whereas on the hudi 
parquet it failed. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585777129
 
 
   Had some success: if I added a property of simple type (i.e. not a nested 
object) to the object, hudi saves the parquet and the presto queries (select * 
and nested) appear to work.
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           dummyId: 2, // <-- added this property
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```
   
   
   Just to be sure, I saved the RDD to parquet using
   
   ```
   dataFrame.withColumn("year", year(col("eventTimestamp")))
     .withColumn("month", month(col("eventTimestamp")))
     .withColumn("day", dayofmonth(col("eventTimestamp")))
     .write
     .mode(SaveMode.Append)
     .partitionBy(typeOfEventColumnName, "year", "month", "day")
     .parquet(rawOutPath)
   ```
   
   I had AWS Glue crawl over it to create a table, then ran a query in presto 
against the table; `select * from table` worked fine, whereas on the hudi 
parquet it failed. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-609) Implement a Flink specific HoodieIndex

2020-02-13 Thread vinoyang (Jira)
vinoyang created HUDI-609:
-

 Summary: Implement a Flink specific HoodieIndex
 Key: HUDI-609
 URL: https://issues.apache.org/jira/browse/HUDI-609
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: vinoyang
Assignee: vinoyang


Indexing is a key step in hudi's write flow. {{HoodieIndex}} is the abstract 
super class of all index implementations. Currently, {{HoodieIndex}} is 
coupled with Spark in its design. However, HUDI-538 is restructuring 
hudi-client so that hudi can be decoupled from Spark. After that, we would 
get an engine-agnostic abstraction of {{HoodieIndex}}, and by extending that 
class, we could implement a Flink specific index.
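
A hypothetical sketch of what such a Flink specific index could look like once 
the engine-agnostic base class exists; the base class signature below is 
assumed for illustration and matches no current Hudi API:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Assumed engine-agnostic base class (post HUDI-538); illustration only.
abstract class HoodieIndex<K, L> {
  // Map a record key to its current file location; null means a new insert.
  public abstract L tagLocation(K recordKey);
}

// Flink specific implementation. A real job would back this with Flink
// keyed state; a plain map keeps the sketch self-contained.
class FlinkInMemoryHoodieIndex extends HoodieIndex<String, String> {
  private final Map<String, String> keyToFileId = new HashMap<>();

  @Override
  public String tagLocation(String recordKey) {
    return keyToFileId.get(recordKey); // null => insert, non-null => update
  }

  void updateLocation(String recordKey, String fileId) {
    keyToFileId.put(recordKey, fileId);
  }
}
{code}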



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1331: [MINOR] Fix typo

2020-02-13 Thread GitBox
wangxianghu opened a new pull request #1331: [MINOR] Fix typo
URL: https://github.com/apache/incubator-hudi/pull/1331
 
 
   ## What is the purpose of the pull request
   
   *Fix typo*
   
   ## Brief change log
   
   *Fix typo*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-608) Implement a flink datastream execution context

2020-02-13 Thread vinoyang (Jira)
vinoyang created HUDI-608:
-

 Summary: Implement a flink datastream execution context
 Key: HUDI-608
 URL: https://issues.apache.org/jira/browse/HUDI-608
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: vinoyang
Assignee: vinoyang


Currently {{HoodieWriteClient}} does something like 
`hoodieRecordRDD.map().sort()` internally. If we want to support Flink 
DataStream as the object, then we need to define an abstraction like 
{{HoodieExecutionContext}} which has a common set of map(T) -> T, 
filter(), and repartition() methods. There will be subclasses like 
{{HoodieFlinkDataStreamExecutionContext}} which implement them 
in Flink specific ways and hand back the transformed T object.
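
A hypothetical sketch of that abstraction and its Flink subclass; all names 
and signatures are illustrative only, the functions passed in are assumed to 
be serializable, and Flink type-information handling is elided:
{code:java}
import java.io.Serializable;
import java.util.function.Function;
import java.util.function.Predicate;

import org.apache.flink.streaming.api.datastream.DataStream;

// Proposed engine-agnostic contract: C is the engine's collection type
// (RDD, DataStream, ...) and R is the record type.
interface HoodieExecutionContext<C, R> extends Serializable {
  C map(C data, Function<R, R> fn);
  C filter(C data, Predicate<R> fn);
  C repartition(C data, int parallelism);
}

// Flink subclass that hands back the transformed DataStream.
class HoodieFlinkDataStreamExecutionContext<R>
    implements HoodieExecutionContext<DataStream<R>, R> {

  @Override
  public DataStream<R> map(DataStream<R> data, Function<R, R> fn) {
    return data.map(fn::apply); // wrapped as a Flink MapFunction
  }

  @Override
  public DataStream<R> filter(DataStream<R> data, Predicate<R> fn) {
    return data.filter(fn::test); // wrapped as a Flink FilterFunction
  }

  @Override
  public DataStream<R> repartition(DataStream<R> data, int parallelism) {
    // Round-robin redistribution; the parallelism takes effect on the
    // operator chained after the rebalance.
    return data.rebalance();
  }
}
{code}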



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [HUDI-574] Fix CLI counts small file inserts as updates (#1321)

2020-02-13 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 01c868a  [HUDI-574] Fix CLI counts small file inserts as updates 
(#1321)
01c868a is described below

commit 01c868ab86e33161b03d422c03af41f01947ea06
Author: lamber-ken 
AuthorDate: Thu Feb 13 22:20:58 2020 +0800

[HUDI-574] Fix CLI counts small file inserts as updates (#1321)
---
 hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
index 8989820..1e17c4c 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
@@ -268,11 +268,11 @@ public class CommitsCommand implements CommandMarker {
       for (HoodieWriteStat stat : stats) {
         if (stat.getPrevCommit().equals(HoodieWriteStat.NULL_COMMIT)) {
           totalFilesAdded += 1;
-          totalRecordsInserted += stat.getNumWrites();
         } else {
           totalFilesUpdated += 1;
           totalRecordsUpdated += stat.getNumUpdateWrites();
         }
+        totalRecordsInserted += stat.getNumInserts();
         totalBytesWritten += stat.getTotalWriteBytes();
         totalWriteErrors += stat.getTotalWriteErrors();
       }



[GitHub] [incubator-hudi] leesf merged pull request #1321: [HUDI-574] Fix CLI counts small file inserts as updates

2020-02-13 Thread GitBox
leesf merged pull request #1321: [HUDI-574] Fix CLI counts small file inserts 
as updates
URL: https://github.com/apache/incubator-hudi/pull/1321
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585777129
 
 
   Had some success: if I added a property of simple type (i.e. not a nested 
object) to the object, hudi saves the parquet and the presto queries (select * 
and nested) appear to work.
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           dummyId: 2, // <-- added this property
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585777129
 
 
   Had some success: if I added a property of simple type (i.e. not a nested 
object) to the object, it appears to work.
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           dummyId: 2, // <-- added this property
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (7a0794a -> b04b037)

2020-02-13 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 7a0794a  Fix compile error after rebasing the branch
 add b04b037  [MINOR] Fix compile error after rebasing the branch

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (7a0794a)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (b04b037)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 hudi-test-suite/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585748045
 
 
   Further update:
   I've narrowed it down to the array on the nested object. 
   
   The following works when I place this object on the kinesis stream
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [],
     },
   };
   ```
   
   but when it is placed on the kinesis stream with an object in the array, it fails
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```
   
   error message and stack trace when running `select * from table`
   
   ```
   Query 20200213_131029_00032_hej8h failed: No value present
   java.util.NoSuchElementException: No value present
       at java.util.Optional.get(Optional.java:135)
       at com.facebook.presto.parquet.reader.ParquetReader.readArray(ParquetReader.java:156)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:282)
       at com.facebook.presto.parquet.reader.ParquetReader.readStruct(ParquetReader.java:193)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:276)
       at com.facebook.presto.parquet.reader.ParquetReader.readStruct(ParquetReader.java:193)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:276)
       at com.facebook.presto.parquet.reader.ParquetReader.readStruct(ParquetReader.java:193)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:276)
       at com.facebook.presto.parquet.reader.ParquetReader.readBlock(ParquetReader.java:268)
       at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:247)
       at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:225)
       at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:283)
       at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:274)
       at com.facebook.presto.spi.Page.getLoadedPage(Page.java:261)
       at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:254)
       at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
       at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
       at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
       at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
       at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
       at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
       at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
       at com.facebook.presto.$gen.Presto_0_22720200211_134743_1.run(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585748045
 
 
   Further update:
   I've narrowed it down to the array on the nested object. 
   
   The following works when I place this object on the kinesis stream
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [],
     },
   };
   ```
   
   but when it is placed on the kinesis stream with an object in the array, it fails
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```
   
   error message and stack trace
   
   ```
   Query 20200213_131029_00032_hej8h failed: No value present
   java.util.NoSuchElementException: No value present
       at java.util.Optional.get(Optional.java:135)
       at com.facebook.presto.parquet.reader.ParquetReader.readArray(ParquetReader.java:156)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:282)
       at com.facebook.presto.parquet.reader.ParquetReader.readStruct(ParquetReader.java:193)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:276)
       at com.facebook.presto.parquet.reader.ParquetReader.readStruct(ParquetReader.java:193)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:276)
       at com.facebook.presto.parquet.reader.ParquetReader.readStruct(ParquetReader.java:193)
       at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:276)
       at com.facebook.presto.parquet.reader.ParquetReader.readBlock(ParquetReader.java:268)
       at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:247)
       at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:225)
       at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:283)
       at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:274)
       at com.facebook.presto.spi.Page.getLoadedPage(Page.java:261)
       at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:254)
       at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
       at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
       at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
       at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
       at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
       at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
       at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
       at com.facebook.presto.$gen.Presto_0_22720200211_134743_1.run(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-13 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-585748045
 
 
   Further update:
   I've narrowed it down to the array on the nested object. 
   
   The following works when I place this object on the kinesis stream
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [],
     },
   };
   ```
   
   but when it is placed on the kinesis stream with an object in the array, it fails
   
   ```
   const valueToUse = {
     ...order,
     tickets: {
       items: [
         {
           ticketType: {
             id: 258815,
           },
         },
       ],
     },
   };
   ```
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (5f22849 -> 7a0794a)

2020-02-13 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


omit 5f22849  [HUDI-592] Remove duplicated dependencies in the pom file of 
test suite module
omit 044759a  [HUDI-503] Add hudi test suite documentation into the README 
file of the test suite module (#1191)
omit b66ba8d  [MINOR] Fix TestHoodieTestSuiteJob#testComplexDag failure
omit eaa4fb0  [HUDI-441] Rename WorkflowDagGenerator and some class names 
in test package
omit 33246c4  Fixed resource leak in HiveTestService about hive meta store
omit 7d67d5e  [HUDI-591] Support Spark version upgrade
omit 0456214  [HUDI-442] Fix 
TestComplexKeyGenerator#testSingleValueKeyGenerator and 
testMultipleValueKeyGenerator NPE
omit f8d25b1  [MINOR] Fix compile error about the deletion of 
HoodieActiveTimeline#createNewCommitTime
omit 53e73e5  [HUDI-391] Rename module name from hudi-bench to 
hudi-test-suite and fix some checkstyle issues (#1102)
omit d5549f5  [HUDI-394] Provide a basic implementation of test suite
 add 2bb0c21  Fix conversion of Spark struct type to Avro schema
 add 9b2944a  [MINOR] Refactor unnecessary boxing inside TypedProperties 
code (#1227)
 add 87fdb76  Adding util methods to assist in adding deletion support to 
Quick Start
 add 2b2f23a  Fixing delete util method
 add 2248fd9  Fixing checkstyle issues
 add 7aa3ce3  [MINOR] Fix redundant judgment statement (#1231)
 add dd09abb  [HUDI-335] Improvements to DiskBasedMap used by 
ExternalSpillableMap, for write and random/sequential read paths, by 
introducing bufferedRandmomAccessFile
 add 1daba24  Add GlobalDeleteKeyGenerator
 add b39458b  [MINOR] Make constant fields final in HoodieTestDataGenerator 
(#1234)
 add 8a3a503  [MINOR] Fix missing @Override annotation on 
BufferedRandomAccessFile method (#1236)
 add c2c0f6b  [HUDI-509] Renaming code in sync with cWiki restructuring 
(#1212)
 add baa6b5e  [HUDI-537] Introduce `repair overwrite-hoodie-props` CLI 
command (#1241)
 add 0a07752  [HUDI-527] scalastyle-maven-plugin moved to pluginManagement 
as it is only used in hoodie-spark and hoodie-cli modules.
 add 923e2b4  [HUDI-535] Ensure Compaction Plan is always written in .aux 
folder to avoid 0.5.0/0.5.1 reader-writer compatibility issues (#1229)
 add 292c1e2  [HUDI-238] Make Hudi support Scala 2.12 (#1226)
 add 5471d8f  [MINOR] Add toString method to TimelineLayoutVersion to make 
it more readable (#1244)
 add 3f4966d  [MINOR] Fix PMC in DOAP] (#1247)
 add d0ee95e  [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion 
(#1246)
 add 9489d0f  [HUDI-551] Abstract a test case class for DFS Source to make 
it extensible (#1239)
 add 7087e7d  [HUDI-556] Add lisence for PR#1233
 add ba54a7e  [HUDI-559] : Make the timeline layout version default to be 
null version
 add 6e59c1c  Moving to 0.5.2-SNAPSHOT on master branch.
 add 924bf51  [MINOR] Download KEYS file when validating release candidate 
(#1259)
 add b6e2993  [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles 
(#1263)
 add a54535e  [MINOR] Fix invalid maven repo address (#1265)
 add a46fea9  [MINOR] Change deploy_staging_jars script to take in scala 
version (#1269)
 add 8e3d81c  [MINOR] Change deploy_staging_jars script to take in scala 
version (#1270)
 add ed54eb2  [MINOR] Add missing licenses (#1271)
 add fc8d4a7  [MINOR] fix license issue (#1273)
 add 1e79cbc  [HUDI-549] update Github README with instructions to build 
with Scala 2.12 (#1275)
 add cdb028f  [MINOR] Fix missing groupId / version property of dependency
 add 56a4e0d  [MINOR] Fix invalid issue url & quickstart url (#1282)
 add 362a9b9  [MINOR] Remove junit-dep dependency
 add c06ec8b  [MINOR] Fix assigning to configuration more times (#1291)
 add 6f34be1  HUDI-117 Close file handle before throwing an exception due 
to append failure. Add test cases to handle/verify stage failure scenarios.
 add 652224e  [HUDI-578] Trim recordKeyFields and partitionPathFields in 
ComplexKeyGenerator (#1281)
 add f27c7a1  [HUDI-564] Added new test cases for HoodieLogFormat and 
HoodieLogFormatVersion.
 add 5b7bb14  [HUDI-583] Code Cleanup, remove redundant code, and other 
changes (#1237)
 add 0026234  [MINOR] Updated DOAP with 0.5.1 release (#1300)
 add fcf9e4a  [MINOR] Updated DOAP with 0.5.1 release (#1301)
 add d07ac58  Increase test coverage for HoodieReadClient
 add 347e297  [HUDI-596] Close KafkaConsumer every time (#1303)
 add 594da28  [HUDI-595] code cleanup, refactoring code out of PR# 1159 
(#1302)
 add 4de0fcf  [HUDI-566] Added new test cases for class HoodieTimeline, 
HoodieDefaultTimeline and HoodieActiveTimeline.
 add 425e3e6  [HUDI-585] Optimize the steps of building with scala-2.12 
(#1293)
 add