[GitHub] [incubator-hudi] zhaomin1423 commented on issue #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 commented on issue #1431: [HUDI-65]commitTime rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#issuecomment-601999146
 
 
   I have addressed all of the points mentioned above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhaomin1423 commented on a change in pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 commented on a change in pull request #1431: [HUDI-65]commitTime 
rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#discussion_r395964258
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SavepointsCommand.java
 ##
 @@ -63,36 +63,36 @@ public String showSavepoints() throws IOException {
   }
 
   @CliCommand(value = "savepoint create", help = "Savepoint a commit")
-  public String savepoint(@CliOption(key = {"commit"}, help = "Commit to savepoint") final String commitTime,
+  public String savepoint(@CliOption(key = {"commit"}, help = "Commit to savepoint") final String instantTime,
 
 Review comment:
   Thanks, I will fix it




[GitHub] [incubator-hudi] leesf commented on issue #1430: [hotfix] Improve code quality of HoodieAvroWriteSupport

2020-03-20 Thread GitBox
leesf commented on issue #1430: [hotfix] Improve code quality of 
HoodieAvroWriteSupport
URL: https://github.com/apache/incubator-hudi/pull/1430#issuecomment-601998600
 
 
   Warm welcome for your contributing @TisonKun . FYI: 
http://hudi.apache.org/contributing.html#contributing-code




[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


[ https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063775#comment-17063775 ]

lamber-ken commented on HUDI-716:
-

The life cycle of *.clean files (based on hudi-0.5.0): [HoodieActiveTimeline.java|https://github.com/apache/incubator-hudi/blob/release-0.5.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java]

!image-2020-03-21-13-37-17-039.png!
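For context, a minimal sketch of the read path that produces the exception below (the probe class is hypothetical; DataFileReader.openReader and the Hudi frames in the trace are real): the cleaner replays a pending .clean instant by opening it as an Avro data file, so a .clean file that is empty or not in Avro format fails the magic-byte check with "Not an Avro data file".

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.FileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Hypothetical probe mirroring the first step of AvroUtils.deserializeAvroMetadata:
// open a .clean file as an Avro data file and iterate its records.
public class CleanFileProbe {
  public static void main(String[] args) throws IOException {
    // e.g. .hoodie/20200115173412.clean -- the instant name is illustrative
    File cleanFile = new File(args[0]);
    // openReader validates the Avro magic bytes first; an empty or plain-text
    // file fails right here with "Not an Avro data file"
    try (FileReader<GenericRecord> reader =
        DataFileReader.openReader(cleanFile, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}
```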

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png, 
> image-2020-03-21-13-37-17-039.png
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down
> org.apache.hudi.exception.HoodieIOException: Not an Avro data file
>   at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
>   at org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>   at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>   at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>   at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
>   at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
>   at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>   at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>   at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: Not an Avro data file
>   at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
>   at org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>   at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
>   at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
>   ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-716:

Attachment: image-2020-03-21-13-37-17-039.png



[GitHub] [incubator-hudi] TisonKun closed pull request #1430: [hotfix] Improve code quality of HoodieAvroWriteSupport

2020-03-20 Thread GitBox
TisonKun closed pull request #1430: [hotfix] Improve code quality of 
HoodieAvroWriteSupport
URL: https://github.com/apache/incubator-hudi/pull/1430
 
 
   




[GitHub] [incubator-hudi] TisonKun commented on issue #1430: [hotfix] Improve code quality of HoodieAvroWriteSupport

2020-03-20 Thread GitBox
TisonKun commented on issue #1430: [hotfix] Improve code quality of 
HoodieAvroWriteSupport
URL: https://github.com/apache/incubator-hudi/pull/1430#issuecomment-601997962
 
 
   @yanghua it seems Travis is under maintenance, so it is offline and cannot
   give a result.
   
   @vinothchandar Thanks for the information. I opened this PR to get a sense
   of the Hudi community's taste. I will close this one and modify the code
   when we have a bugfix or feature request. BTW, I don't see a CONTRIBUTING
   guide in a conspicuous place. Can you share the link?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime 
rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#discussion_r395962069
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SavepointsCommand.java
 ##
 @@ -63,36 +63,36 @@ public String showSavepoints() throws IOException {
   }
 
   @CliCommand(value = "savepoint create", help = "Savepoint a commit")
-  public String savepoint(@CliOption(key = {"commit"}, help = "Commit to savepoint") final String commitTime,
+  public String savepoint(@CliOption(key = {"commit"}, help = "Commit to savepoint") final String instantTime,
 
 Review comment:
   this is specifically doing a savepoint for a commit.. so `commitTime` is 
actually apt
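   For reference, the CLI invocation this option backs looks like the following in the hudi-cli shell (the instant value here is purely illustrative):
   
   ```
   savepoint create --commit 20200320143025
   ```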




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime 
rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#discussion_r395962699
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -426,12 +426,12 @@ public HoodieRecord generateUpdateRecord(HoodieKey key, String commitTime) throw
   /**
* Generates deduped updates of keys previously inserted, randomly 
distributed across the keys above.
*
-   * @param commitTime Commit Timestamp
+   * @param instantTime Commit Timestamp
 
 Review comment:
   here too.. can you make a second pass everywhere
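   For illustration, the corrected Javadoc for the renamed parameter could read as follows; this is a sketch of the suggested second pass, not the PR's final wording:
   
   ```java
   /**
    * Generates deduped updates of keys previously inserted, randomly
    * distributed across the keys above.
    *
    * @param instantTime Instant timestamp to tag the generated updates with.
    */
   ```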




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime 
rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#discussion_r395962585
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -129,46 +129,46 @@ public static void writePartitionMetadata(FileSystem fs, String[] partitionPaths
* retaining the key if optionally provided.
*
* @param key  Hoodie key.
-   * @param commitTime  Commit time to use.
+   * @param instantTime  Commit time to use.
 
 Review comment:
   instant time to use? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime 
rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#discussion_r395962682
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -409,15 +409,15 @@ public HoodieRecord generateUpdateRecord(HoodieKey key, String commitTime) throw
* Generates new updates, randomly distributed across the keys above. There 
can be duplicates within the returned
* list
*
-   * @param commitTime Commit Timestamp
+   * @param instantTime Commit Timestamp
 
 Review comment:
   fix param doc?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
vinothchandar commented on a change in pull request #1431: [HUDI-65]commitTime 
rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#discussion_r395962757
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java
 ##
 @@ -160,15 +160,15 @@ private boolean isFileSliceCommitted(FileSlice slice) {
   }
 
   /**
-   * Obtain the latest file slice, upto a commitTime i.e <= maxCommitTime.
+   * Obtain the latest file slice, upto a instantTime i.e <= maxInstantTime.
 
 Review comment:
   an instant time




[GitHub] [incubator-hudi] Rajpratik71 commented on issue #1400: optimization debian package manager tweaks

2020-03-20 Thread GitBox
Rajpratik71 commented on issue #1400: optimization debian package manager tweaks
URL: https://github.com/apache/incubator-hudi/pull/1400#issuecomment-601996115
 
 
   > @Rajpratik71 : Just pinging to see if you are planning to work on this PR.
   
   Will do; currently occupied.




[GitHub] [incubator-hudi] vinothchandar commented on issue #1430: [hotfix] Improve code quality of HoodieAvroWriteSupport

2020-03-20 Thread GitBox
vinothchandar commented on issue #1430: [hotfix] Improve code quality of 
HoodieAvroWriteSupport
URL: https://github.com/apache/incubator-hudi/pull/1430#issuecomment-601993412
 
 
   @TisonKun can you please follow the contribution guide, where we suggest
   filing a JIRA and getting it signed off by a committer first?
   
   There is no hotfix prefix we support for PRs, and this IMO is not a hotfix..
   and I am not sure the change itself improves the code. Templatizing it is
   not very useful IMO.




[GitHub] [incubator-hudi] yanghua commented on issue #1430: [hotfix] Improve code quality of HoodieAvroWriteSupport

2020-03-20 Thread GitBox
yanghua commented on issue #1430: [hotfix] Improve code quality of 
HoodieAvroWriteSupport
URL: https://github.com/apache/incubator-hudi/pull/1430#issuecomment-601993124
 
 
   Thanks for your contribution @TisonKun. Travis is red; can you please
   have a look?




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #223

2020-03-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.42 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: '0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Updated] (HUDI-720) NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-720:
--
Fix Version/s: 0.5.2

> NOTICE file needs to add more content based on the NOTICE files of the ASF 
> projects that hudi bundles
> -
>
> Key: HUDI-720
> URL: https://issues.apache.org/jira/browse/HUDI-720
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Based on Justin's suggestion on the general@ voting thread [1], the NOTICE file
> needs more work.
> [1]: 
> http://mail-archives.apache.org/mod_mbox/incubator-general/202003.mbox/%3C932F44A0-1CEE-4549-896B-70FB61EAA034%40classsoftware.com%3E





[jira] [Updated] (HUDI-720) NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-720:
--
Status: Open  (was: New)



[jira] [Closed] (HUDI-720) NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-720.
-
Resolution: Fixed

Fixed via master branch: c5030f77a0e63f609ed2c674bea00201b97d8bb6



[GitHub] [incubator-hudi] yanghua merged pull request #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-20 Thread GitBox
yanghua merged pull request #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417
 
 
   




[incubator-hudi] branch master updated: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles (#1417)

2020-03-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new c5030f7  [HUDI-720] NOTICE file needs to add more content based on the 
NOTICE files of the ASF projects that hudi bundles (#1417)
c5030f7 is described below

commit c5030f77a0e63f609ed2c674bea00201b97d8bb6
Author: vinoyang 
AuthorDate: Sat Mar 21 10:54:04 2020 +0800

[HUDI-720] NOTICE file needs to add more content based on the NOTICE files 
of the ASF projects that hudi bundles (#1417)

* [HUDI-720] NOTICE file needs to add more content based on the NOTICE 
files of the ASF projects that hudi bundles
---
 LICENSE |   9 
 NOTICE  | 146 +++-
 2 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/LICENSE b/LICENSE
index 28dfacd..ed8458a 100644
--- a/LICENSE
+++ b/LICENSE
@@ -321,3 +321,12 @@ Copyright (c) 2005, European Commission project OneLab 
under contract 034819 (ht
  License: http://www.apache.org/licenses/LICENSE-2.0
 
  
---
+
+ This product includes code from Apache commons-lang
+
+ * org.apache.hudi.common.util.collection.Pair adapted from 
org.apache.commons.lang3.tuple.Pair
+
+ Copyright 2001-2020 The Apache Software Foundation
+
+ Home page: https://commons.apache.org/proper/commons-lang/
+ License: http://www.apache.org/licenses/LICENSE-2.0
diff --git a/NOTICE b/NOTICE
index ecd4479..59e56b4 100644
--- a/NOTICE
+++ b/NOTICE
@@ -1,5 +1,149 @@
 Apache Hudi (incubating)
-Copyright 2019 and onwards The Apache Software Foundation
+Copyright 2019-2020 The Apache Software Foundation
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
+
+
+
+This product includes code from Apache Hive, which includes the following in
+its NOTICE file:
+
+  Apache Hive
+  Copyright 2008-2018 The Apache Software Foundation
+
+  This product includes software developed by The Apache Software
+  Foundation (http://www.apache.org/).
+
+  This project includes software licensed under the JSON license.
+
+
+
+This product includes code from Apache SystemML, which includes the following 
in
+its NOTICE file:
+
+  Apache SystemML
+  Copyright [2015-2018] The Apache Software Foundation
+
+  This product includes software developed at
+  The Apache Software Foundation (http://www.apache.org/).
+
+
+
+This product includes code from Apache Spark, which includes the following in
+its NOTICE file:
+
+  Apache Spark
+  Copyright 2014 and onwards The Apache Software Foundation.
+
+  This product includes software developed at
+  The Apache Software Foundation (http://www.apache.org/).
+
+
+  Export Control Notice
+  -
+
+  This distribution includes cryptographic software. The country in which you 
currently reside may have
+  restrictions on the import, possession, use, and/or re-export to another 
country, of encryption software.
+  BEFORE using any encryption software, please check your country's laws, 
regulations and policies concerning
+  the import, possession, or use, and re-export of encryption software, to see 
if this is permitted. See
+   for more information.
+
+  The U.S. Government Department of Commerce, Bureau of Industry and Security 
(BIS), has classified this
+  software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes 
information security software
+  using or performing cryptographic functions with asymmetric algorithms. The 
form and manner of this Apache
+  Software Foundation distribution makes it eligible for export under the 
License Exception ENC Technology
+  Software Unrestricted (TSU) exception (see the BIS Export Administration 
Regulations, Section 740.13) for
+  both object code and source code.
+
+  The following provides more details on the included cryptographic software:
+
+  This software uses Apache Commons Crypto 
(https://commons.apache.org/proper/commons-crypto/) to
+  support authentication, and encryption and decryption of data sent across 
the network between
+  services.
+
+
+  Metrics
+  Copyright 2010-2013 Coda Hale and Yammer, Inc.
+
+  This product includes software developed by Coda Hale and Yammer, Inc.
+
+  This product includes code derived from the JSR-166 project 
(ThreadLocalRandom, Striped64,
+  LongAdder), which was released with the following comments:
+
+  Written by Doug Lea with assistance from members of JCP JSR-166
+  Expert Group and released to the public domain, as explained at
+  

[GitHub] [incubator-hudi] yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-20 Thread GitBox
yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-601983292
 
 
   Thanks everyone, merging this PR!




[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
codecov-io edited a comment on issue #1431: [HUDI-65]commitTime rename to 
instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#issuecomment-601980537
 
 
   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=h1) Report
   > Merging [#1431](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/83fb9651f368351446bb6ef0c9ebde34cf61b809=desc) will **not change** coverage.
   > The diff coverage is `79.73%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1431/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=tree)
   
   ```diff
   @@Coverage Diff@@
   ## master#1431   +/-   ##
   =
 Coverage 67.56%   67.56%   
 Complexity  255  255   
   =
 Files   340  340   
 Lines 1651416514   
 Branches   1689 1689   
   =
 Hits  1115811158   
 Misses 4618 4618   
 Partials738  738   
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...c/main/java/org/apache/hudi/index/HoodieIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXguamF2YQ==)
 | `88.23% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../java/org/apache/hudi/index/InMemoryHashIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSW5NZW1vcnlIYXNoSW5kZXguamF2YQ==)
 | `87.87% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh)
 | `94.73% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...n/java/org/apache/hudi/index/hbase/HBaseIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvaGJhc2UvSEJhc2VJbmRleC5qYXZh)
 | `84.21% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==)
 | `78.09% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/utilities/HoodieSnapshotCopier.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90Q29waWVyLmphdmE=)
 | `14.10% <0.00%> (ø)` | `3.00 <0.00> (ø)` | |
   | 
[...che/hudi/utilities/sources/HiveIncrPullSource.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSGl2ZUluY3JQdWxsU291cmNlLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==)
 | `50.56% <50.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=)
 | `52.79% <63.63%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.27% <63.63%> (ø)` | `38.00 <0.00> (ø)` | |
   | ... and [24 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1431/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=footer). Last update [83fb965...c8d3141](https://codecov.io/gh/apache/incubator-hudi/pull/1431?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).

[GitHub] [incubator-hudi] codecov-io commented on issue #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
codecov-io commented on issue #1431: [HUDI-65]commitTime rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431#issuecomment-601980537
 
 

[GitHub] [incubator-hudi] zhaomin1423 opened a new pull request #1431: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 opened a new pull request #1431: [HUDI-65]commitTime rename to 
instantTime
URL: https://github.com/apache/incubator-hudi/pull/1431
 
 
   ## What is the purpose of the pull request
   Clean up nomenclature around commitTime and instantTime.
   
   ## Brief change log
   Rename commitTime to instantTime.
   
   ## Verify this pull request
   This pull request is a trivial rework / code cleanup without any test
   coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] TisonKun opened a new pull request #1430: [hotfix] Improve code quality of HoodieAvroWriteSupport

2020-03-20 Thread GitBox
TisonKun opened a new pull request #1430: [hotfix] Improve code quality of 
HoodieAvroWriteSupport
URL: https://github.com/apache/incubator-hudi/pull/1430
 
 
   ## Brief change log
   
   - Reduce raw use of a parameterized class (AvroWriteSupport)
   - Replace deprecated method call with successor
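
   A before/after sketch of the two items, assuming the raw type in question is parquet-avro's `AvroWriteSupport` and the deprecated call is its two-argument constructor (the record type argument shown here is an assumption, not taken from the PR):
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.parquet.avro.AvroSchemaConverter;
   import org.apache.parquet.avro.AvroWriteSupport;
   import org.apache.parquet.schema.MessageType;
   
   class WriteSupportSketch {
     static AvroWriteSupport<IndexedRecord> build(Schema avroSchema) {
       MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
       // Before: raw type plus the deprecated constructor, i.e.
       //   AvroWriteSupport ws = new AvroWriteSupport(parquetSchema, avroSchema);
       // After: an explicit type parameter and the successor constructor that
       // also takes a GenericData model.
       return new AvroWriteSupport<>(parquetSchema, avroSchema, GenericData.get());
     }
   }
   ```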
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit (*I'm not sure whether 
it is worth a JIRA*)

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [x] Necessary doc changes done or have another open PR(**not applicable**)
  
- [x] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.(**not applicable**)




[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-20 Thread Feichi Feng (Jira)


[ https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063697#comment-17063697 ]

Feichi Feng commented on HUDI-724:
--

Hi [~vbalaji], the number of partitions touched depends on our data. In the
screenshot I attached for "nogapAfterImprovement", it looks like >= 45
partitions were touched (parallelism was hard-coded at that time). We do have
a lot of files in a partition (I randomly checked one; it has 6000+ files,
probably caused by record size not being set correctly).

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 0.5h
>  Remaining Estimate: 47.5h
>
> When writing data, a gap was observed between Spark stages. Tracking down
> where the time was spent on the Spark driver showed it was the
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it
> uses a plain for-loop to get the list of small files for all partitions that
> the job is going to load data into, and the process is very slow when there
> are a lot of partitions to go through. While the operation runs on the Spark
> driver process, all other worker nodes sit idle waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations can
> be parallelized. The change I made is to pass the JavaSparkContext to the
> UpsertPartitioner, create an RDD of the partitions, and eventually send the
> get-small-files operations to multiple tasks.
>  
> Screenshots attached:
> the gap without the improvement
> the spark stage with the improvement (no gap)
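A minimal sketch of the parallelization described above (illustrative only: getSmallFilesForPartition stands in for the per-partition listing logic inside the real UpsertPartitioner, and one task per partition is an assumption):

```java
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

class SmallFilesLookup {
  // Distribute the small-file lookups across executors instead of running a
  // sequential loop on the driver, then collect the results back as a map.
  static Map<String, List<String>> getSmallFilesForPartitions(
      JavaSparkContext jsc, List<String> partitionPaths) {
    int parallelism = Math.max(partitionPaths.size(), 1);
    return jsc.parallelize(partitionPaths, parallelism)
        .mapToPair(path -> new Tuple2<>(path, getSmallFilesForPartition(path)))
        .collectAsMap();
  }

  // Placeholder for the real file-listing logic (hypothetical).
  static List<String> getSmallFilesForPartition(String partitionPath) {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```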





[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
codecov-io edited a comment on issue #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#issuecomment-601964769
 
 
   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=h1) Report
   > Merging [#1421](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/83fb9651f368351446bb6ef0c9ebde34cf61b809=desc) will **increase** coverage by `0.00%`.
   > The diff coverage is `94.73%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1421/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=tree)
   
   ```diff
   @@Coverage Diff@@
   ## master#1421   +/-   ##
   =
 Coverage 67.56%   67.57%   
 Complexity  255  255   
   =
 Files   340  340   
 Lines 1651416520+6 
 Branches   1689 1690+1 
   =
   + Hits  1115811163+5 
 Misses 4618 4618   
   - Partials738  739+1 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1421/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==)
 | `78.09% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1421/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `89.94% <92.85%> (-0.06%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1421/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `69.77% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieMergeOnReadTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1421/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllTWVyZ2VPblJlYWRUYWJsZS5qYXZh)
 | `85.62% <100.00%> (-0.18%)` | `0.00 <0.00> (ø)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=footer). Last update [83fb965...201b60c](https://codecov.io/gh/apache/incubator-hudi/pull/1421?src=pr=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   




[GitHub] [incubator-hudi] codecov-io commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
codecov-io commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for 
partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#issuecomment-601964769
 
 


[GitHub] [incubator-hudi] ffcchi commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
ffcchi commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for 
partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#issuecomment-601958868
 
 
   * removed more unnecessary JavaSparkContext passing. 
   * rebased against master




[GitHub] [incubator-hudi] bwu2 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-03-20 Thread GitBox
bwu2 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits 
error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-601916430
 
 
   That is because I copied and pasted the folder to a new location for
   troubleshooting. The actual creation times match the commit times pretty
   closely.




[jira] [Updated] (HUDI-728) Support for complex record keys with TimestampBasedKeyGenerator

2020-03-20 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-728:
--
Status: Open  (was: New)

> Support for complex record keys with TimestampBasedKeyGenerator
> ---
>
> Key: HUDI-728
> URL: https://issues.apache.org/jira/browse/HUDI-728
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.0
>
>
> We have TimestampBasedKeyGenerator for defining custom partition paths, and we 
> have ComplexKeyGenerator for supporting a combination of fields as the 
> record key or partition key. 
>  
> However, we do not have support for the case where one wants a combination 
> of fields as the record key along with being able to define custom 
> partition paths. This use case recently came up at my organisation. 
>  
> How about having a CustomTimestampBasedKeyGenerator which supports the above 
> use case? This class can simply extend TimestampBasedKeyGenerator and allow 
> users to have a combination of fields as the record key.
>  
> We will try to keep the implementation as generic as possible. 
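
For illustration, a rough sketch of what such a generator could look like. This is hypothetical: the import paths, the config-key handling, and the key format below are assumptions made for the sketch, not Hudi's actual implementation.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.util.TypedProperties;

// Hypothetical sketch: combine several fields into the record key
// (ComplexKeyGenerator-style) while delegating the custom partition
// path to TimestampBasedKeyGenerator.
public class CustomTimestampBasedKeyGenerator extends TimestampBasedKeyGenerator {

  private final List<String> recordKeyFields;

  public CustomTimestampBasedKeyGenerator(TypedProperties props) {
    super(props);
    // Assumed: the same comma-separated record-key config used elsewhere.
    this.recordKeyFields = Arrays.asList(
        props.getString("hoodie.datasource.write.recordkey.field").split(","));
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    // Build the combined record key as field:value pairs.
    String recordKey = recordKeyFields.stream()
        .map(f -> f + ":" + record.get(f))
        .collect(Collectors.joining(","));
    // Reuse the parent's timestamp-based partition-path computation.
    String partitionPath = super.getKey(record).getPartitionPath();
    return new HoodieKey(recordKey, partitionPath);
  }
}
{code}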



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] ffcchi opened a new pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
ffcchi opened a new pull request #1421: [HUDI-724] Parallelize getSmallFiles 
for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421
 
 
   
   ## What is the purpose of the pull request
   
   *Parallelize the operation of getting small files for partitions when 
constructing the UpsertPartitioner, for performance improvement.*
   
   ## Brief change log
   
 - *pass through JavaSparkContext*
 - *use RDD for parallelism (see the sketch below)*
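   
   A minimal sketch of the idea (the partition list, the parallelism, and the 
`getSmallFilesForPartition` helper are placeholders for illustration, not the 
actual Hudi code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SmallFilesSketch {

  // Placeholder for the real lookup, which lists a partition's data files
  // and keeps those below the small-file size limit.
  static List<String> getSmallFilesForPartition(String partitionPath) {
    return Arrays.asList(partitionPath + "/small-file-1.parquet");
  }

  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setMaster("local[2]").setAppName("small-files-sketch"));
    List<String> partitionPaths = Arrays.asList("2020/03/19", "2020/03/20", "2020/03/21");

    // Each partition's listing is independent, so run one task per partition
    // on the executors and collect the results back to the driver as a map.
    Map<String, List<String>> smallFilesPerPartition =
        jsc.parallelize(partitionPaths, partitionPaths.size())
           .mapToPair(p -> new Tuple2<>(p, getSmallFilesForPartition(p)))
           .collectAsMap();

    smallFilesPerPartition.forEach((p, files) -> System.out.println(p + " -> " + files));
    jsc.stop();
  }
}
```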
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [x] CI is green
   
- [x] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ffcchi closed pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
ffcchi closed pull request #1421: [HUDI-724] Parallelize getSmallFiles for 
partitions
URL: https://github.com/apache/incubator-hudi/pull/1421
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-03-20 Thread GitBox
lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-601864480
 
 
   I don't quite understand: why are these creation times all so similar?
   
   
![image](https://user-images.githubusercontent.com/20113411/77197568-64a6cd00-6b20-11ea-8d0f-d7b53d4151bf.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063573#comment-17063573
 ] 

lamber-ken edited comment on HUDI-716 at 3/20/20, 6:51 PM:
---

Hi [~vbalaji], I agree with your point. From the code we can see that it 
creates the new file first, then writes the content.

 

[https://github.com/apache/incubator-hudi/blob/release-0.5.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java]

!image-2020-03-21-02-45-25-099.png!


was (Author: lamber-ken):
Hi [~vbalaji], I agree with your point. From the code we can see that it 
creates the new file first, then writes the content.

 

HoodieActiveTimeline#createFileInPath

!image-2020-03-21-02-45-25-099.png!
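
For reference, a condensed sketch of that create-then-write pattern (simplified from the linked method, not the verbatim source), showing why a crash between the two steps leaves a zero-byte file that later fails with "Not an Avro data file":

{code:java}
import java.io.IOException;
import java.util.Optional;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class CreateThenWriteSketch {
  // Simplified from HoodieActiveTimeline#createFileInPath (0.5.0 era).
  static void createFileInPath(FileSystem fs, Path fullPath, Optional<byte[]> content)
      throws IOException {
    // Step 1: create the file; it now exists on storage with 0 bytes.
    try (FSDataOutputStream out = fs.create(fullPath, false)) {
      // A crash at this point leaves the empty, non-Avro file behind.
      // Step 2: only now is the metadata content written.
      if (content.isPresent()) {
        out.write(content.get());
      }
    }
  }
}
{code}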

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063574#comment-17063574
 ] 

Balaji Varadarajan commented on HUDI-724:
-

Thanks [~Feichi Feng] for the information. This makes sense to me.

Can you tell us how many partitions the first insert touched, and how many 
files per partition?

For upserts and deletes, the expected reduction in the listing calls is 
proportional to (Number of files touched)/(Number of partitions). How much 
speedup does the delete see with caching on versus off?

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between spark stages. By tracking down 
> where the time was spent on the spark driver, it turned out to be the 
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop to get the list of small files for all partitions 
> that the job is going to load data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the spark driver process, all other worker nodes sit idle waiting for 
> tasks.
> Since the partitions don't affect each other, the get-small-files 
> operations can be parallelized. The change I made is to pass the 
> JavaSparkContext to the UpsertPartitioner, and create an RDD for the 
> partitions, eventually sending the get-small-files operations to multiple 
> tasks.
>  
> Screenshots attached for: 
> the gap without the improvement
> the spark stage with the improvement (no gap)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063573#comment-17063573
 ] 

lamber-ken commented on HUDI-716:
-

Hi [~vbalaji], I agree with your point. From the code we can see that it 
creates the new file first, then writes the content.

 

HoodieActiveTimeline#createFileInPath

!image-2020-03-21-02-45-25-099.png!

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-716:

Attachment: image-2020-03-21-02-45-25-099.png

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-02-45-25-099.png
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] ffcchi commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
ffcchi commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for 
partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#issuecomment-601843072
 
 
   Integration testing failed after the update; looking into it. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ffcchi commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
ffcchi commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for 
partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#issuecomment-601835210
 
 
   PR updated based on comments. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhaomin1423 closed pull request #1429: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 closed pull request #1429: [HUDI-65]commitTime rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1429
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-03-20 Thread GitBox
zhedoubushishi commented on a change in pull request #1406: [HUDI-713] Fix 
conversion of Spark array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#discussion_r395789120
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -81,10 +82,12 @@
   + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": \"end_lon\", \"type\": \"double\"},"
   + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", \"name\":\"fare\",\"fields\": ["
   + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", \"type\": \"string\"}]}},"
+  + "{\"name\": \"tip_history\", \"type\": {\"type\": \"array\", \"items\": {\"type\": \"record\", \"name\": \"tip\", \"fields\": ["
+  + "{\"name\": \"amount\", \"type\": \"double\"}, {\"name\": \"currency\", \"type\": \"string\"}]}}},"
 
 Review comment:
   Right. Replaced now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-20 Thread Feichi Feng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063537#comment-17063537
 ] 

Feichi Feng commented on HUDI-724:
--

Hi all, regarding the question of why the timeline server is not helping:

In my prototype, it's a single spark job. Within the job, it first does 
inserts, then deletes the old version of the data (due to the data modeling, 
the new records and old records are under different primary keys).

So when the spark app starts, it first tries to do the inserts, while nothing 
is in the timeline server yet. For the inserts, it goes through the code path 
to getSmallFiles in the non-parallelized for-loop, which this PR is trying to 
improve. Maybe because the writes are inserts only, it didn't go through the 
code path for "bloom index lookup and populate small files to timeline 
server".

However, with the embedded timeline server on, the subsequent delete 
operations ran faster, since by that time the timeline server already had the 
cache populated by the insert operation.
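
As a side note, enabling the embedded timeline server goes through the write config; a minimal sketch (builder method name as I recall it from the client API, so please verify against your Hudi version):

{code:java}
import org.apache.hudi.config.HoodieWriteConfig;

class TimelineServerConfigSketch {
  // With the embedded timeline server enabled, file listings cached by an
  // earlier operation (the inserts) can be reused by later operations
  // (the deletes) within the same write client.
  static HoodieWriteConfig buildConfig(String basePath) {
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withEmbeddedTimelineServerEnabled(true) // hoodie.embed.timeline.server
        .build();
  }
}
{code}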

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between spark stages. By tracking down 
> where the time was spent on the spark driver, it turned out to be the 
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop to get the list of small files for all partitions 
> that the job is going to load data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the spark driver process, all other worker nodes sit idle waiting for 
> tasks.
> Since the partitions don't affect each other, the get-small-files 
> operations can be parallelized. The change I made is to pass the 
> JavaSparkContext to the UpsertPartitioner, and create an RDD for the 
> partitions, eventually sending the get-small-files operations to multiple 
> tasks.
>  
> Screenshots attached for: 
> the gap without the improvement
> the spark stage with the improvement (no gap)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] zhaomin1423 closed pull request #1428: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 closed pull request #1428: [HUDI-65]commitTime rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1428
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhaomin1423 commented on issue #1428: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 commented on issue #1428: [HUDI-65]commitTime rename to instantTime
URL: https://github.com/apache/incubator-hudi/pull/1428#issuecomment-601811435
 
 
   Sorry, there was a problem.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhaomin1423 opened a new pull request #1429: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 opened a new pull request #1429: [HUDI-65]commitTime rename to 
instantTime
URL: https://github.com/apache/incubator-hudi/pull/1429
 
 
   ## What is the purpose of the pull request
   Cleanup of the nomenclature around commitTime and instantTime.
   
   ## Brief change log
   commitTime renamed to instantTime.
   
   ## Verify this pull request
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063493#comment-17063493
 ] 

Balaji Varadarajan commented on HUDI-716:
-

[~lamber-ken]: My understanding is that this is due to crashes when we create 
the clean inflight file and transition it to the completed file (in the 0.5.0 
code). But it would greatly help if you could objectively take a fresh look 
at the code and try to repro why this is happening.

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063476#comment-17063476
 ] 

lamber-ken commented on HUDI-716:
-

Also, we need to figure out which jobs created these zero-byte files.

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-20 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063474#comment-17063474
 ] 

Alexander Filipchik commented on HUDI-723:
--

{code:java}
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
at 
org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:35)
at 
org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
at 
org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at 
org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:34)
at org.apache.spark.api.java.JavaDoubleRDD.sum(JavaDoubleRDD.scala:165)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:387)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:234)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
at org.apache.hudi.utilities.TestCss.testUpsertComplex(TestCss.java:142)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 

[GitHub] [incubator-hudi] bvaradar commented on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-20 Thread GitBox
bvaradar commented on issue #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-601773835
 
 
   @pratyakshsharma : Per your previous comment, I was waiting for you to give a 
green signal :)  I will look at it sometime today. There are conflicts in this 
PR; if you can resolve them, that would be great. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-20 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063471#comment-17063471
 ] 

Alexander Filipchik commented on HUDI-721:
--

Serialization works on staging with the fix, but the job can't complete due 
to HUDI-722.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> I was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro 
> schema generator (the convertStructTypeToAvroSchema method), but after some 
> debugging I'm pretty sure the issue is somewhere in AvroConversionHelper.
> What happens: when a complex type is extracted using SqlTransformer (using 
> select bla fro ), where bla is a complex type with arrays of structs, Kryo 
> serialization breaks with:
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>  

[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063470#comment-17063470
 ] 

Vinoth Chandar commented on HUDI-724:
-

+1. Given we have a reproducible setup, let's get to the bottom of this if we 
can, please. TimelineServer helps reduce the listings considerably, and 
having this working smoothly for S3 means we can enable it by default in the 
next release. 

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between spark stages. By tracking down 
> where the time was spent on the spark driver, it turned out to be the 
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop to get the list of small files for all partitions 
> that the job is going to load data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the spark driver process, all other worker nodes sit idle waiting for 
> tasks.
> Since the partitions don't affect each other, the get-small-files 
> operations can be parallelized. The change I made is to pass the 
> JavaSparkContext to the UpsertPartitioner, and create an RDD for the 
> partitions, eventually sending the get-small-files operations to multiple 
> tasks.
>  
> Screenshots attached for: 
> the gap without the improvement
> the spark stage with the improvement (no gap)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated (eeab532 -> 83fb965)

2020-03-20 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from eeab532  [HUDI-725] Remove init log in the constructor of DeltaSync 
(#1425)
 add 83fb965  [HUDI-650] Modify handleUpdate path to validate partitionPath 
(#1368)

No new revisions were added by this update.

Summary of changes:
 .../execution/MergeOnReadLazyInsertIterable.java   |  6 +-
 .../org/apache/hudi/io/HoodieAppendHandle.java | 20 +++--
 .../org/apache/hudi/io/HoodieCreateHandle.java |  2 +-
 .../java/org/apache/hudi/io/HoodieMergeHandle.java | 25 +++---
 .../java/org/apache/hudi/io/HoodieWriteHandle.java |  5 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  | 39 ++
 .../apache/hudi/table/HoodieMergeOnReadTable.java  |  8 +-
 .../org/apache/hudi/table/WorkloadProfile.java |  4 +
 .../compact/HoodieMergeOnReadTableCompactor.java   |  3 +-
 .../hudi/client/TestUpdateSchemaEvolution.java |  3 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java| 26 +++
 .../apache/hudi/table/TestMergeOnReadTable.java| 90 --
 12 files changed, 170 insertions(+), 61 deletions(-)



[GitHub] [incubator-hudi] bvaradar merged pull request #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-20 Thread GitBox
bvaradar merged pull request #1368: [HUDI-650] Modify handleUpdate path to 
validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063454#comment-17063454
 ] 

Balaji Varadarajan commented on HUDI-724:
-

[~uditme] : The PR also looks good to me. Regarding the speedup with the 
timeline server, the cache loading (file-listing) does support concurrency. 
Can you provide the stage times with the embedded server turned on? My belief 
is that you should see reduced time taken in getSmallFiles(), as the cache 
would have been populated during the bloom index lookup, and the bloom index 
lookup calls are also parallelized. So we need to understand why you are not 
seeing considerable improvements. If we can have the cache loading time 
optimized and caching enabled, it would avoid redundant listing calls made 
during the upsert call.

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between spark stages. By tracking down 
> where the time was spent on the spark driver, it's get-small-files operation 
> for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop for get the list of small files for all partitions 
> that the load is going to load data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on spark driver process, all other worker nodes are sitting idle waiting for 
> tasks.
> For all those partitions, they don't affect each other, so the 
> get-small-files operations can be parallelized. The change I made is to pass 
> the JavaSparkContext to the UpsertPartitioner, and create RDD for the 
> partitions and eventually send the get small files operations to multiple 
> tasks.
>  
> screenshot attached for 
> the gap without the improvement
> the spark stage with the improvement (no gap)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-400) Add more checks to TestCompactionUtils#testUpgradeDowngrade

2020-03-20 Thread jerry (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jerry resolved HUDI-400.

Resolution: Fixed

> Add more checks to TestCompactionUtils#testUpgradeDowngrade
> ---
>
> Key: HUDI-400
> URL: https://issues.apache.org/jira/browse/HUDI-400
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie, Testing
>Reporter: leesf
>Assignee: jerry
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, TestCompactionUtils#testUpgradeDowngrade does not check the 
> upgrade from the old plan to the new plan; it would be good to add some 
> checks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test 
for ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#discussion_r395688239
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/common/HoodieTestCommitOperate.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.common;
+
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.hudi.avro.model.HoodieWriteStat;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieRollingStatMetadata;
+
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Utility methods to commit instant for test.
+ */
+public class HoodieTestCommitOperate {
 
 Review comment:
   Will we reuse this class for other test cases in the future? If not, can we 
move these utility methods into `TestArchivedCommitsCommand`? If yes, what do 
you think about renaming it to `HoodieTestCommitUtilities`? `Utilities` seems 
clearer than `Operate` here. WDYT?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test 
for ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#discussion_r395699029
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/common/HoodieTestCommandDataGenerator.java
 ##
 @@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.common;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Class to be used in tests to keep generating test inserts and updates 
against a corpus.
+ */
+public class HoodieTestCommandDataGenerator extends HoodieTestDataGenerator {
 
 Review comment:
   Why do we need to extend `HoodieTestDataGenerator`? We only reuse some 
static fields here, right? And wdyt about renaming it to 
`HoodieTestCommitMetadataGenerator`, based on the implementation of this class?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test 
for ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#discussion_r395692451
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/common/HoodieTestCommitOperate.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.common;
+
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.hudi.avro.model.HoodieWriteStat;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieRollingStatMetadata;
+
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Utility methods to commit instant for test.
+ */
+public class HoodieTestCommitOperate {
+
+  /**
+   * Converter HoodieCommitMetadata to avro format and ordered by partition.
+   */
+  public static org.apache.hudi.avro.model.HoodieCommitMetadata 
commitMetadataConverterOrdered(
+  HoodieCommitMetadata hoodieCommitMetadata) {
+return orderCommitMetadata(commitMetadataConverter(hoodieCommitMetadata));
+  }
+
+  /**
+   * Converter HoodieCommitMetadata to avro format.
+   */
+  public static org.apache.hudi.avro.model.HoodieCommitMetadata 
commitMetadataConverter(
 
 Review comment:
   IIUC, `converter` is a noun while `convert` is a verb. A method name usually describes a behavior, so it should start with a verb. Considering the function of this method, wdyt about renaming it to `convertCommitMetadata`? It's an open topic; feel free to share your thoughts.




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test 
for ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#discussion_r395700713
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/common/HoodieTestCommandDataGenerator.java
 ##
 @@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.common;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Class to be used in tests to keep generating test inserts and updates 
against a corpus.
+ */
+public class HoodieTestCommandDataGenerator extends HoodieTestDataGenerator {
+
+  // default commit metadata value
+  public static final String DEFAULT_PATH = "path";
+  public static final String DEFAULT_FILEID = "fileId";
+  public static final int DEFAULT_TOTAL_WRITE_BYTES = 50;
+  public static final String DEFAULT_PRE_COMMIT = "commit-1";
+  public static final int DEFAULT_NUM_WRITES = 10;
+  public static final int DEFAULT_NUM_UPDATE_WRITES = 15;
+  public static final int DEFAULT_TOTAL_LOG_BLOCKS = 1;
+  public static final int DEFAULT_TOTAL_LOG_RECORDS = 10;
+  public static final int DEFAULT_OTHER_VALUE = 0;
+  public static final String DEFAULT_NULL_VALUE = "null";
+
+  /**
+   * Create a commit file with default CommitMetadata.
+   */
+  public static void createCommitFileWithMetadata(String basePath, String 
commitTime, Configuration configuration) {
+Arrays.asList(HoodieTimeline.makeCommitFileName(commitTime), 
HoodieTimeline.makeInflightCommitFileName(commitTime),
+HoodieTimeline.makeRequestedCommitFileName(commitTime))
+.forEach(f -> {
+  Path commitFile = new Path(
+  basePath + "/" + HoodieTableMetaClient.METAFOLDER_NAME + "/" + 
f);
+  FSDataOutputStream os = null;
+  try {
+FileSystem fs = FSUtils.getFs(basePath, configuration);
+os = fs.create(commitFile, true);
+// Generate commitMetadata
+HoodieCommitMetadata commitMetadata = 
generateCommitMetadata(basePath);
+// Write empty commit metadata
+os.writeBytes(new 
String(commitMetadata.toJsonString().getBytes(StandardCharsets.UTF_8)));
+  } catch (IOException ioe) {
+throw new HoodieIOException(ioe.getMessage(), ioe);
+  } finally {
+if (null != os) {
+  try {
+os.close();
+  } catch (IOException e) {
+throw new HoodieIOException(e.getMessage(), e);
+  }
+}
+  }
+});
+  }
+
+  /**
+   * Generate commitMetadata in path.
+   */
+  public static HoodieCommitMetadata generateCommitMetadata(String basePath) 
throws IOException {
+String file1P0C0 =
+HoodieTestUtils.createNewDataFile(basePath, 
DEFAULT_FIRST_PARTITION_PATH, "000");
+String file1P1C0 =
+HoodieTestUtils.createNewDataFile(basePath, 
DEFAULT_SECOND_PARTITION_PATH, "000");
+return generateCommitMetadata(new ImmutableMap.Builder()
+  .put(DEFAULT_FIRST_PARTITION_PATH, new 
ImmutableList.Builder<>().add(file1P0C0).build())
+  .put(DEFAULT_SECOND_PARTITION_PATH, new 
ImmutableList.Builder<>().add(file1P1C0).build())
+  .build());
+  }
+
+
+
+  /**
+   * Method to generate commit metadata.
+   */
+  public static HoodieCommitMetadata generateCommitMetadata(Map<String, List<String>>

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test 
for ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#discussion_r395704673
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestArchivedCommitsCommand.java
 ##
 @@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.common.HoodieTestCommandDataGenerator;
+import org.apache.hudi.cli.common.HoodieTestCommitOperate;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieCommitArchiveLog;
+
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Test Cases for {@link ArchivedCommitsCommand}.
+ */
+public class TestArchivedCommitsCommand extends AbstractShellIntegrationTest {
+
+  private String tablePath;
+
+  @Before
+  public void init() throws IOException {
+initDFS();
+jsc.hadoopConfiguration().addResource(dfs.getConf());
+HoodieCLI.conf = dfs.getConf();
+
+// Create table and connect
+String tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+new TableCommand().createTable(
+tablePath, tableName,
+"COPY_ON_WRITE", "", 1, 
"org.apache.hudi.common.model.HoodieAvroPayload");
+
+metaClient = HoodieCLI.getTableMetaClient();
+
+// Generate archive
+HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(tablePath)
+
.withSchema(HoodieTestCommandDataGenerator.TRIP_EXAMPLE_SCHEMA).withParallelism(2,
 2)
+
.withCompactionConfig(HoodieCompactionConfig.newBuilder().retainCommits(1).archiveCommitsWith(2,
 3).build())
+.forTable("test-trip-table").build();
+
+// Create six commits
+for (int i = 100; i < 106; i++) {
+  String timestamp = String.valueOf(i);
+  // Requested Compaction
+  
HoodieTestCommandDataGenerator.createCompactionAuxiliaryMetadata(tablePath,
+  new HoodieInstant(HoodieInstant.State.REQUESTED, 
HoodieTimeline.COMPACTION_ACTION, timestamp), dfs.getConf());
+  // Inflight Compaction
+  
HoodieTestCommandDataGenerator.createCompactionAuxiliaryMetadata(tablePath,
+  new HoodieInstant(HoodieInstant.State.INFLIGHT, 
HoodieTimeline.COMPACTION_ACTION, timestamp), dfs.getConf());
+  HoodieTestCommandDataGenerator.createCommitFileWithMetadata(tablePath, 
timestamp, dfs.getConf());
+}
+
+metaClient = HoodieTableMetaClient.reload(metaClient);
+// reload the timeline and get all the commits before archive
+HoodieTimeline timeline = 
metaClient.getActiveTimeline().reload().getAllCommitsTimeline().filterCompletedInstants();
+assertEquals("Loaded 6 commits and the count should match", 6, 
timeline.countInstants());
+
+// archive
+HoodieCommitArchiveLog archiveLog = new HoodieCommitArchiveLog(cfg, 
metaClient);
+assertTrue(archiveLog.archiveIfRequired(jsc));
+  }
+
+  @After
+  public void clean() throws IOException {
+cleanupDFS();
+  }
+
+  /**
+   * Test for command: show archived commit stats.
+   */
+  @Test
+  public void testShowArchivedCommits() {
+CommandResult cr = getShell().executeCommand("show archived commit stats");
+assertTrue(cr.isSuccess());
+
+TableHeader header = new 
TableHeader().addTableHeaderField("action").addTableHeaderField("instant")
+

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
yanghua commented on a change in pull request #1424: [HUDI-697]Add unit test 
for ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#discussion_r395692986
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/common/HoodieTestCommitOperate.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.common;
+
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.hudi.avro.model.HoodieWriteStat;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieRollingStatMetadata;
+
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Utility methods to commit instant for test.
+ */
+public class HoodieTestCommitOperate {
+
+  /**
+   * Converter HoodieCommitMetadata to avro format and ordered by partition.
+   */
+  public static org.apache.hudi.avro.model.HoodieCommitMetadata 
commitMetadataConverterOrdered(
 
 Review comment:
   Same thought about `converter`; see the other relevant comments in this class.




[jira] [Updated] (HUDI-65) Cleanup nomenclature around commitTime and instantTime #256

2020-03-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-65:
---
Labels: pull-request-available  (was: )

> Cleanup nomenclature around commitTime and instantTime #256
> ---
>
> Key: HUDI-65
> URL: https://issues.apache.org/jira/browse/HUDI-65
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup, newbie, Writer Core
>Reporter: Vinoth Chandar
>Assignee: jerry
>Priority: Major
>  Labels: pull-request-available
>
> We seem to use "commitTime" as a variable name in many places, where the 
> action is not "COMMIT". They should be renamed to "instantTime" 
> [https://github.com/uber/hudi/issues/256]





[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
bvaradar commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395694178
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
 ##
 @@ -486,11 +486,11 @@ private void 
saveWorkloadProfileMetadataToInflight(WorkloadProfile profile, Hood
 return updateIndexAndCommitIfNeeded(writeStatusRDD, hoodieTable, 
commitTime);
   }
 
-  private Partitioner getPartitioner(HoodieTable table, boolean isUpsert, 
WorkloadProfile profile) {
+  private Partitioner getPartitioner(HoodieTable table, boolean isUpsert, 
WorkloadProfile profile, JavaSparkContext jsc) {
 
 Review comment:
   +1




[GitHub] [incubator-hudi] zhaomin1423 opened a new pull request #1428: [HUDI-65]commitTime rename to instantTime

2020-03-20 Thread GitBox
zhaomin1423 opened a new pull request #1428: [HUDI-65]commitTime rename to 
instantTime
URL: https://github.com/apache/incubator-hudi/pull/1428
 
 
   
   ## What is the purpose of the pull request
   
   Cleanup nomenclature around commitTime and instantTime 
   
   ## Brief change log
   
   commitTime renamed to instantTime
   
   ## Verify this pull request
   
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395664211
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
 ##
 @@ -486,11 +486,11 @@ private void 
saveWorkloadProfileMetadataToInflight(WorkloadProfile profile, Hood
 return updateIndexAndCommitIfNeeded(writeStatusRDD, hoodieTable, 
commitTime);
   }
 
-  private Partitioner getPartitioner(HoodieTable table, boolean isUpsert, 
WorkloadProfile profile) {
+  private Partitioner getPartitioner(HoodieTable table, boolean isUpsert, 
WorkloadProfile profile, JavaSparkContext jsc) {
 
 Review comment:
   Nice catch, the HoodieWriteClient does have jsc already. Will update.




[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-20 Thread GitBox
pratyakshsharma commented on issue #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-601717380
 
 
   @bvaradar got a chance to take a pass?




[jira] [Created] (HUDI-728) Support for complex record keys with TimestampBasedKeyGenerator

2020-03-20 Thread Pratyaksh Sharma (Jira)
Pratyaksh Sharma created HUDI-728:
-

 Summary: Support for complex record keys with 
TimestampBasedKeyGenerator
 Key: HUDI-728
 URL: https://issues.apache.org/jira/browse/HUDI-728
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: DeltaStreamer, Usability
Reporter: Pratyaksh Sharma
Assignee: Pratyaksh Sharma
 Fix For: 0.6.0


We have TimestampBasedKeyGenerator for defining custom partition paths, and we
have ComplexKeyGenerator for supporting a combination of fields as the record
key or partition key.
 
However, we do not support the case where one wants a combination of fields as
the record key while also being able to define custom partition paths. This use
case recently came up at my organisation.
 
How about having a CustomTimestampBasedKeyGenerator which supports the above use
case? This class can simply extend TimestampBasedKeyGenerator and allow users to
use a combination of fields as the record key.
 
We will try to keep the implementation as generic as possible.
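
For illustration, a minimal sketch of what such a generator could look like (the class name, the property handling, and the package locations here are assumptions for the sketch, not an existing Hudi API):
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator;

public class CustomTimestampBasedKeyGenerator extends TimestampBasedKeyGenerator {

  private final List<String> recordKeyFields;

  public CustomTimestampBasedKeyGenerator(TypedProperties config) {
    // TimestampBasedKeyGenerator expects a single record key field, so hand the
    // parent a copy of the config carrying only the first field; the composite
    // record key is assembled in getKey() below.
    super(withSingleRecordKeyField(config));
    this.recordKeyFields = Arrays.stream(
        config.getString("hoodie.datasource.write.recordkey.field").split(","))
        .map(String::trim).collect(Collectors.toList());
  }

  private static TypedProperties withSingleRecordKeyField(TypedProperties config) {
    TypedProperties copy = new TypedProperties(config);
    copy.setProperty("hoodie.datasource.write.recordkey.field",
        config.getString("hoodie.datasource.write.recordkey.field").split(",")[0].trim());
    return copy;
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    // "field1:value1,field2:value2", the same shape ComplexKeyGenerator produces
    String recordKey = recordKeyFields.stream()
        .map(f -> f + ":" + record.get(f))
        .collect(Collectors.joining(","));
    // delegate the timestamp-based partition path computation to the parent
    String partitionPath = super.getKey(record).getPartitionPath();
    return new HoodieKey(recordKey, partitionPath);
  }
}
{code}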





[jira] [Assigned] (HUDI-65) Cleanup nomenclature around commitTime and instantTime #256

2020-03-20 Thread jerry (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jerry reassigned HUDI-65:
-

Assignee: jerry

> Cleanup nomenclature around commitTime and instantTime #256
> ---
>
> Key: HUDI-65
> URL: https://issues.apache.org/jira/browse/HUDI-65
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup, newbie, Writer Core
>Reporter: Vinoth Chandar
>Assignee: jerry
>Priority: Major
>
> We seem to use "commitTime" as a variable name in many places, where the 
> action is not "COMMIT". They should be renamed to "instantTime" 
> [https://github.com/uber/hudi/issues/256]





[jira] [Updated] (HUDI-65) Cleanup nomenclature around commitTime and instantTime #256

2020-03-20 Thread jerry (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jerry updated HUDI-65:
--
Status: In Progress  (was: Open)

> Cleanup nomenclature around commitTime and instantTime #256
> ---
>
> Key: HUDI-65
> URL: https://issues.apache.org/jira/browse/HUDI-65
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup, newbie, Writer Core
>Reporter: Vinoth Chandar
>Assignee: jerry
>Priority: Major
>
> We seem to use "commitTime" as a variable name in many places, where the 
> action is not "COMMIT". They should be renamed to "instantTime" 
> [https://github.com/uber/hudi/issues/256]





[jira] [Updated] (HUDI-727) Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-727:

Labels: pull-request-available  (was: )

> Copy default values of fields if not present when rewriting incoming record 
> with new schema
> ---
>
> Key: HUDI-727
> URL: https://issues.apache.org/jira/browse/HUDI-727
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently we recommend that users evolve the schema in a backwards-compatible
> way. When evolving the schema in a backwards-compatible way, one of the most
> significant things to do is to define a default value for newly added columns,
> so that records published with the previous schema can also be consumed
> properly.
>  
> However, just before actually writing a record to the Hudi dataset, we rewrite
> the record with a new Avro schema that has the Hudi metadata columns [1]. In
> this function, we only get the values from the record without considering a
> field's default value. As a result, schema validation fails.
> IMO, this piece of code should take the default value into account as well, in
> case the field's actual value is null.
>  
> [1]
> [https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205].





[GitHub] [incubator-hudi] nsivabalan commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-20 Thread GitBox
nsivabalan commented on issue #1176: [HUDI-430] Adding InlineFileSystem to 
support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-601649921
 
 
   @vinothchandar : I have addressed your comments for the most part. I will wait for 2 days for you to have a final look. If not, I will merge it myself as per your suggestion.




[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
umehrot2 commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395545539
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
 ##
 @@ -486,11 +486,11 @@ private void 
saveWorkloadProfileMetadataToInflight(WorkloadProfile profile, Hood
 return updateIndexAndCommitIfNeeded(writeStatusRDD, hoodieTable, 
commitTime);
   }
 
-  private Partitioner getPartitioner(HoodieTable table, boolean isUpsert, 
WorkloadProfile profile) {
+  private Partitioner getPartitioner(HoodieTable table, boolean isUpsert, 
WorkloadProfile profile, JavaSparkContext jsc) {
 
 Review comment:
   Do we really need all this passing around of the `jsc` object? We can just use it directly from within this function, right, as it's inherited.




[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
umehrot2 commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395553703
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -602,18 +602,39 @@ private int addUpdateBucket(String fileIdHint) {
   return bucket;
 }
 
-private void assignInserts(WorkloadProfile profile) {
+private void assignInserts(WorkloadProfile profile, JavaSparkContext jsc) {
   // for new inserts, compute buckets depending on how many records we have for each partition
   Set<String> partitionPaths = profile.getPartitionPaths();
   long averageRecordSize =
   averageBytesPerRecord(metaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants(),
   config.getCopyOnWriteRecordSizeEstimate());
   LOG.info("AvgRecordSize => " + averageRecordSize);
+
+  HashMap<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
+  if (jsc != null && partitionPaths.size() > 1) {
+//Parallelize the GetSmallFile Operation by using RDDs
 
 Review comment:
   nit: probably remove this comment




[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-20 Thread GitBox
umehrot2 commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395562779
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -602,18 +602,39 @@ private int addUpdateBucket(String fileIdHint) {
   return bucket;
 }
 
-private void assignInserts(WorkloadProfile profile) {
+private void assignInserts(WorkloadProfile profile, JavaSparkContext jsc) {
   // for new inserts, compute buckets depending on how many records we have for each partition
   Set<String> partitionPaths = profile.getPartitionPaths();
   long averageRecordSize =
   averageBytesPerRecord(metaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants(),
   config.getCopyOnWriteRecordSizeEstimate());
   LOG.info("AvgRecordSize => " + averageRecordSize);
+
+  HashMap<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
+  if (jsc != null && partitionPaths.size() > 1) {
+//Parallelize the GetSmallFile Operation by using RDDs
+List<String> partitionPathsList = new ArrayList<>(partitionPaths);
+JavaRDD<String> partitionPathRdds = jsc.parallelize(partitionPathsList, partitionPathsList.size());
+List<Tuple2<String, List<SmallFile>>> partitionSmallFileTuples =
+partitionPathRdds.map(it -> new Tuple2<String, List<SmallFile>>(it, getSmallFiles(it))).collect();
+
+for (Tuple2<String, List<SmallFile>> tuple : partitionSmallFileTuples) {
+  partitionSmallFilesMap.put(tuple._1, tuple._2);
+}
 
 Review comment:
   You may want to refactor this to something like:
   ```
   partitionSmallFilesMap = partitionPathRdds.mapToPair((PairFunction<String, String, List<SmallFile>>)
       partitionPath -> new Tuple2<>(partitionPath, getSmallFiles(partitionPath))).collectAsMap();
   ```
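   As a self-contained illustration of that pattern (assuming the `SmallFile` type and a driver-side `getSmallFiles(String partitionPath)` helper from `HoodieCopyOnWriteTable`, as in the PR; this is a sketch, not the actual patch):
   ```
   import java.util.List;
   import java.util.Map;

   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.api.java.function.PairFunction;

   import scala.Tuple2;

   // Compute the small files for every partition in parallel and collect the
   // result as a map in one step, instead of collecting a list of tuples and
   // copying it into a HashMap by hand.
   private Map<String, List<SmallFile>> getPartitionSmallFilesMap(
       JavaSparkContext jsc, List<String> partitionPaths) {
     return jsc.parallelize(partitionPaths, partitionPaths.size())
         .mapToPair((PairFunction<String, String, List<SmallFile>>)
             partitionPath -> new Tuple2<>(partitionPath, getSmallFiles(partitionPath)))
         .collectAsMap();
   }
   ```
   One caveat with this shape: the lambda captures the enclosing table object, so that object (and everything it references) must be serializable to ship to the executors.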




[jira] [Commented] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-20 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063261#comment-17063261
 ] 

Pratyaksh Sharma commented on HUDI-723:
---

Can you please share the stacktrace? 

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> If schema is inferred from RowBasedSchemaProvider when SQL transformer is 
> used it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. If
> one wants to use the schema from the SQL transformation, the result of
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work, as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  





[jira] [Updated] (HUDI-727) Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-20 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-727:
--
Status: In Progress  (was: Open)

> Copy default values of fields if not present when rewriting incoming record 
> with new schema
> ---
>
> Key: HUDI-727
> URL: https://issues.apache.org/jira/browse/HUDI-727
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.0
>
>
> Currently we recommend that users evolve the schema in a backwards-compatible
> way. When evolving the schema in a backwards-compatible way, one of the most
> significant things to do is to define a default value for newly added columns,
> so that records published with the previous schema can also be consumed
> properly.
>  
> However, just before actually writing a record to the Hudi dataset, we rewrite
> the record with a new Avro schema that has the Hudi metadata columns [1]. In
> this function, we only get the values from the record without considering a
> field's default value. As a result, schema validation fails.
> IMO, this piece of code should take the default value into account as well, in
> case the field's actual value is null.
>  
> [1]
> [https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205].





[jira] [Closed] (HUDI-725) Remove or rewrite init log in the constructor of DeltaSync

2020-03-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-725.
-
Fix Version/s: 0.6.0
   Resolution: Done

Done via master branch: eeab532d794426115f839e6ee11a9fc1314698fe

> Remove or rewrite init log in the constructor of DeltaSync
> ---
>
> Key: HUDI-725
> URL: https://issues.apache.org/jira/browse/HUDI-725
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When initializing HoodieDeltaStreamer, DeltaSyncService and DeltaSync are
> initialized in turn, both of which print the same init config log:
> org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363 -
> LOG.info("Creating delta streamer with configs : " + props.toString());
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 -
> LOG.info("Creating delta streamer with configs : " + props.toString());
> So, I think the log at
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or
> rewritten.





[incubator-hudi] branch master updated: [HUDI-725] Remove init log in the constructor of DeltaSync (#1425)

2020-03-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new eeab532  [HUDI-725] Remove init log in the constructor of DeltaSync 
(#1425)
eeab532 is described below

commit eeab532d794426115f839e6ee11a9fc1314698fe
Author: Mathieu 
AuthorDate: Fri Mar 20 17:47:59 2020 +0800

[HUDI-725] Remove init log in the constructor of DeltaSync (#1425)
---
 .../src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java | 1 -
 1 file changed, 1 deletion(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
index 3073dfa..b8524ba 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -168,7 +168,6 @@ public class DeltaSync implements Serializable {
 this.tableType = tableType;
 this.onInitializingHoodieWriteClient = onInitializingHoodieWriteClient;
 this.props = props;
-LOG.info("Creating delta streamer with configs : " + props.toString());
 this.schemaProvider = schemaProvider;
 
 refreshTimeline();



[GitHub] [incubator-hudi] yanghua merged pull request #1425: [HUDI-725] Remove init log in the constructor of DeltaSync

2020-03-20 Thread GitBox
yanghua merged pull request #1425: [HUDI-725] Remove init log in the 
constructor of DeltaSync
URL: https://github.com/apache/incubator-hudi/pull/1425
 
 
   




[jira] [Closed] (HUDI-726) Delete unused method in HoodieDeltaStreamer

2020-03-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-726.
-
Fix Version/s: 0.6.0
   Resolution: Done

Done via master branch: 21c45e1051b593f0e1023a84cb96658320046dae

> Delete unused method in HoodieDeltaStreamer
> ---
>
> Key: HUDI-726
> URL: https://issues.apache.org/jira/browse/HUDI-726
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It seems that this method 
> 'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
>  has never been used.
> Can we delete it?





[incubator-hudi] branch master updated: [HUDI-726]Delete unused method in HoodieDeltaStreamer (#1426)

2020-03-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 21c45e1  [HUDI-726]Delete unused method in HoodieDeltaStreamer (#1426)
21c45e1 is described below

commit 21c45e1051b593f0e1023a84cb96658320046dae
Author: Mathieu 
AuthorDate: Fri Mar 20 17:44:16 2020 +0800

[HUDI-726]Delete unused method in HoodieDeltaStreamer (#1426)
---
 .../org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java  | 4 
 1 file changed, 4 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
index 01ab1cc..bff2b41 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
@@ -576,8 +576,4 @@ public class HoodieDeltaStreamer implements Serializable {
   }, executor)).toArray(CompletableFuture[]::new)), executor);
 }
   }
-
-  public DeltaSyncService getDeltaSyncService() {
-return deltaSyncService;
-  }
 }



[GitHub] [incubator-hudi] yanghua merged pull request #1426: [HUDI-726]Delete unused method in HoodieDeltaStreamer

2020-03-20 Thread GitBox
yanghua merged pull request #1426: [HUDI-726]Delete unused method in 
HoodieDeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1426
 
 
   




[GitHub] [incubator-hudi] hddong edited a comment on issue #1420: Broken Maven dependencies.

2020-03-20 Thread GitBox
hddong edited a comment on issue #1420: Broken Maven dependencies.
URL: https://github.com/apache/incubator-hudi/issues/1420#issuecomment-601610004
 
 
   @deabreu this is a dev version; you can run `mvn clean package -DskipTests -DskipITs -Pspark-shade-unbundle-avro` in the root of the hudi project to generate them locally.




[GitHub] [incubator-hudi] hddong commented on issue #1420: Broken Maven dependencies.

2020-03-20 Thread GitBox
hddong commented on issue #1420: Broken Maven dependencies.
URL: https://github.com/apache/incubator-hudi/issues/1420#issuecomment-601610004
 
 
   @deabreu this is a dev branch; you can run `mvn clean package -DskipTests -DskipITs -Pspark-shade-unbundle-avro` in the root of the hudi project to generate them locally.




[GitHub] [incubator-hudi] hddong commented on issue #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
hddong commented on issue #1424: [HUDI-697]Add unit test for 
ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#issuecomment-601607599
 
 
   @yanghua @vinothchandar please review when you are free.




[jira] [Commented] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-20 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063214#comment-17063214
 ] 

Alexander Filipchik commented on HUDI-723:
--

The current implementation just doesn't work for this case. If you use
NullTargetSchemaRegistryProvider you will get an NPE, as the schema will not be
registered.

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> If schema is inferred from RowBasedSchemaProvider when SQL transformer is 
> used it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. If
> one wants to use the schema from the SQL transformation, the result of
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work, as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  
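
A minimal sketch of the fix being described, reusing the names from the snippet above; `transformedRows` (the Dataset produced by the SQL transformer) is a hypothetical name here, and the control flow is illustrative only:
{code:java}
// Illustrative only: when the configured provider has no target schema (e.g.
// NullTargetSchemaRegistryProvider), derive one from the transformed rows and
// register it before the HoodieWriteClient is created, so Kryo can resolve it.
SchemaProvider effectiveProvider = schemaProvider;
if (effectiveProvider == null || effectiveProvider.getTargetSchema() == null) {
  // RowBasedSchemaProvider infers an Avro schema from the Dataset's StructType
  effectiveProvider = new RowBasedSchemaProvider(transformedRows.schema());
}
registerAvroSchemas(effectiveProvider);
HoodieWriteConfig hoodieCfg = getHoodieClientConfig(effectiveProvider);
writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
onInitializingHoodieWriteClient.apply(writeClient);
{code}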





[jira] [Comment Edited] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-20 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063208#comment-17063208
 ] 

Alexander Filipchik edited comment on HUDI-721 at 3/20/20, 9:11 AM:


Looks like it fixed the issue. Local tests ran OK; will deploy to staging and
validate tomorrow.


was (Author: afilipchik):
looks like it fixed the issue.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in
> AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema
> generator (convertStructTypeToAvroSchema method), but after some debugging
> I'm pretty sure the issue is somewhere in AvroConversionHelper.
> What happens: when a complex type is extracted using SqlTransformer (using
> select bla fro ) where bla is a complex type with arrays of structs, Kryo
> serialization breaks with:
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> 

[jira] [Commented] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-20 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063208#comment-17063208
 ] 

Alexander Filipchik commented on HUDI-721:
--

Looks like it fixed the issue.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in
> AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema
> generator (convertStructTypeToAvroSchema method), but after some debugging
> I'm pretty sure the issue is somewhere in AvroConversionHelper.
> What happens: when a complex type is extracted using SqlTransformer (using
> select bla fro ) where bla is a complex type with arrays of structs, Kryo
> serialization breaks with:
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>   at 
> 

[jira] [Created] (HUDI-727) Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-20 Thread Pratyaksh Sharma (Jira)
Pratyaksh Sharma created HUDI-727:
-

 Summary: Copy default values of fields if not present when 
rewriting incoming record with new schema
 Key: HUDI-727
 URL: https://issues.apache.org/jira/browse/HUDI-727
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Utilities
Reporter: Pratyaksh Sharma
Assignee: Pratyaksh Sharma
 Fix For: 0.6.0


Currently we recommend that users evolve the schema in a backwards-compatible
way. When evolving the schema in a backwards-compatible way, one of the most
significant things to do is to define a default value for newly added columns,
so that records published with the previous schema can also be consumed
properly.
 
However, just before actually writing a record to the Hudi dataset, we rewrite
the record with a new Avro schema that has the Hudi metadata columns [1]. In
this function, we only get the values from the record without considering a
field's default value. As a result, schema validation fails.
IMO, this piece of code should take the default value into account as well, in
case the field's actual value is null.
 
[1]
[https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205].
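
For illustration, a hedged sketch of the proposed behaviour using plain Avro APIs (this mirrors what a rewriteRecord-style helper could do; it is not the actual HoodieAvroUtils patch):
{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteWithDefaults {
  // Rewrite `record` into `newSchema`, falling back to a field's declared
  // default when the incoming record carries no value for it.
  static GenericRecord rewriteRecord(GenericRecord record, Schema newSchema) {
    GenericRecord newRecord = new GenericData.Record(newSchema);
    for (Schema.Field field : newSchema.getFields()) {
      Object value = record.getSchema().getField(field.name()) == null
          ? null : record.get(field.name());
      if (value == null && field.defaultVal() != null) {
        // copy the schema-declared default instead of leaving the field null,
        // so validation of a backwards-compatible evolution does not fail
        value = GenericData.get().getDefaultValue(field);
      }
      newRecord.put(field.name(), value);
    }
    return newRecord;
  }
}
{code}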





[jira] [Updated] (HUDI-727) Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-20 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-727:
--
Status: Open  (was: New)

> Copy default values of fields if not present when rewriting incoming record 
> with new schema
> ---
>
> Key: HUDI-727
> URL: https://issues.apache.org/jira/browse/HUDI-727
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.0
>
>
> Currently we recommend that users evolve the schema in a backwards-compatible
> way. When evolving the schema in a backwards-compatible way, one of the most
> significant things to do is to define a default value for newly added columns,
> so that records published with the previous schema can also be consumed
> properly.
>  
> However, just before actually writing a record to the Hudi dataset, we rewrite
> the record with a new Avro schema that has the Hudi metadata columns [1]. In
> this function, we only get the values from the record without considering a
> field's default value. As a result, schema validation fails.
> IMO, this piece of code should take the default value into account as well, in
> case the field's actual value is null.
>  
> [1]
> [https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205].





[GitHub] [incubator-hudi] wangxianghu commented on issue #1425: [HUDI-725] Remove init log in the constructor of DeltaSync

2020-03-20 Thread GitBox
wangxianghu commented on issue #1425: [HUDI-725] Remove init log in the 
constructor of DeltaSync
URL: https://github.com/apache/incubator-hudi/pull/1425#issuecomment-601590371
 
 
   @yanghua would you please take a look?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] wangxianghu commented on issue #1426: [HUDI-726]Delete unused method in HoodieDeltaStreamer

2020-03-20 Thread GitBox
wangxianghu commented on issue #1426: [HUDI-726]Delete unused method in 
HoodieDeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1426#issuecomment-601589598
 
 
   @yanghua would you please take a look?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-726) Delete unused method in HoodieDeltaStreamer

2020-03-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-726:

Labels: pull-request-available  (was: )

> Delete unused method in HoodieDeltaStreamer
> ---
>
> Key: HUDI-726
> URL: https://issues.apache.org/jira/browse/HUDI-726
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>  Labels: pull-request-available
>
> It seems that this method 
> 'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
>  has never been used.
> Can we delete it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1426: [HUDI-726]Delete unused method in HoodieDeltaStreamer

2020-03-20 Thread GitBox
wangxianghu opened a new pull request #1426: [HUDI-726]Delete unused method in 
HoodieDeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1426
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Delete unused method in HoodieDeltaStreamer*
   
   ## Brief change log
   
   *Delete unused method in HoodieDeltaStreamer*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-725) Remove or rewrite init log in the constructor of DeltaSync

2020-03-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-725:

Labels: pull-request-available  (was: )

> Remove or rewrite  init log in the constructor of DeltaSync
> ---
>
> Key: HUDI-725
> URL: https://issues.apache.org/jira/browse/HUDI-725
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Trivial
>  Labels: pull-request-available
>
> When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
> initialized in turn, and both print the same init config log:
> org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363 - 
> LOG.info("Creating delta streamer with configs : " + props.toString());
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 - 
> LOG.info("Creating delta streamer with configs : " + props.toString());
> So I think the log at 
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
> rewritten.
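 
A self-contained toy illustration of the anti-pattern, with hypothetical class 
names (Wrapper/Delegate stand in for HoodieDeltaStreamer/DeltaSync; only the 
log message text is taken from the source):
{code:java}
import java.util.logging.Logger;

class Delegate {
  private static final Logger LOG = Logger.getLogger(Delegate.class.getName());
  Delegate(String props) {
    LOG.info("Creating delta streamer with configs : " + props); // duplicate line
  }
}

class Wrapper {
  private static final Logger LOG = Logger.getLogger(Wrapper.class.getName());
  private final Delegate delegate;
  Wrapper(String props) {
    LOG.info("Creating delta streamer with configs : " + props); // logged once here...
    this.delegate = new Delegate(props);                         // ...and again inside
  }
}

public class DuplicateInitLogDemo {
  public static void main(String[] args) {
    new Wrapper("key=value"); // prints the same init line twice per startup
  }
}
{code}
Dropping (or rewording) the inner log line removes the duplication while 
keeping one authoritative record of the configs at startup.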



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1425: [HUDI-725] Remove init log in the constructor of DeltaSync

2020-03-20 Thread GitBox
wangxianghu opened a new pull request #1425: [HUDI-725] Remove init log in the 
constructor of DeltaSync
URL: https://github.com/apache/incubator-hudi/pull/1425
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Remove init log in the constructor of DeltaSync*
   
   ## Brief change log
   
   *Remove init log in the constructor of DeltaSync*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io commented on issue #1424: [HUDI-697]Add unit test for ArchivedCommitsCommand

2020-03-20 Thread GitBox
codecov-io commented on issue #1424: [HUDI-697]Add unit test for 
ArchivedCommitsCommand
URL: https://github.com/apache/incubator-hudi/pull/1424#issuecomment-601586198
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=h1) 
Report
   > Merging 
[#1424](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/14e0c95206f6d7c1806555490bcbce8785ffea5a=desc)
 will **decrease** coverage by `0.04%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1424/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1424      +/-   ##
   ============================================
   - Coverage     67.58%   67.54%   -0.05%     
   + Complexity      255      253       -2     
   ============================================
     Files           340      340              
     Lines         16499    16499              
     Branches       1687     1687              
   ============================================
   - Hits          11151    11144       -7     
   - Misses         4611     4618       +7     
     Partials        737      737              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1424/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `40.00% <0.00%> (-60.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1424/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `56.75% <0.00%> (-13.52%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...che/hudi/common/util/BufferedRandomAccessFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1424/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQnVmZmVyZWRSYW5kb21BY2Nlc3NGaWxlLmphdmE=)
 | `55.26% <0.00%> (+0.87%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=footer).
 Last update 
[14e0c95...92d1ff3](https://codecov.io/gh/apache/incubator-hudi/pull/1424?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-20 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063166#comment-17063166
 ] 

Pratyaksh Sharma commented on HUDI-723:
---

Are you facing any specific error because of this? Otherwise, I believe the 
code already handles the corner cases where targetSchema or schemaProvider is 
null. [~afilipchik]

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL 
> transformer is used, it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. 
> If one wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work, as it checks for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  
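 
A hypothetical sketch of the suggested wiring (a fragment, not compilable on 
its own: it assumes it sits inside DeltaSync where schemaProvider, 
setupWriteClient, and a rowBasedSchemaProvider built from 
RowBasedSchemaProvider.getTargetSchema are in scope):
{code:java}
// Hypothetical sketch: fall back to the schema inferred from the SQL
// transformation when the user-supplied provider has no target schema, so the
// registerAvroSchemas(...) call inside setupWriteClient registers a usable schema.
SchemaProvider effectiveProvider =
    (schemaProvider != null && schemaProvider.getTargetSchema() != null)
        ? schemaProvider
        : rowBasedSchemaProvider; // assumed wrapper over RowBasedSchemaProvider.getTargetSchema()
setupWriteClient(effectiveProvider);
{code}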



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

