Build failed in Jenkins: hudi-snapshot-deployment-0.5 #244

2020-04-10 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.36 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Updated] (HUDI-700) Add unit test for FileSystemViewCommand

2020-04-10 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-700:
--
Status: Open  (was: New)

> Add unit test for FileSystemViewCommand
> ---
>
> Key: HUDI-700
> URL: https://issues.apache.org/jira/browse/HUDI-700
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-700) Add unit test for FileSystemViewCommand

2020-04-10 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-700:
--
Fix Version/s: 0.6.0

> Add unit test for FileSystemViewCommand
> ---
>
> Key: HUDI-700
> URL: https://issues.apache.org/jira/browse/HUDI-700
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-700) Add unit test for FileSystemViewCommand

2020-04-10 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-700.
-
Resolution: Done

Done via master branch: a464a2972e83e648585277ebe567703c8285cf1e

> Add unit test for FileSystemViewCommand
> ---
>
> Key: HUDI-700
> URL: https://issues.apache.org/jira/browse/HUDI-700
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [HUDI-700]Add unit test for FileSystemViewCommand (#1490)

2020-04-10 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new a464a29  [HUDI-700]Add unit test for FileSystemViewCommand (#1490)
a464a29 is described below

commit a464a2972e83e648585277ebe567703c8285cf1e
Author: hongdd 
AuthorDate: Sat Apr 11 10:12:21 2020 +0800

[HUDI-700]Add unit test for FileSystemViewCommand (#1490)
---
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  56 +
 .../hudi/cli/commands/FileSystemViewCommand.java   |  47 ++--
 .../cli/commands/TestFileSystemViewCommand.java| 267 +
 3 files changed, 351 insertions(+), 19 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
new file mode 100644
index 000..001a54a
--- /dev/null
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli;
+
+/**
+ * Fields of print table header.
+ */
+public class HoodieTableHeaderFields {
+  public static final String HEADER_PARTITION = "Partition";
+  public static final String HEADER_FILE_ID = "FileId";
+  public static final String HEADER_BASE_INSTANT = "Base-Instant";
+
+  /**
+   * Fields of data header.
+   */
+  public static final String HEADER_DATA_FILE = "Data-File";
+  public static final String HEADER_DATA_FILE_SIZE = HEADER_DATA_FILE + " Size";
+
+  /**
+   * Fields of delta header.
+   */
+  public static final String HEADER_DELTA_SIZE = "Delta Size";
+  public static final String HEADER_DELTA_FILES = "Delta Files";
+  public static final String HEADER_TOTAL_DELTA_SIZE = "Total " + HEADER_DELTA_SIZE;
+  public static final String HEADER_TOTAL_DELTA_FILE_SIZE = "Total Delta File Size";
+  public static final String HEADER_NUM_DELTA_FILES = "Num " + HEADER_DELTA_FILES;
+
+  /**
+   * Fields of compaction scheduled header.
+   */
+  private static final String COMPACTION_SCHEDULED_SUFFIX = " - compaction scheduled";
+  private static final String COMPACTION_UNSCHEDULED_SUFFIX = " - compaction unscheduled";
+
+  public static final String HEADER_DELTA_SIZE_SCHEDULED = HEADER_DELTA_SIZE + COMPACTION_SCHEDULED_SUFFIX;
+  public static final String HEADER_DELTA_SIZE_UNSCHEDULED = HEADER_DELTA_SIZE + COMPACTION_UNSCHEDULED_SUFFIX;
+  public static final String HEADER_DELTA_BASE_SCHEDULED = "Delta To Base Ratio" + COMPACTION_SCHEDULED_SUFFIX;
+  public static final String HEADER_DELTA_BASE_UNSCHEDULED = "Delta To Base Ratio" + COMPACTION_UNSCHEDULED_SUFFIX;
+  public static final String HEADER_DELTA_FILES_SCHEDULED = "Delta Files" + COMPACTION_SCHEDULED_SUFFIX;
+  public static final String HEADER_DELTA_FILES_UNSCHEDULED = "Delta Files" + COMPACTION_UNSCHEDULED_SUFFIX;
+}
diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
index a7025f8..cf86184 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/FileSystemViewCommand.java
@@ -20,6 +20,7 @@ package org.apache.hudi.cli.commands;
 
 import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
 import org.apache.hudi.cli.TableHeader;
 import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.HoodieLogFile;
@@ -99,14 +100,18 @@ public class FileSystemViewCommand implements CommandMarker {
     Function<Object, String> converterFunction =
         entry -> NumericUtils.humanReadableByteCount((Double.parseDouble(entry.toString())));
     Map<String, Function<Object, String>> fieldNameToConverterMap = new HashMap<>();
-fieldNameToConverterMap.put("Total Delta File Size", converterFunction);
-fieldNameToConverterMap.put("Data-File Size", converterFunction);
+

[GitHub] [incubator-hudi] yanghua merged pull request #1490: [HUDI-700]Add unit test for FileSystemViewCommand

2020-04-10 Thread GitBox
yanghua merged pull request #1490: [HUDI-700]Add unit test for 
FileSystemViewCommand
URL: https://github.com/apache/incubator-hudi/pull/1490
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-782) Add support for aliyun OSS

2020-04-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-782:

Labels: pull-request-available  (was: )

> Add support for aliyun OSS
> --
>
> Key: HUDI-782
> URL: https://issues.apache.org/jira/browse/HUDI-782
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: leesf
>Assignee: Hong Shen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Aliyun OSS is a widely used object storage service, and many users use OSS as 
> their backend storage system, so we could support OSS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua commented on issue #1504: [HUDI-780] Add junit 5

2020-04-10 Thread GitBox
yanghua commented on issue #1504: [HUDI-780] Add junit 5
URL: https://github.com/apache/incubator-hudi/pull/1504#issuecomment-612297922
 
 
   > @yanghua @xushiyan my style has been to only use javadocs when it warrants 
them. forced writing of trivial/obvious docs is not very helpful.. That said, 
public APIs should have detailed/accurate javadocs, test framework/utility 
classes should. Not sure if we want to burn cycles (and increase file lengths) 
just for sake of having javadocs..
   > 
   > I know this is subjective. but at least it goes with a few books I have read in 
the past and what made sense to me.. We can also start a separate DISCUSS on 
this and defer..
   
   @vinothchandar  I think we all understand that some class or method names are 
inherently readable and self-explanatory, so adding javadocs to those classes can 
seem a bit redundant.
   
   My larger consideration is this: constraints and standardization help form good 
habits. The developers involved in the community are diverse, and many have a 
"lazy psychology": without such constraints they may simply not do it, since not 
everyone has the habit of thinking it through. In fact, some test classes may be 
very simple and need no documentation, while other classes may require it, perhaps 
even specific instructions and warnings.
   
   But if we do not enable tool inspection, many contributors will be reluctant to 
do this out of that same "lazy psychology". And to enable tool inspection, we must 
keep the entire style uniform. This is my practice from the Flink community.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] shenh062326 opened a new pull request #1506: [HUDI-782] Add support of Aliyun object storage service.

2020-04-10 Thread GitBox
shenh062326 opened a new pull request #1506: [HUDI-782] Add support of Aliyun 
object storage service.
URL: https://github.com/apache/incubator-hudi/pull/1506
 
 
   ## What is the purpose of the pull request
   Add support of Aliyun object storage service.
   
   ## Brief change log
 - Modify StorageSchemes to add support for OSS, and add a test case
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081081#comment-17081081
 ] 

Vinoth Chandar commented on HUDI-773:
-

yeah, it should be fine.. a few things to check: 

1. Do we have all the Azure storage schemes supported? (StorageSchemes class) 
2. Docs on Azure support.
3. There are some schemes that support appends.. we need to classify them 
properly in the same StorageSchemes class.. 

if you can verify and close.. that would be awesome :) 
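
(For item 3, purely as an illustrative sketch of the kind of classification meant here; the names, shape and flag values below are hypothetical, not the actual Hudi StorageSchemes class.)

{code:java}
// Hypothetical sketch of a scheme registry that records whether a scheme's
// FileSystem implementation supports appends -- illustrative only.
public enum StorageSchemeSketch {
  HDFS("hdfs", true),   // HDFS supports appends
  S3("s3", false),      // object store, no appends
  ABFS("abfs", false),  // Azure Data Lake Storage Gen2 (illustrative entry and flag)
  WASB("wasb", false);  // Azure Blob Storage (illustrative entry and flag)

  private final String scheme;
  private final boolean supportsAppend;

  StorageSchemeSketch(String scheme, boolean supportsAppend) {
    this.scheme = scheme;
    this.supportsAppend = supportsAppend;
  }

  public static boolean isAppendSupported(String scheme) {
    for (StorageSchemeSketch s : values()) {
      if (s.scheme.equals(scheme)) {
        return s.supportsAppend;
      }
    }
    throw new IllegalArgumentException("Unsupported scheme: " + scheme);
  }
}
{code}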

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-773:

Fix Version/s: 0.6.0

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081032#comment-17081032
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Any extra tests needed? What tests have you guys done for AWS and GCP? 
[~vinoth] [~vbalaji]

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] prashantwason edited a comment on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason edited a comment on issue #1457: [HUDI-741] Added checks to 
validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-612262645
 
 
   > What will happen if there is incompatible message in Kafka? Will pipeline 
stall? What will be the way to fix it without purging whole kafka topic?
   @afilipchik 
   
   The current state is that:
   1. COW tables: 
  - Update to an existing parquet file: Will raise an exception during commit, 
as conversion of the record to the writerSchema will fail. 
  - Insert to a new parquet file: Will be ok.
   2. MOR Table:
  - Update and insert will both be successful, but will raise an exception 
during compaction.

   I am not very sure about the reader side. Either an exception, or the record may 
be missing the fields.

   So even today, the pipeline may stall (due to the exception). I don't think HUDI 
has a way out of it yet. You may drop the offending record (before calling 
HoodieWriteClient::insert()).

   This change only checks the schema. So if the writerSchema is the same, then 
this code has no extra effect.
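
   For illustration only, a minimal sketch of that "drop the offending record" workaround: a hypothetical helper that filters out records whose payload cannot be converted to the writer schema, treating HoodieRecordPayload#getInsertValue as the conversion step (exact import paths may differ by Hudi version):

```java
import org.apache.avro.Schema;
import org.apache.hudi.client.HoodieWriteClient;          // package may differ by Hudi version
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.HoodieRecordPayload;
import org.apache.spark.api.java.JavaRDD;

public class DropIncompatibleRecordsSketch {
  // Sketch only: keep records whose payload converts to the writer schema,
  // then hand the remainder to HoodieWriteClient#insert().
  public static <T extends HoodieRecordPayload> void insertConvertibleOnly(
      HoodieWriteClient<T> writeClient, JavaRDD<HoodieRecord<T>> records,
      String writerSchemaStr, String instantTime) {
    JavaRDD<HoodieRecord<T>> convertible = records.filter(rec -> {
      try {
        // Parse inside the task to avoid shipping a non-serializable Schema object.
        Schema writerSchema = new Schema.Parser().parse(writerSchemaStr);
        rec.getData().getInsertValue(writerSchema);
        return true;
      } catch (Exception e) {
        return false; // offending record: drop it (or route it to a dead-letter sink)
      }
    });
    writeClient.insert(convertible, instantTime);
  }
}
```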
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason edited a comment on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason edited a comment on issue #1457: [HUDI-741] Added checks to 
validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-612258453
 
 
   > Structure looks much better now. thanks @prashantwason ..
   > 
   > I raised an issue on the need to copy the avro compatibility code into the 
project.. Would like to understand why we cannot re-use as is.. I don't know if 
we can maintain this and keep in sync over time..
   > 
   > Nonetheless. this change also needs to update NOTICE/LICENSE appropriately 
as well, if we need to reuse that code
   
   @vinothchandar 
   
   Please see the details on 
[HUDI-741](https://issues.apache.org/jira/browse/HUDI-741?focusedCommentId=17081025=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17081025)
 for the limitation with the original code. 
   
   This is just one way to compare two schemas. If there is a better way for 
HUDI, then I will be happy to integrate that instead. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-612262645
 
 
   > What will happen if there is incompatible message in Kafka? Will pipeline 
stall? What will be the way to fix it without purging whole kafka topic?
   
   The current state is that:
   1. COW tables: 
  - Update to an existing parquet file: Will raise an exception during commit, 
as conversion of the record to the writerSchema will fail. 
  - Insert to a new parquet file: Will be ok.
   2. MOR Table:
  - Update and insert will both be successful, but will raise an exception 
during compaction.

   I am not very sure about the reader side. Either an exception, or the record may 
be missing the fields.

   So even today, the pipeline may stall (due to the exception). I don't think HUDI 
has a way out of it yet. You may drop the offending record (before calling 
HoodieWriteClient::insert()).

   This change only checks the schema. So if the writerSchema is the same, then 
this code has no extra effect.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081030#comment-17081030
 ] 

Yanjia Gary Li commented on HUDI-773:
-

Surprisingly easy... I tried the following test using a Spark 2.4 HDInsight cluster 
with Azure Data Lake Storage V2. Hudi ran out of the box, no extra config 
needed.
{code:scala}
// Imports assumed for this snippet (run from spark-shell on the HDInsight cluster).
import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions}
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Initial Batch
val outputPath = "/Test/HudiWrite"
val df1 = Seq(
  ("0", "year=2019", "test1", "pass", "201901"),
  ("1", "year=2019", "test1", "pass", "201901"),
  ("2", "year=2020", "test1", "pass", "201901"),
  ("3", "year=2020", "test1", "pass", "201901")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
val bulk_insert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
df1.write.format("org.apache.hudi").options(bulk_insert_ops).mode(SaveMode.Overwrite).save(outputPath)

// Upsert
val upsert_ops = Map(
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_uuid",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "_partition",
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "TIMESTAMP",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  HoodieWriteConfig.TABLE_NAME -> "test"
)
val df2 = Seq(
  ("0", "year=2019", "test1", "pass", "201910"),
  ("1", "year=2019", "test1", "pass", "201910"),
  ("2", "year=2020", "test1", "pass", "201910"),
  ("3", "year=2020", "test1", "pass", "201910")
).toDF("_uuid", "_partition", "PARAM_NAME", "RESULT_STRING", "TIMESTAMP")
df2.write.format("org.apache.hudi").options(upsert_ops).mode(SaveMode.Append).save(outputPath)

// Read back in Hudi snapshot mode and verify the row count is unchanged after the upsert.
val df_read = spark.read.format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load(outputPath)
assert(df_read.count() == 4)
{code}
 

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-773:

Status: In Progress  (was: Open)

> Hudi On Azure Data Lake Storage V2
> --
>
> Key: HUDI-773
> URL: https://issues.apache.org/jira/browse/HUDI-773
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Usability
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r406978802
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
 ##
 @@ -0,0 +1,274 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table;
+
+import java.io.IOException;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.avro.SchemaCompatibility;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.HoodieLogFormat.Reader;
+import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.InvalidTableException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+/**
+ * Helper class to read schema from data files and log files and to convert it 
between different formats.
+ */
+public class TableSchemaResolver {
+
+  private static final Logger LOG = 
LogManager.getLogger(TableSchemaResolver.class);
+  private HoodieTableMetaClient metaClient;
+
+  public TableSchemaResolver(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  /**
+   * Gets the schema for a hoodie table. Depending on the type of table, read 
from any file written in the latest
+   * commit. We will assume that the schema has not changed within a single 
atomic write.
+   *
+   * @return Parquet schema for this table
+   * @throws Exception
+   */
+  public MessageType getDataSchema() throws Exception {
 
 Review comment:
   This simplifies the code. I have updated to use the schema from commit 
metadata.
   
   For MOR tables, the compaction operation also leads to a commit (.commit 
extension) which saves HoodieCommitMetadata but without the SCHEMA. I guess 
this is a miss and not as per design. I have fixed this as I test both types of 
tables.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-612258453
 
 
   > Structure looks much better now. thanks @prashantwason ..
   > 
   > I raised an issue on the need to copy the avro compatibility code into the 
project.. Would like to understand why we cannot re-use as is.. I don't know if 
we can maintain this and keep in sync over time..
   > 
   > Nonetheless. this change also needs to update NOTICE/LICENSE appropriately 
as well, if we need to reuse that code
   
   Please see the details on 
[HUDI-741](https://issues.apache.org/jira/browse/HUDI-741?focusedCommentId=17081025=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17081025)
 for the limitation with the original code. 
   
   This is just one way to compare two schemas. If there is a better way for 
HUDI, then I will be happy to integrate that instead. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-04-10 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081025#comment-17081025
 ] 

Prashant Wason commented on HUDI-741:
-

HUDI requires a Schema to be specified in HoodieWriteConfig, which is used by the 
HoodieWriteClient to create the records. The schema is also saved in the data 
files (parquet format) and log files (avro format). Since a schema is required 
each time new data is ingested into a HUDI dataset, the schema can evolve over time.

HUDI specific validation of schema evolution should ensure that a newer schema 
can be used for the dataset by
checking that the data written using the old schema can be read using the new 
schema.

New Schema is compatible only if:
A1. There is no change in schema
A2. A field has been added and it has a default value specified

New Schema is incompatible if:
B1. A field has been deleted
B2. A field has been renamed (treated as delete + add)
B3. A field's type has changed to be incompatible with the older type

*Limitation with org.apache.avro.SchemaCompatibility:*

org.apache.avro.SchemaCompatibility checks schema compatibility between a 
writer schema (which originally wrote the
 AVRO record) and a readerSchema (with which we are reading the record). It 
ONLY guarantees that each field in
 the reader record can be populated from the writer record. Hence, if the 
reader schema is missing a field, it is
 still compatible with the writer schema.

In other words, org.apache.avro.SchemaCompatibility was written to guarantee 
that we can read the data written
 earlier. It does not guarantee schema evolution for HUDI (B1 above).
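
As a minimal, standalone illustration of that limitation (assuming only Avro on the classpath), Avro's built-in check reports a reader schema that has dropped a field (case B1 above) as compatible with the writer schema:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class AvroCompatibilityCheckDemo {
  public static void main(String[] args) {
    // Writer schema: the schema the existing data files were written with.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
            + "{\"name\":\"_row_key\",\"type\":\"string\"},"
            + "{\"name\":\"driver\",\"type\":\"string\"}]}");
    // Reader (new) schema: the 'driver' field has been deleted (case B1).
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
            + "{\"name\":\"_row_key\",\"type\":\"string\"}]}");

    // Avro only verifies that every reader field can be populated from the
    // writer record, so the deleted field goes unnoticed.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(reader, writer)
        .getType()); // prints COMPATIBLE
  }
}
{code}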

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 120h
>  Time Spent: 10m
>  Remaining Estimate: 119h 50m
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r406976510
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/avro/SchemaCompatibility.java
 ##
 @@ -0,0 +1,566 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.avro;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.Schema.Type;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * NOTE: This code is copied from org.apache.avro.SchemaCompatibility and 
changed for HUDI use case.
+ *
+ * HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
the HoodieWriteClient to
+ * create the records. The schema is also saved in the data files (parquet 
format) and log files (avro format).
+ * Since a schema is required each time new data is ingested into a HUDI 
dataset, schema can be evolved over time.
+ *
+ * HUDI specific validation of schema evolution should ensure that a newer 
schema can be used for the dataset by
+ * checking that the data written using the old schema can be read using the 
new schema.
+ *
+ * New Schema is compatible only if:
+ * 1. There is no change in schema
+ * 2. A field has been added and it has a default value specified
+ *
+ * New Schema is incompatible if:
+ * 1. A field has been deleted
+ * 2. A field has been renamed (treated as delete + add)
+ * 3. A field's type has changed to be incompatible with the older type
+ */
+public class SchemaCompatibility {
 
 Review comment:
   Done. I have added the limitation with org.apache.avro.SchemaCompatibility 
in the file as well as in the 
[HUDI-741](https://issues.apache.org/jira/browse/HUDI-741)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r406974059
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestTableSchemaEvolution.java
 ##
 @@ -0,0 +1,410 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.HoodieClientTestUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieInsertException;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.index.HoodieIndex.IndexType;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+
+import static 
org.apache.hudi.common.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static 
org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion.VERSION_1;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+public class TestTableSchemaEvolution extends TestHoodieClientBase {
+  private final String initCommitTime = "000";
+  private HoodieTableType tableType = HoodieTableType.COPY_ON_WRITE;
+  private HoodieTestDataGenerator dataGenDevolved = new 
HoodieTestDataGenerator(TRIP_EXAMPLE_SCHEMA_DEVOLVED);
+  private HoodieTestDataGenerator dataGenEvolved = new 
HoodieTestDataGenerator(TRIP_EXAMPLE_SCHEMA_EVOLVED);
+
+  // TRIP_EXAMPLE_SCHEMA with a new_field added
+  public static final String TRIP_EXAMPLE_SCHEMA_EVOLVED = "{\"type\": 
\"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+  + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+  + "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
+  + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"new_field\", \"type\": [\"null\", \"string\"], 
\"default\": null},"
+  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
+  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+  // TRIP_EXAMPLE_SCHEMA with driver field removed
+  public static final String TRIP_EXAMPLE_SCHEMA_DEVOLVED = "{\"type\": 
\"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+  + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+  + "{\"name\": \"rider\", \"type\": \"string\"},"
+  + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
+  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+
+  @Before
+  public void setUp() throws Exception {
+initResources();
+  }
+
+  @After
+  public void tearDown() {
+cleanupSparkContexts();
+  }
+
+  @Test
+  public void testMORTable() throws Exception {
+tableType = HoodieTableType.MERGE_ON_READ;
+initMetaClient();
+
+// Create the table
+HoodieTableMetaClient.initTableType(metaClient.getHadoopConf(), 

[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r406973747
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestTableSchemaEvolution.java
 ##
 @@ -0,0 +1,410 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.HoodieClientTestUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieInsertException;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.index.HoodieIndex.IndexType;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+
+import static 
org.apache.hudi.common.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static 
org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion.VERSION_1;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+public class TestTableSchemaEvolution extends TestHoodieClientBase {
+  private final String initCommitTime = "000";
+  private HoodieTableType tableType = HoodieTableType.COPY_ON_WRITE;
+  private HoodieTestDataGenerator dataGenDevolved = new 
HoodieTestDataGenerator(TRIP_EXAMPLE_SCHEMA_DEVOLVED);
+  private HoodieTestDataGenerator dataGenEvolved = new 
HoodieTestDataGenerator(TRIP_EXAMPLE_SCHEMA_EVOLVED);
+
+  // TRIP_EXAMPLE_SCHEMA with a new_field added
+  public static final String TRIP_EXAMPLE_SCHEMA_EVOLVED = "{\"type\": 
\"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+  + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+  + "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
+  + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"new_field\", \"type\": [\"null\", \"string\"], 
\"default\": null},"
+  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
+  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+  // TRIP_EXAMPLE_SCHEMA with driver field removed
+  public static final String TRIP_EXAMPLE_SCHEMA_DEVOLVED = "{\"type\": 
\"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+  + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+  + "{\"name\": \"rider\", \"type\": \"string\"},"
+  + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
+  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+
+  @Before
+  public void setUp() throws Exception {
+initResources();
+  }
+
+  @After
+  public void tearDown() {
+cleanupSparkContexts();
+  }
+
+  @Test
+  public void testMORTable() throws Exception {
+tableType = HoodieTableType.MERGE_ON_READ;
+initMetaClient();
+
+// Create the table
+HoodieTableMetaClient.initTableType(metaClient.getHadoopConf(), 

[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-10 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r406973660
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestTableSchemaEvolution.java
 ##
 @@ -0,0 +1,410 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.HoodieClientTestUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieInsertException;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.index.HoodieIndex.IndexType;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+
+import static 
org.apache.hudi.common.HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA;
+import static 
org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion.VERSION_1;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+public class TestTableSchemaEvolution extends TestHoodieClientBase {
+  private final String initCommitTime = "000";
+  private HoodieTableType tableType = HoodieTableType.COPY_ON_WRITE;
+  private HoodieTestDataGenerator dataGenDevolved = new 
HoodieTestDataGenerator(TRIP_EXAMPLE_SCHEMA_DEVOLVED);
+  private HoodieTestDataGenerator dataGenEvolved = new 
HoodieTestDataGenerator(TRIP_EXAMPLE_SCHEMA_EVOLVED);
+
+  // TRIP_EXAMPLE_SCHEMA with a new_field added
+  public static final String TRIP_EXAMPLE_SCHEMA_EVOLVED = "{\"type\": 
\"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+  + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+  + "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
+  + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"new_field\", \"type\": [\"null\", \"string\"], 
\"default\": null},"
+  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
+  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+  // TRIP_EXAMPLE_SCHEMA with driver field removed
+  public static final String TRIP_EXAMPLE_SCHEMA_DEVOLVED = "{\"type\": 
\"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+  + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
+  + "{\"name\": \"rider\", \"type\": \"string\"},"
+  + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
+  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
+  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+
+  @Before
+  public void setUp() throws Exception {
+initResources();
+  }
+
+  @After
+  public void tearDown() {
+cleanupSparkContexts();
+  }
+
+  @Test
+  public void testMORTable() throws Exception {
+tableType = HoodieTableType.MERGE_ON_READ;
+initMetaClient();
+
+// Create the table
+HoodieTableMetaClient.initTableType(metaClient.getHadoopConf(), 

[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

2020-04-10 Thread GitBox
vinothchandar edited a comment on issue #1498: Migrating parquet table to hudi 
issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612252471
 
 
   @vontman @ahmed-elfar First of all. Thanks for all the detailed information! 
   
   Answers to the good questions you raised 
   
   > Is that the normal time for initial loading for Hudi tables, or we are 
doing something wrong?
   It's hard to know what a normal time is, since it depends on the schema, machine, 
and many other things. But we shouldn't be this far off. I've tried to explain a few 
things below. 
   
   > Do we need a better cluster/recoures to be able to load the data for the 
first time?, because it is mentioned on Hudi confluence page that COW 
bulkinsert should match vanilla parquet writing + sort only.
   
   If you are ultimately trying to migrate a table (using bulk_insert once) and 
then do updates/deletes, I suggest testing upserts/deletes rather than 
bulk_insert.. If you primarily want to do bulk_insert alone to get the other 
benefits of Hudi, I'm happy to work with you more and resolve this. Perf is a major 
push for the next release, so we can definitely collaborate here.
   
   
   > Does partitioning improves the upsert and/or compaction time for Hudi 
tables, or just to improve the analytical queries (partition pruning)?
   
   Partitioning would obviously benefit query performance. But for writing 
itself, the data size matters more, I would say. 
   
   > We have noticed that the most time spent in the data indexing (the 
bulk-insert logic itself) and not the sorting stages/operation before the 
indexing, so how can we improve that? should we provide our own indexing logic?
   
   Nope, you don't have to supply your own indexing or anything. Bulk insert does 
not do any indexing; it does a global sort (so we can pack records belonging to the 
same partition into the same files as much as possible) and then writes out the 
files (a rough sketch of the idea follows). 
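
   (To illustrate the idea only, not Hudi's actual implementation: the "global sort then write" step is conceptually something like the following, assuming a JavaRDD of HoodieRecord and a target parallelism.)

```java
// Conceptual sketch of bulk_insert's write path: globally sort records by
// (partition path, record key) so records of the same partition land in the
// same output files, then write them out. Not the actual Hudi code.
JavaRDD<HoodieRecord> sorted = records.sortBy(
    r -> r.getPartitionPath() + "+" + r.getRecordKey(),  // composite sort key
    true,                                                 // ascending
    parallelism);                                         // drives the number of output files
// ... each partition of 'sorted' is then written out as parquet files.
```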
   
   
   **Few observations :** 
   
   - The 47-minute job is GC-ing quite a bit, so that can affect throughput a lot. 
Have you tried tuning the JVM?
   - I do see a fair bit of skew here from sorting, which may be affecting overall 
run times.. #1149 is also trying to provide a non-sorted mode that trades off 
file sizing for potentially faster writing.
   
   On what could create a difference between bulk_insert and spark/parquet:
   
   - I would also set `"hoodie.parquet.compression.codec" -> "SNAPPY"`, since 
Hudi uses gzip compression by default, whereas spark.write.parquet will use 
SNAPPY.
   - Hudi currently does an extra `df.rdd` conversion that could affect 
bulk_insert/insert (upsert/delete workloads are bound by merge costs, so this 
matters less there). I don't see that in your UI though.. 
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-04-10 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081014#comment-17081014
 ] 

Prashant Wason commented on HUDI-741:
-

Update: [~varadarb] informed me that schema is also available in the Hoodie 
commit as extraMetadata. This simplifies getting the last used schema for the 
checks.

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 120h
>  Time Spent: 10m
>  Remaining Estimate: 119h 50m
>
> HUDI requires a Schema to be specified in HoodieWriteConfig, which is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, the schema can evolve over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI-specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that data written using the 
> old schema can be read using the new schema.
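
A minimal sketch of the reader/writer compatibility check described above, using 
Avro's SchemaCompatibility helper (illustration only, not the actual Hoodie 
implementation):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

class SchemaEvolutionCheckSketch {
  // Returns true if data written with writerSchema (the old/last-used schema,
  // e.g. read back from the commit's extraMetadata) can still be read using
  // readerSchema (the newly supplied schema).
  static boolean canEvolveTo(Schema writerSchema, Schema readerSchema) {
    return SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema)
        .getType() == SchemaCompatibilityType.COMPATIBLE;
  }
}
{code}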



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] n3nash closed pull request #1053: [HUDI-371] : Support CombineInputFormat for RT tables

2020-04-10 Thread GitBox
n3nash closed pull request #1053: [HUDI-371] : Support CombineInputFormat for 
RT tables
URL: https://github.com/apache/incubator-hudi/pull/1053
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-371) CombineInputFormat for realtime tables

2020-04-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-371:

Labels: pull-request-available  (was: )

> CombineInputFormat for realtime tables
> --
>
> Key: HUDI-371
> URL: https://issues.apache.org/jira/browse/HUDI-371
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>
> Currently, we support combine input format for RO tables to reduce the number 
> of Hive map-reduce tasks spawned in case of large queries; a similar concept 
> is needed for RT tables as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] n3nash commented on issue #1053: [HUDI-371] : Support CombineInputFormat for RT tables

2020-04-10 Thread GitBox
n3nash commented on issue #1053: [HUDI-371] : Support CombineInputFormat for RT 
tables
URL: https://github.com/apache/incubator-hudi/pull/1053#issuecomment-612245263
 
 
   Closing this in favor of #1503


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
codecov-io edited a comment on issue #1486: [HUDI-759] Integrate checkpoint 
privoder with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-609364046
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=h1) 
Report
   > Merging 
[#1486](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/c0f96e072650d39433929c6efe3bc0b2cd882a39=desc)
 will **decrease** coverage by `0.08%`.
   > The diff coverage is `78.83%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1486/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1486      +/-   ##
   ============================================
   - Coverage     72.25%    72.16%    -0.09%
   - Complexity      289       293        +4
   ============================================
     Files           338       339        +1
     Lines         15946     15956       +10
     Branches       1624      1625        +1
   ============================================
   - Hits          11521     11515        -6
   - Misses         3697      3712       +15
   - Partials        728       729        +1
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `64.70% <72.22%> (-0.71%)` | `22.00 <12.00> (+1.00)` | :arrow_down: |
   | 
[...ities/checkpointing/InitialCheckPointProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvSW5pdGlhbENoZWNrUG9pbnRQcm92aWRlci5qYXZh)
 | `80.00% <80.00%> (ø)` | `1.00 <1.00> (?)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `77.93% <80.23%> (-1.22%)` | `11.00 <7.00> (+1.00)` | :arrow_down: |
   | 
[...lities/checkpointing/KafkaConnectHdfsProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvS2Fma2FDb25uZWN0SGRmc1Byb3ZpZGVyLmphdmE=)
 | `92.30% <88.88%> (ø)` | `13.00 <6.00> (+1.00)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.44% <100.00%> (ø)` | `37.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `40.00% <0.00%> (-40.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `58.33% <0.00%> (-13.89%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1486/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=footer).
 Last update 
[c0f96e0...639b972](https://codecov.io/gh/apache/incubator-hudi/pull/1486?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder 
with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-612171042
 
 
   Addressed some comments; summary:
   
   - Removed the `--bootstrap-from` option in the delta streamer. Use the 
`hoodie.deltastreamer.checkpoint.provider.path` field in the props instead. 
   - Use TypedProperty to construct the `InitialCheckPointProvider` and 
`init(FileSystem fs)` to initialize the class (sketched below).
   - Keep `hiveConf` as the variable name even though the `HiveConf` type 
changed to `Configuration`. Open to discussion if you guys don't agree.
   - Not able to replace all `null` in this PR, because `null` serves as a 
flag in the delta streamer workflow and changing it might affect the behavior 
of other classes using the `TypedProperty` field. Will need a separate PR for 
that code refactoring. 
   - The style check tool automatically adds `final` and `this` to match 
stylecheck.xml. 
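   
   A hedged sketch of that construction pattern (illustration only, not code 
from this PR; the concrete provider class, its constructor shape and the path 
value are assumptions made for the example):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hudi.common.config.TypedProperties;
   import org.apache.hudi.utilities.checkpointing.InitialCheckPointProvider;
   import org.apache.hudi.utilities.checkpointing.KafkaConnectHdfsProvider;
   
   // Build the provider from TypedProperties carrying the checkpoint path, then
   // initialize it with a FileSystem, as described in the summary above.
   TypedProperties props = new TypedProperties();
   props.setProperty("hoodie.deltastreamer.checkpoint.provider.path",
       "hdfs:///kafka-connect/topics/my-topic");                 // placeholder path
   InitialCheckPointProvider provider = new KafkaConnectHdfsProvider(props); // assumed constructor
   provider.init(FileSystem.get(new Configuration()));           // init(FileSystem fs)
   ```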


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (HUDI-571) Modify Hudi CLI to show archived commits

2020-04-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish resolved HUDI-571.
-
Resolution: Resolved

> Modify Hudi CLI to show archived commits
> 
>
> Key: HUDI-571
> URL: https://issues.apache.org/jira/browse/HUDI-571
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Hudi CLI has a 'show archived commits' command, which is not very helpful
>  
> {code:java}
> ->show archived commits
> ===> Showing only 10 archived commits <===
>     
>     | CommitTime    | CommitType|
>     |===|
>     | 2019033304| commit    |
>     | 20190323220154| commit    |
>     | 20190323220154| commit    |
>     | 20190323224004| commit    |
>     | 20190323224013| commit    |
>     | 20190323224229| commit    |
>     | 20190323224229| commit    |
>     | 20190323232849| commit    |
>     | 20190323233109| commit    |
>     | 20190323233109| commit    |
>  {code}
> Modify it or introduce a new command to make it easier to debug
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-571) Modify Hudi CLI to show archived commits

2020-04-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish reopened HUDI-571:
-

> Modify Hudi CLI to show archived commits
> 
>
> Key: HUDI-571
> URL: https://issues.apache.org/jira/browse/HUDI-571
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Hudi CLI has a 'show archived commits' command, which is not very helpful
>  
> {code:java}
> ->show archived commits
> ===> Showing only 10 archived commits <===
>     
>     | CommitTime    | CommitType|
>     |===|
>     | 2019033304| commit    |
>     | 20190323220154| commit    |
>     | 20190323220154| commit    |
>     | 20190323224004| commit    |
>     | 20190323224013| commit    |
>     | 20190323224229| commit    |
>     | 20190323224229| commit    |
>     | 20190323232849| commit    |
>     | 20190323233109| commit    |
>     | 20190323233109| commit    |
>  {code}
> Modify it or introduce a new command to make it easier to debug
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vontman commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

2020-04-10 Thread GitBox
vontman commented on issue #1498: Migrating parquet table to hudi issue 
[SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612156801
 
 
   @bvaradar 
   The `.commit` file: 
https://gist.github.com/vontman/e2242b6b0bd3cfc126ed725bb85ccdea
   
   This is the schema for the table as well: 
https://gist.github.com/vontman/761e34981994fa36d3c9d22db5a80ea8
   
   Screenshots from the 47min run:
   
   
![image](https://user-images.githubusercontent.com/4175383/79013450-0aba8580-7b69-11ea-9e9d-892e3e71d260.png)
   
   
![image](https://user-images.githubusercontent.com/4175383/79013460-11e19380-7b69-11ea-9104-fcbea62c159e.png)
   
   
   
![image](https://user-images.githubusercontent.com/4175383/79013528-3e95ab00-7b69-11ea-8e65-84f4655c17c0.png)
   
   
![image](https://user-images.githubusercontent.com/4175383/79013473-1d34bf00-7b69-11ea-969d-3f3ae81dd61e.png)
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1503: [HUDI-371] : Supporting combine input format RT tables

2020-04-10 Thread GitBox
n3nash commented on issue #1503: [HUDI-371] : Supporting combine input format 
RT tables
URL: https://github.com/apache/incubator-hudi/pull/1503#issuecomment-612151106
 
 
   Closed PR -> https://github.com/apache/incubator-hudi/pull/1053 in favor of 
this one. @bvaradar PTAL


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-633) archival fails with large clean files

2020-04-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-633:

Status: Open  (was: New)

> archival fails with large clean files
> -
>
> Key: HUDI-633
> URL: https://issues.apache.org/jira/browse/HUDI-633
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at 
> org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
> at 
> org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
> at 
> org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
> at 
> org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
> at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:55)
> at org.apache.avro.io.Encoder.writeString(Encoder.java:121)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:213)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:208)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:76)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:180)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
> 10:01
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
> at 
> com.uber.hoodie.common.table.log.block.HoodieAvroDataBlock.getContentBytes(HoodieAvroDataBlock.java:124)
> at 
> com.uber.hoodie.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:126)
> at 
> com.uber.hoodie.io.HoodieCommitArchiveLog.writeToFile(HoodieCommitArchiveLog.java:267)
> at 
> com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:249)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-633) archival fails with large clean files

2020-04-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish resolved HUDI-633.
-
Resolution: Won't Fix

Decided to go with a different approach by not restricting the size of clean files

> archival fails with large clean files
> -
>
> Key: HUDI-633
> URL: https://issues.apache.org/jira/browse/HUDI-633
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at 
> org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
> at 
> org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
> at 
> org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
> at 
> org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
> at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:55)
> at org.apache.avro.io.Encoder.writeString(Encoder.java:121)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:213)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:208)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:76)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:180)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
> at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
> 10:01
> at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
> at 
> com.uber.hoodie.common.table.log.block.HoodieAvroDataBlock.getContentBytes(HoodieAvroDataBlock.java:124)
> at 
> com.uber.hoodie.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:126)
> at 
> com.uber.hoodie.io.HoodieCommitArchiveLog.writeToFile(HoodieCommitArchiveLog.java:267)
> at 
> com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:249)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-687) incremental reads on MOR tables using RO view can lead to missing updates

2020-04-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-687:

Status: In Progress  (was: Open)

> incremental reads on MOR tables using RO view can lead to missing updates
> -
>
> Key: HUDI-687
> URL: https://issues.apache.org/jira/browse/HUDI-687
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: satish
>Assignee: satish
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> example timeline:
> t0 -> create bucket1.parquet
> t1 -> create and append updates bucket1.log
> t2 -> request compaction 
> t3 -> create bucket2.parquet
> If compaction at t2 takes a long time, incremental reads using 
> HoodieParquetInputFormat can skip data ingested at t1, leading to 'data loss' 
> (the data will still be on disk, but incremental readers won't see it because 
> it's in a log file and readers move on to t3).
> To work around this problem, we want to stop returning data belonging to 
> commits > t1. After compaction is complete, incremental readers would see 
> updates in t2, t3, and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-687) incremental reads on MOR tables using RO view can lead to missing updates

2020-04-10 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish updated HUDI-687:

Status: Open  (was: New)

> incremental reads on MOR tables using RO view can lead to missing updates
> -
>
> Key: HUDI-687
> URL: https://issues.apache.org/jira/browse/HUDI-687
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: satish
>Assignee: satish
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> example timeline:
> t0 -> create bucket1.parquet
> t1 -> create and append updates bucket1.log
> t2 -> request compaction 
> t3 -> create bucket2.parquet
> If compaction at t2 takes a long time, incremental reads using 
> HoodieParquetInputFormat can skip data ingested at t1, leading to 'data loss' 
> (the data will still be on disk, but incremental readers won't see it because 
> it's in a log file and readers move on to t3).
> To work around this problem, we want to stop returning data belonging to 
> commits > t1. After compaction is complete, incremental readers would see 
> updates in t2, t3, and so on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar merged pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-10 Thread GitBox
bvaradar merged pull request #1396: [HUDI-687] Stop incremental reader on RO 
table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (8c7cef3 -> c0f96e0)

2020-04-10 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 8c7cef3  [HUDI - 738] Add validation to DeltaStreamer to fail fast 
when filterDupes is enabled on UPSERT mode. (#1505)
 add c0f96e0  [HUDI-687] Stop incremental reader on RO table when there is 
a pending compaction (#1396)

No new revisions were added by this update.

Summary of changes:
 .../hudi/common/HoodieMergeOnReadTestUtils.java|  40 +--
 .../apache/hudi/table/TestCopyOnWriteTable.java|  91 +++
 .../apache/hudi/table/TestMergeOnReadTable.java| 279 +
 .../table/timeline/HoodieDefaultTimeline.java  |   9 +-
 .../hudi/common/table/timeline/HoodieTimeline.java |   5 +
 .../table/timeline/TestHoodieActiveTimeline.java   |  16 +-
 .../org/apache/hudi/hadoop/HoodieHiveUtil.java |  24 ++
 .../hudi/hadoop/HoodieParquetInputFormat.java  |  67 +++--
 .../realtime/HoodieParquetRealtimeInputFormat.java |   7 +
 .../apache/hudi/hadoop/InputFormatTestUtil.java|   7 +-
 .../hudi/hadoop/TestHoodieParquetInputFormat.java  | 122 +
 11 files changed, 541 insertions(+), 126 deletions(-)



[incubator-hudi] branch hudi_test_suite_refactor updated (5ee8a85 -> d2d2866)

2020-04-10 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 5ee8a85  Build fixes after rebase
 discard 3e2e710  Fix Compilation Issues + Port Bug Fixes
omit 29b4fdf  [HUDI-394] Provide a basic implementation of test suite
 add 6808559  [HUDI-717] Fixed usage of HiveDriver for DDL statements. 
(#1416)
 add deb95ad  [HUDI-748] Adding .codecov.yml to set exclusions for code 
coverage reports. (#1468)
 add 575d87c  HUDI-644 kafka connect checkpoint provider (#1453)
 add eaf6cc2  [HUDI-756] Organize Cleaning Action execution into a single 
package in hudi-client (#1485)
 add b5d093a  [MINOR] Clear up the redundant comment. (#1489)
 add d610252  [HUDI-288]: Add support for ingesting multiple kafka streams 
in a single DeltaStreamer deployment (#1150)
 add 4e5c867  [HUDI-740]Fix can not specify the sparkMaster and code clean 
for SparkUtil (#1452)
 add f7b55af  [MINOR] Fix typo in TimelineService (#1497)
 add 1f6be82  [HUDI-758] Modify Integration test to include incremental 
queries for MOR tables
 add 3c80342  rename variable per review comments
 add 996f761  Trying git merge --squash
 add f5f34bb  [HUDI-568] Improve unit test coverage
 add 8c7cef3  [HUDI - 738] Add validation to DeltaStreamer to fail fast 
when filterDupes is enabled on UPSERT mode. (#1505)
 add 1450004  [HUDI-394] Provide a basic implementation of test suite
 add ab98c40  Fix Compilation Issues + Port Bug Fixes
 add b519057  Build fixes after rebase
 add d2d2866  more fixes

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (5ee8a85)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (d2d2866)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .codecov.yml   |  44 +++
 ...n_commit_time.sh => get_min_commit_time_cow.sh} |   0
 ...n_commit_time.sh => get_min_commit_time_mor.sh} |   2 +-
 ...ntal.commands => hive-incremental-cow.commands} |   2 +-
 ...n.commands => hive-incremental-mor-ro.commands} |   9 +-
 ...n.commands => hive-incremental-mor-rt.commands} |   9 +-
 .../org/apache/hudi/cli/HoodieCliSparkConfig.java  |  46 +++
 .../apache/hudi/cli/commands/CleansCommand.java|   2 +-
 .../hudi/cli/commands/CompactionCommand.java   |  16 +-
 .../org/apache/hudi/cli/commands/SparkMain.java|  62 ++--
 .../java/org/apache/hudi/cli/utils/SparkUtil.java  |  36 +-
 .../apache/hudi/client/AbstractHoodieClient.java   |   4 -
 .../hudi/client/AbstractHoodieWriteClient.java |  16 +-
 .../org/apache/hudi/client/HoodieCleanClient.java  | 197 --
 .../org/apache/hudi/client/HoodieReadClient.java   |   2 +-
 .../org/apache/hudi/client/HoodieWriteClient.java  |  62 ++--
 .../apache/hudi/table/HoodieCommitArchiveLog.java  |  12 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  | 172 +
 .../apache/hudi/table/HoodieMergeOnReadTable.java  |   5 +-
 .../java/org/apache/hudi/table/HoodieTable.java|  57 ++-
 .../action/BaseActionExecutor.java}|  25 +-
 .../table/action/clean/CleanActionExecutor.java| 280 +++
 .../clean/CleanPlanner.java}   |  11 +-
 .../table/action/clean/PartitionCleanStat.java |  70 
 .../compact/HoodieMergeOnReadTableCompactor.java   |   2 +-
 .../org/apache/hudi/client/TestClientRollback.java |   6 +-
 .../apache/hudi/client/TestHoodieClientBase.java   |  11 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java  |   2 +-
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 +-
 .../hudi/client/TestUpdateSchemaEvolution.java |   6 +-
 .../hudi/common/HoodieTestDataGenerator.java   | 157 ++--
 .../java/org/apache/hudi/index/TestHbaseIndex.java |  26 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |  16 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   8 +-
 .../io/storage/TestHoodieStorageWriterFactory.java |   2 +-
 .../java/org/apache/hudi/table/TestCleaner.java|  48 ++-
 .../apache/hudi/table/TestCopyOnWriteTable.java|  20 +-
 .../apache/hudi/table/TestMergeOnReadTable.java|  48 ++-
 .../hudi/table/compact/TestAsyncCompaction.java|   2 +-
 .../hudi/table/compact/TestHoodieCompactor.java|  10 +-
 .../org/apache/hudi/common/model/HoodieKey.java|   

[GitHub] [incubator-hudi] lamber-ken closed issue #1499: [SUPPORT] DeltaStreamer - NoClassDefFoundError for HiveDriver

2020-04-10 Thread GitBox
lamber-ken closed issue #1499: [SUPPORT] DeltaStreamer - NoClassDefFoundError 
for HiveDriver
URL: https://github.com/apache/incubator-hudi/issues/1499
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

2020-04-10 Thread GitBox
bvaradar commented on issue #1498: Migrating parquet table to hudi issue 
[SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612113721
 
 
   @vontman : This looks to me like the columnar file creation is taking close 
to 17 mins out of 25 mins. Can you also attach the commit metadata (the 
.commit file under the .hoodie folder corresponding to each run) and the UI 
screenshots for the 47-min run too? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-04-10 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-738:
---
Status: Closed  (was: Patch Available)

> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This checks for dedupes with existing records in the table and thus ignores 
> updates. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [HUDI - 738] Add validation to DeltaStreamer to fail fast when filterDupes is enabled on UPSERT mode. (#1505)

2020-04-10 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8c7cef3  [HUDI - 738] Add validation to DeltaStreamer to fail fast 
when filterDupes is enabled on UPSERT mode. (#1505)
8c7cef3 is described below

commit 8c7cef3e50471d3645b207d97d515107764688c9
Author: Bhavani Sudha Saktheeswaran 
AuthorDate: Fri Apr 10 08:58:55 2020 -0700

[HUDI - 738] Add validation to DeltaStreamer to fail fast when filterDupes 
is enabled on UPSERT mode. (#1505)

Summary:
This fix ensures that, for the UPSERT operation, '--filter-dupes' is disabled, 
and fails fast if not. Otherwise it would silently drop all updates and only 
take in new records.
---
 .../org/apache/hudi/utilities/deltastreamer/DeltaSync.java |  5 -
 .../hudi/utilities/deltastreamer/HoodieDeltaStreamer.java  |  8 +++-
 .../deltastreamer/HoodieMultiTableDeltaStreamer.java   |  3 +++
 .../org/apache/hudi/utilities/TestHoodieDeltaStreamer.java | 14 --
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
index c964c91..7ec7303 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -173,9 +173,6 @@ public class DeltaSync implements Serializable {
 UtilHelpers.createSource(cfg.sourceClassName, props, jssc, 
sparkSession, schemaProvider));
 
 this.hiveConf = hiveConf;
-if (cfg.filterDupes) {
-  cfg.operation = cfg.operation == Operation.UPSERT ? Operation.INSERT : 
cfg.operation;
-}
 
 // If schemaRegistry already resolved, setup write-client
 setupWriteClient();
@@ -348,8 +345,6 @@ public class DeltaSync implements Serializable {
 Option scheduledCompactionInstant = Option.empty();
 // filter dupes if needed
 if (cfg.filterDupes) {
-  // turn upserts to insert
-  cfg.operation = cfg.operation == Operation.UPSERT ? Operation.INSERT : 
cfg.operation;
   records = DataSourceUtils.dropDuplicates(jssc, records, 
writeClient.getConfig());
 }
 
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
index 8368478..0325eaf 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
@@ -388,17 +388,15 @@ public class HoodieDeltaStreamer implements Serializable {
 tableType = HoodieTableType.valueOf(cfg.tableType);
   }
 
+  ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != 
Operation.UPSERT,
+  "'--filter-dupes' needs to be disabled when '--op' is 'UPSERT' to 
ensure updates are not missed.");
+
   this.props = properties != null ? properties : 
UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath), 
cfg.configs).getConfig();
   LOG.info("Creating delta streamer with configs : " + props.toString());
   this.schemaProvider = 
UtilHelpers.createSchemaProvider(cfg.schemaProviderClassName, props, jssc);
 
-  if (cfg.filterDupes) {
-cfg.operation = cfg.operation == Operation.UPSERT ? Operation.INSERT : 
cfg.operation;
-  }
-
   deltaSync = new DeltaSync(cfg, sparkSession, schemaProvider, props, 
jssc, fs, hiveConf,
 this::onInitializingWriteClient);
-
 }
 
 public DeltaSyncService(HoodieDeltaStreamer.Config cfg, JavaSparkContext 
jssc, FileSystem fs, HiveConf hiveConf)
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
index 74455f2..d9c8f83 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
@@ -24,6 +24,7 @@ import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.utilities.UtilHelpers;
 import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
@@ -66,6 +67,8 @@ public class HoodieMultiTableDeltaStreamer {
 this.jssc = jssc;
 String commonPropsFile 

[GitHub] [incubator-hudi] bhasudha merged pull request #1505: [HUDI-738] Add validation to DeltaStreamer when filtetDupes is enabled on UPSERT mode

2020-04-10 Thread GitBox
bhasudha merged pull request #1505: [HUDI-738] Add validation to DeltaStreamer 
when filtetDupes is enabled on UPSERT mode
URL: https://github.com/apache/incubator-hudi/pull/1505
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-04-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-738:

Labels: pull-request-available  (was: )

> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
>
> This checks for dedupes with existing records in the table and thus ignores 
> updates. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io commented on issue #1505: [HUDI-738] Add validation to DeltaStreamer when filtetDupes is enabled on UPSERT mode

2020-04-10 Thread GitBox
codecov-io commented on issue #1505: [HUDI-738] Add validation to DeltaStreamer 
when filtetDupes is enabled on UPSERT mode
URL: https://github.com/apache/incubator-hudi/pull/1505#issuecomment-612079202
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=h1) 
Report
   > Merging 
[#1505](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/f5f34bb1c16e6d070668486eba2a29f554c0bbc7=desc)
 will **increase** coverage by `0.00%`.
   > The diff coverage is `50.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1505/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##             master    #1505    +/-   ##
   ==========================================
     Coverage     72.15%    72.15%          
   + Complexity      290       289       -1
   ==========================================
     Files           338       338          
     Lines         15929     15926       -3
     Branches       1625      1622       -3
   ==========================================
   - Hits          11494     11492       -2
   - Misses         3704      3705       +1
   + Partials        731       729       -2
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1505/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.44% <ø> (+0.58%)` | `37.00 <0.00> (-1.00)` | :arrow_up: |
   | 
[...s/deltastreamer/HoodieMultiTableDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1505/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllTXVsdGlUYWJsZURlbHRhU3RyZWFtZXIuamF2YQ==)
 | `78.39% <0.00%> (-0.49%)` | `18.00 <0.00> (ø)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1505/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `79.14% <100.00%> (+0.37%)` | `10.00 <0.00> (ø)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1505/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1505/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `80.00% <0.00%> (+40.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=footer).
 Last update 
[f5f34bb...7e05fda](https://codecov.io/gh/apache/incubator-hudi/pull/1505?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1504: [HUDI-780] Add junit 5

2020-04-10 Thread GitBox
vinothchandar commented on issue #1504: [HUDI-780] Add junit 5
URL: https://github.com/apache/incubator-hudi/pull/1504#issuecomment-612058285
 
 
   @yanghua @xushiyan my style has been to only use javadocs when the code 
warrants them; forced writing of trivial/obvious docs is not very helpful. That 
said, public APIs should have detailed/accurate javadocs, and test 
framework/utility classes should too. Not sure if we want to burn cycles (and 
increase file lengths) just for the sake of having javadocs. 
   
   I know this is subjective, but it at least goes with a few books I have read 
in the past and what made sense to me. We can also start a separate DISCUSS 
thread on this and defer. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-432) Benchmark HFile for scan vs seek

2020-04-10 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080532#comment-17080532
 ] 

Vinoth Chandar commented on HUDI-432:
-

On S3, we should also consider the random read optimizations: 
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-get-requests.html
 

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Performance, Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, 
> Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 
> AM.png
>
>
> We want to benchmark HFile scan vs seek, as we intend to use HFile for record 
> indexing. HFile will be used inline in the hudi log for index purposes. 
> So, as part of benchmarking, we want to see when scan outperforms seek. 
> This is our experiment setup.
> keysToRead = no of keys to be looked up. // differs for different exp runs 
> like 100k, 200k, 500k, 1M. 
> N = no of iterations
>  
> {code:java}
> 1M entries were written to a single HFile as key value pairs. 
> Also, stored the keys in a separate file(key_file).
> keyList = read all keys from key_file
> for N no of iterations
> {
> shuffle keyList 
> trim the list to keysToRead 
> start timer HFile 
> read benchmark(scan/seek) 
> end timer
> }
> found avg for all timers captured
> {code}
>  
>  
> Result:
> Scan outperforms seek somewhere around 350k to 400k lookups out of 1M 
> entries with optimized configs.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
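
A hedged sketch of the timing harness described in the experiment above 
(illustration only, not the linked benchmark source); `lookupAll` stands in for 
the actual HFile scan or seek logic under test:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

class HFileLookupBenchmarkSketch {
  // Shuffle the key list, trim it to keysToRead, time the lookups, and average
  // over all iterations -- the same protocol as the pseudocode above.
  static double avgLookupMillis(List<String> keyList, int keysToRead, int iterations,
                                Consumer<List<String>> lookupAll) {
    List<String> keys = new ArrayList<>(keyList);
    long totalNanos = 0;
    for (int i = 0; i < iterations; i++) {
      Collections.shuffle(keys);                              // shuffle keyList
      List<String> sample = keys.subList(0, keysToRead);      // trim to keysToRead
      long start = System.nanoTime();                         // start timer
      lookupAll.accept(sample);                               // scan or seek over the sample keys
      totalNanos += System.nanoTime() - start;                // end timer
    }
    return totalNanos / 1_000_000.0 / iterations;             // avg over all timers captured
  }
}
{code}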



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-10 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080530#comment-17080530
 ] 

Vinoth Chandar commented on HUDI-783:
-

Gave you contributor permissions and assigned the ticket to you, [~vino]. 

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> hudi-pyspark packages should implement HUDI data source API for Apache Spark 
> using which HUDI files can be read as DataFrame and write to any Hadoop 
> supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-10 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080527#comment-17080527
 ] 

Vinoth Chandar commented on HUDI-783:
-

This is great stuff.. Looking forward to this! We could probably use this as a 
parent task and create sub-tasks underneath it, one for each of the things you 
mention?

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Priority: Major
>  Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> hudi-pyspark packages should implement HUDI data source API for Apache Spark 
> using which HUDI files can be read as DataFrame and write to any Hadoop 
> supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-783:

Status: Open  (was: New)

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Priority: Major
>  Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, so that Hudi files can be read as DataFrames and written to any 
> Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong commented on issue #1452: [HUDI-740]Fix can not specify the sparkMaster and code clean for SparkUtil

2020-04-10 Thread GitBox
hddong commented on issue #1452: [HUDI-740]Fix can not specify the sparkMaster 
and code clean for SparkUtil
URL: https://github.com/apache/incubator-hudi/pull/1452#issuecomment-612052977
 
 
   @yanghua @pratyakshsharma thanks for your review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-04-10 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-738:
---
Status: Patch Available  (was: In Progress)

> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>
> When filterDupes is enabled, incoming records are de-duplicated against records 
> already in the table, and thus updates are ignored. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bhasudha opened a new pull request #1505: [HUDI - 738] Add validation to DeltaStreamer when filterDupes is enabled on UPSERT mode

2020-04-10 Thread GitBox
bhasudha opened a new pull request #1505: [HUDI - 738] Add validation to 
DeltaStreamer when filterDupes is enabled on UPSERT mode
URL: https://github.com/apache/incubator-hudi/pull/1505
 
 
   Summary:
   This fix ensures that '--filter-dupes' is disabled for the UPSERT operation, and 
fails fast if it is not. Otherwise all updates would be dropped silently and only 
new records taken in.
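
   Roughly, the validation amounts to a guard like the sketch below (illustrative 
Scala only, not the actual Java change in this PR; the option names are placeholders):

   ```
   // Illustrative sketch only -- fail fast when the two options are combined in a
   // way that would silently drop updates.
   case class StreamerOptions(operation: String, filterDupes: Boolean)

   def validate(opts: StreamerOptions): Unit = {
     if (opts.filterDupes && opts.operation == "UPSERT") {
       throw new IllegalArgumentException(
         "'--filter-dupes' cannot be used with UPSERT: incoming duplicates are " +
           "filtered against existing records, so updates would be dropped silently.")
     }
   }
   ```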
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-764) Implement HoodieOrcWriter

2020-04-10 Thread renyi.bao (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080447#comment-17080447
 ] 

renyi.bao commented on HUDI-764:


Sorry, I made a mistake

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>
> Implement HoodieOrcWriter
> * Avro to ORC schema
> * Write record in row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-764) Implement HoodieOrcWriter

2020-04-10 Thread renyi.bao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renyi.bao updated HUDI-764:
---
Status: In Progress  (was: Open)

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>
> Implement HoodieOrcWriter
> * Avro to ORC schema
> * Write record in row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-764) Implement HoodieOrcWriter

2020-04-10 Thread renyi.bao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renyi.bao updated HUDI-764:
---
Status: Patch Available  (was: In Progress)

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>
> Implement HoodieOrcWriter
> * Avro to ORC schema
> * Write record in row



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-444) Refactor the codes based on scala codestyle NullChecker rule

2020-04-10 Thread Kotomi (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080383#comment-17080383
 ] 

Kotomi commented on HUDI-444:
-

Removing `return` may cause different behavior.

 

for example:

in hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
{code:java}
  if (mode == SaveMode.Ignore && exists) {
log.warn(s"hoodie dataset at $basePath already exists. Ignoring & not 
performing actual writes.")
-   return (true, common.util.Option.empty())
+   (true, common.util.Option.empty())
  }{code}
and
{code:java}
 if (hoodieRecords.isEmpty()) {
   log.info("new batch has no new records, skipping...")
-  return (true, common.util.Option.empty())
+  (true, common.util.Option.empty())
 }{code}
with `return` removed, such conditions have no effect at all.
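
To make the difference concrete, here is a minimal standalone sketch (not Hudi code) of 
what changes when the `if` branch is no longer the last expression in the method:
{code:java}
object ReturnSemanticsSketch {
  // With an early `return`, the caller gets the guard's tuple and the rest of
  // the method is skipped.
  def withReturn(exists: Boolean): (Boolean, String) = {
    if (exists) {
      return (true, "skipped")
    }
    (false, "wrote data")
  }

  // Without `return`, the tuple built inside the `if` is simply discarded and
  // execution falls through, so the guard no longer has any effect.
  def withoutReturn(exists: Boolean): (Boolean, String) = {
    if (exists) {
      (true, "skipped")
    }
    (false, "wrote data")
  }

  def main(args: Array[String]): Unit = {
    println(withReturn(true))    // (true,skipped)
    println(withoutReturn(true)) // (false,wrote data)
  }
}
{code}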

 

> Refactor the codes based on scala codestyle NullChecker rule
> 
>
> Key: HUDI-444
> URL: https://issues.apache.org/jira/browse/HUDI-444
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Refactor the codes based on scala codestyle NullChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-10 Thread GitBox
lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611921813
 
 
   Hi, here is the spark command:
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
   --master 'local[2]' \
   --driver-memory 6G \
   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-10 Thread GitBox
tverdokhlebd commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611920831
 
 
   > I run those operations with local[2] and 6GB driver memory, still worked 
fine.
   
   How did you set memory? I don't see any memory config...


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-10 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-783:
-
Description: 
*Goal:*
 As a pyspark user, I would like to read/write hudi datasets using pyspark.

There are several components to achieve this goal.
 # Create a hudi-pyspark package that users can import and start 
reading/writing hudi datasets.
 # Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.
 # Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.
 # Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package index|https://pypi.org/]

The hudi-pyspark package should implement the Hudi data source API for Apache 
Spark, so that Hudi files can be read as DataFrames and written to any 
Hadoop-supported file system.

Usage pattern after we launch this feature should be something like this:

Install the package using:
{code:java}
pip install hudi-pyspark{code}
or

Include hudi-pyspark package in your Spark Applications using:

spark-shell, pyspark, or spark-submit
{code:java}
> $SPARK_HOME/bin/spark-shell --packages 
> org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
 

 

 

 

 

  was:
*Goal:*
 As a pyspark user, I would like to read/write hudi datasets using pyspark.

There are several components to achieve this goal.
 # Create a hudi-pyspark package that users can import and start 
reading/writing hudi datasets.
 # Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.
 # Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.
 # Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package 
index|[https://pypi.org/]]

The hudi-pyspark package should implement the Hudi data source API for Apache 
Spark, so that Hudi files can be read as DataFrames and written to any 
Hadoop-supported file system.

Usage pattern after we launch this feature should be something like this:

Install the package using:
{code:java}
pip install hudi-pyspark{code}
or

Include hudi-pyspark package in your Spark Applications using:

spark-shell, pyspark, or spark-submit
{code:java}
> $SPARK_HOME/bin/spark-shell --packages 
> org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
 

 

 

 

 


> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Priority: Major
>  Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, so that Hudi files can be read as DataFrames and written to any 
> Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-10 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-783:
-
Description: 
*Goal:*
 As a pyspark user, I would like to read/write hudi datasets using pyspark.

There are several components to achieve this goal.
 # Create a hudi-pyspark package that users can import and start 
reading/writing hudi datasets.
 # Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.
 # Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.
 # Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package 
index|[https://pypi.org/]]

The hudi-pyspark package should implement the Hudi data source API for Apache 
Spark, so that Hudi files can be read as DataFrames and written to any 
Hadoop-supported file system.

Usage pattern after we launch this feature should be something like this:

Install the package using:
{code:java}
pip install hudi-pyspark{code}
or

Include hudi-pyspark package in your Spark Applications using:

spark-shell, pyspark, or spark-submit
{code:java}
> $SPARK_HOME/bin/spark-shell --packages 
> org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
 

 

 

 

 

  was:
*Goal:*
As a pyspark user, I would like to read/write hudi datasets using pyspark.

There are several components to achieve this goal.
 # Create a hudi-pyspark package that users can import and start 
reading/writing hudi datasets.
 # Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.
 # Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.
 # Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package 
index|[https://pypi.org/].]

The hudi-pyspark package should implement the Hudi data source API for Apache 
Spark, so that Hudi files can be read as DataFrames and written to any 
Hadoop-supported file system.

Usage pattern after we launch this feature should be something like this:

Install the package using:
{code:java}
pip install hudi-pyspark{code}
or

Include hudi-pyspark package in your Spark Applications using:

spark-shell, pyspark, or spark-submit
{code:java}
> $SPARK_HOME/bin/spark-shell --packages 
> org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
 

 

 

 

 


> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Priority: Major
>  Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|[https://pypi.org/]]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, so that Hudi files can be read as DataFrames and written to any 
> Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-04-10 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-783:


 Summary: Add official python support to create hudi datasets using 
pyspark
 Key: HUDI-783
 URL: https://issues.apache.org/jira/browse/HUDI-783
 Project: Apache Hudi (incubating)
  Issue Type: Wish
  Components: Utilities
Reporter: Vinoth Govindarajan
 Fix For: 0.6.0


*Goal:*
As a pyspark user, I would like to read/write hudi datasets using pyspark.

There are several components to achieve this goal.
 # Create a hudi-pyspark package that users can import and start 
reading/writing hudi datasets.
 # Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.
 # Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.
 # Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package 
index|[https://pypi.org/].]

The hudi-pyspark package should implement the Hudi data source API for Apache 
Spark, so that Hudi files can be read as DataFrames and written to any 
Hadoop-supported file system.

Usage pattern after we launch this feature should be something like this:

Install the package using:
{code:java}
pip install hudi-pyspark{code}
or

Include hudi-pyspark package in your Spark Applications using:

spark-shell, pyspark, or spark-submit
{code:java}
> $SPARK_HOME/bin/spark-shell --packages 
> org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
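
Once the package is published, reading and writing from a spark-shell launched as 
above might look roughly like the sketch below (it reuses the option names of the 
existing Hudi Spark datasource; the table location and field names are placeholders, 
and the pyspark calls would mirror these through the DataFrame API):
{code:java}
// Sketch only: assumes spark-shell was started with a Hudi bundle on the classpath.
import org.apache.spark.sql.SaveMode

val basePath = "file:///tmp/hudi_trips"   // placeholder location

// Build a tiny DataFrame with a record key, precombine field and partition column.
val df = spark.range(0, 10).selectExpr(
  "cast(id as string) as uuid", "id as ts", "'americas' as partitionpath")

// Write it out as a Hudi dataset.
df.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
  .option("hoodie.table.name", "hudi_trips")
  .mode(SaveMode.Overwrite)
  .save(basePath)

// Read it back as a DataFrame.
val tripsDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*")
tripsDF.show(false)
{code}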
 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder 
with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-611897189
 
 
   I added the save action and checkstyle as documented; not sure which one 
triggered all those `final` and `this` changes. Is it ok to add those? As a Scala 
programmer I do prefer to use `val` all the time, and the Google Java Style Guide 
does encourage using final, but I am not sure what Hudi's preference is. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint privoder with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r406617865
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
 ##
 @@ -394,6 +394,26 @@ public void testProps() {
 props.getString("hoodie.datasource.write.keygenerator.class"));
   }
 
+  @Test
+  public void testInitialCheckpointProvider() throws IOException {
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint privoder with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r406617783
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -90,35 +90,33 @@
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc) throws 
IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()));
+jssc.hadoopConfiguration(), null);
   }
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, 
TypedProperties props) throws IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()), props);
+jssc.hadoopConfiguration(), props);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf,
- TypedProperties properties) throws IOException {
-this.cfg = cfg;
-this.deltaSyncService = new DeltaSyncService(cfg, jssc, fs, hiveConf, 
properties);
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf) throws IOException {
+this(cfg, jssc, fs, hiveConf, null);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf) throws IOException {
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf,
+ TypedProperties properties) throws IOException {
+if (cfg.initialCheckpointProvider != null && cfg.bootstrapFromPath != null 
&& cfg.checkpoint == null) {
 
 Review comment:
   same as above, it will need a code refactoring if we wanna get rid of null


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint privoder with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r406617505
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -90,35 +90,33 @@
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc) throws 
IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()));
+jssc.hadoopConfiguration(), null);
 
 Review comment:
   Unfortunately, `null` is used as a flag in many places in the delta 
streamer, because the command-line tool will produce `null`. If we want to 
change this, it might need a code refactoring.
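
   For reference, the shape such a refactoring might take, as a Scala sketch only 
(the actual DeltaStreamer is Java, so it would use `java.util.Optional`; the field 
names here are placeholders):

   ```
   // Illustrative only: model "not supplied" explicitly instead of overloading null.
   final case class StreamerConfig(
       checkpoint: Option[String],                 // None = resume from commit metadata
       initialCheckpointProvider: Option[String],  // None = no provider configured
       bootstrapFromPath: Option[String])

   // Resolve the checkpoint to start from, without any null checks.
   def resolveCheckpoint(cfg: StreamerConfig): Option[String] =
     cfg.checkpoint.orElse {
       if (cfg.bootstrapFromPath.isDefined) cfg.initialCheckpointProvider else None
     }

   // e.g. resolveCheckpoint(StreamerConfig(None, Some("provider-checkpoint"), Some("s3://bucket/src")))
   // => Some(provider-checkpoint)
   ```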


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder with delta streamer

2020-04-10 Thread GitBox
garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint privoder 
with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-611891832
 
 
   hmm... Looks like the checkstyle auto-fixed something... Let me see what's 
going on...


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services