[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357505855
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
   String recordKey = partitionRecordKeyPair._2();
   String partitionPath = partitionRecordKeyPair._1();
 
-  return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-  .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+  return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+  .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+  new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
   .collect(Collectors.toList());
 }).flatMap(List::iterator);
   }
 
-
   /**
* Tagging for global index should only consider the record key.
*/
   @Override
   protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
   JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));
 
-// Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-// so we do left outer join.
-return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-.values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
 
 Review comment:
   Thanks @bschell. That works. Will wait for @vinothchandar to comment. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing

2019-12-12 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995398#comment-16995398
 ] 

Pratyaksh Sharma commented on HUDI-259:
---

[~garyli1019] This is still in progress and the work is not yet complete. 
However, please let me know which modules you are facing issues with, and I can 
try to help. 

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Sample issues
>  
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] 
> [https://github.com/apache/incubator-hudi/issues/898]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357501268
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
   String recordKey = partitionRecordKeyPair._2();
   String partitionPath = partitionRecordKeyPair._1();
 
-  return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-  .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+  return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+  .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+  new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
   .collect(Collectors.toList());
 }).flatMap(List::iterator);
   }
 
-
   /**
* Tagging for global index should only consider the record key.
*/
   @Override
   protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
   JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));
 
-// Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-// so we do left outer join.
-return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-.values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
+.mapToPair(e -> new Tuple2<>(e._1.getRecordKey(), Pair.of(e._1, e._2))).collectAsMap();
+
+JavaRDD<HoodieRecord<T>> toReturn = recordRDD.map(rec -> {
+  if (keyMap.containsKey(rec.getRecordKey())) {
+    Pair<HoodieKey, HoodieRecordLocation> value = keyMap.get(rec.getRecordKey());
+    return getTaggedRecord(new HoodieRecord(value.getLeft(), rec.getData()), Option.of(value.getRight()));
 
 Review comment:
   @vinothchandar : yes, you are right. 




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357501183
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
   String recordKey = partitionRecordKeyPair._2();
   String partitionPath = partitionRecordKeyPair._1();
 
-  return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-  .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+  return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+  .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+  new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
   .collect(Collectors.toList());
 }).flatMap(List::iterator);
   }
 
-
   /**
* Tagging for global index should only consider the record key.
*/
   @Override
   protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
   JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));
 
-// Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-// so we do left outer join.
-return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-.values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
+.mapToPair(e -> new Tuple2<>(e._1.getRecordKey(), Pair.of(e._1, e._2))).collectAsMap();
+
+JavaRDD<HoodieRecord<T>> toReturn = recordRDD.map(rec -> {
+  if (keyMap.containsKey(rec.getRecordKey())) {
+    Pair<HoodieKey, HoodieRecordLocation> value = keyMap.get(rec.getRecordKey());
+    return getTaggedRecord(new HoodieRecord(value.getLeft(), rec.getData()), Option.of(value.getRight()));
 
 Review comment:
   yes, you are right. keyMap contains the existing record-key information 
(HoodieKey, HoodieRecordLocation) from the existing dataset. For each incoming 
record to be upserted, if the record key is already located, we need to upsert 
to the same partition even if the incoming record (to be updated) was tagged 
with a different partition. And so, we reconstruct the HoodieKey with the 
recordKey from the incoming record and the partition path from the keyMap 
(index lookup).
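   The partition-override rule described above can be sketched standalone. This is a minimal illustration, not Hudi's API: plain Strings stand in for HoodieKey/HoodieRecordLocation, and the `GlobalIndexTagSketch` class and `tagPartition` method are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the tagging rule: an incoming record whose key already
// exists in the index keeps the indexed partition; otherwise it is inserted
// into the partition it arrived with. Class/method names are hypothetical.
public class GlobalIndexTagSketch {

    // indexedPartitions: recordKey -> partition path found during index lookup
    static String tagPartition(Map<String, String> indexedPartitions,
                               String recordKey, String incomingPartition) {
        // existing key wins: the upsert must target the partition already on disk
        return indexedPartitions.getOrDefault(recordKey, incomingPartition);
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("key1", "2019/12/01");
        // key1 exists, so the incoming partition 2019/12/12 is overridden
        System.out.println(tagPartition(index, "key1", "2019/12/12")); // 2019/12/01
        // key2 is new, so it goes to the incoming partition
        System.out.println(tagPartition(index, "key2", "2019/12/12")); // 2019/12/12
    }
}
```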
   




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357500105
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -304,6 +304,24 @@ public HoodieRecord generateUpdateRecord(HoodieKey key, String commitTime) throw
  return updates;
   }
 
+  public List<HoodieRecord> generateUpdatesWithDiffPartition(String commitTime, List<HoodieRecord> baseRecords)
 
 Review comment:
   this method uses some variables in HoodieTestDataGenerator. If we move this 
to the test class, we would have to pass in the DataGenerator instance, and 
hence I left it here. If you insist, I can move it. 
   
   Yes, you are right. To choose the new partition, I alternate between two 
partitions. 




[GitHub] [incubator-hudi] vinothchandar merged pull request #1103: [MINOR] Unify Lists import

2019-12-12 Thread GitBox
vinothchandar merged pull request #1103: [MINOR] Unify Lists import
URL: https://github.com/apache/incubator-hudi/pull/1103
 
 
   




[incubator-hudi] branch master updated: [MINOR] Unify Lists import (#1103)

2019-12-12 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f324057  [MINOR] Unify Lists import (#1103)
f324057 is described below

commit f324057b6e27075833673b09b5c92d9c53ed2b58
Author: lamber-ken 
AuthorDate: Fri Dec 13 13:42:18 2019 +0800

[MINOR] Unify Lists import (#1103)
---
 .../org/apache/hudi/hive/SlashEncodedDayPartitionValueExtractor.java | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git 
a/hudi-hive/src/main/java/org/apache/hudi/hive/SlashEncodedDayPartitionValueExtractor.java
 
b/hudi-hive/src/main/java/org/apache/hudi/hive/SlashEncodedDayPartitionValueExtractor.java
index 2ecad28..4dba0fe 100644
--- 
a/hudi-hive/src/main/java/org/apache/hudi/hive/SlashEncodedDayPartitionValueExtractor.java
+++ 
b/hudi-hive/src/main/java/org/apache/hudi/hive/SlashEncodedDayPartitionValueExtractor.java
@@ -18,11 +18,11 @@
 
 package org.apache.hudi.hive;
 
-import com.beust.jcommander.internal.Lists;
 import org.joda.time.DateTime;
 import org.joda.time.format.DateTimeFormat;
 import org.joda.time.format.DateTimeFormatter;
 
+import java.util.Collections;
 import java.util.List;
 
 /**
@@ -58,6 +58,7 @@ public class SlashEncodedDayPartitionValueExtractor 
implements PartitionValueExt
 int mm = Integer.parseInt(splits[1].contains("=") ? 
splits[1].split("=")[1] : splits[1]);
 int dd = Integer.parseInt(splits[2].contains("=") ? 
splits[2].split("=")[1] : splits[2]);
 DateTime dateTime = new DateTime(year, mm, dd, 0, 0);
-return Lists.newArrayList(getDtfOut().print(dateTime));
+
+return Collections.singletonList(getDtfOut().print(dateTime));
   }
 }
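For readers skimming the diff, the extraction logic around the changed line can be exercised standalone. The sketch below is a hypothetical simplification (`SlashDaySketch` is not a real Hudi class, and it formats via `String.format` rather than the real extractor's Joda-Time `DateTimeFormatter`); it shows why `Collections.singletonList` is a drop-in replacement for the removed third-party `Lists.newArrayList` helper.

```java
import java.util.Collections;
import java.util.List;

// Standalone sketch of slash-encoded day partition parsing, simplified from
// the diff above. Hypothetical class; the real extractor uses Joda-Time.
public class SlashDaySketch {

    static List<String> extract(String partitionPath) {
        String[] splits = partitionPath.split("/");
        // each segment may be plain ("2019") or hive-style ("year=2019")
        int year = Integer.parseInt(splits[0].contains("=") ? splits[0].split("=")[1] : splits[0]);
        int mm = Integer.parseInt(splits[1].contains("=") ? splits[1].split("=")[1] : splits[1]);
        int dd = Integer.parseInt(splits[2].contains("=") ? splits[2].split("=")[1] : splits[2]);
        // single-element result: Collections.singletonList is JDK-only, unlike
        // the removed com.beust.jcommander.internal.Lists helper
        return Collections.singletonList(String.format("%04d-%02d-%02d", year, mm, dd));
    }

    public static void main(String[] args) {
        System.out.println(extract("2019/12/13"));                // [2019-12-13]
        System.out.println(extract("year=2019/month=12/day=13")); // [2019-12-13]
    }
}
```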



[jira] [Assigned] (HUDI-405) Fix sync no hive partition at first time

2019-12-12 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-405:
---

Assignee: lamber-ken

> Fix sync no hive partition at first time
> 
>
> Key: HUDI-405
> URL: https://issues.apache.org/jira/browse/HUDI-405
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> If the user customizes the partition extractor, HiveSyncTool syncs no 
> partitions at the first commit.
> ISSUE: [https://github.com/apache/incubator-hudi/issues/828]





[jira] [Created] (HUDI-405) Fix syncronizing no hive partition at first time

2019-12-12 Thread lamber-ken (Jira)
lamber-ken created HUDI-405:
---

 Summary: Fix syncronizing no hive partition at first time
 Key: HUDI-405
 URL: https://issues.apache.org/jira/browse/HUDI-405
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: lamber-ken


If the user customizes the partition extractor, HiveSyncTool syncs no 
partitions at the first commit.

ISSUE: [https://github.com/apache/incubator-hudi/issues/828]





[jira] [Updated] (HUDI-405) Fix sync no hive partition at first time

2019-12-12 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-405:

Summary: Fix sync no hive partition at first time  (was: Fix syncronizing 
no hive partition at first time)

> Fix sync no hive partition at first time
> 
>
> Key: HUDI-405
> URL: https://issues.apache.org/jira/browse/HUDI-405
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Priority: Major
>
> If the user customizes the partition extractor, HiveSyncTool syncs no 
> partitions at the first commit.
> ISSUE: [https://github.com/apache/incubator-hudi/issues/828]





[GitHub] [incubator-hudi] wojustme opened a new pull request #1104: [HUDI-404] fix the error of compiling project.

2019-12-12 Thread GitBox
wojustme opened a new pull request #1104: [HUDI-404] fix the error of compiling 
project.
URL: https://github.com/apache/incubator-hudi/pull/1104
 
 
   When I compile the current project, it throws some errors, as described in 
the [jira](https://issues.apache.org/jira/projects/HUDI/issues/HUDI-404).




[jira] [Updated] (HUDI-404) Compile Master's Source Code Error

2019-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-404:

Labels: pull-request-available  (was: )

> Compile Master's Source Code  Error
> ---
>
> Key: HUDI-404
> URL: https://issues.apache.org/jira/browse/HUDI-404
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Xurenhe
>Priority: Major
>  Labels: pull-request-available
> Attachments: hudi-compile-error.png
>
>
> Hi, I downloaded the source code of Hudi. When I used the command 'mvn clean 
> package -DskipTests -DskipITs' to compile the project, some errors occurred.
> Checking the Maven dependencies, I found that the dependency 
> 'com.google.code.findbugs:jsr305' is missing.





[jira] [Updated] (HUDI-404) Compile Master's Source Code Error

2019-12-12 Thread Xurenhe (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xurenhe updated HUDI-404:
-
Attachment: hudi-compile-error.png

> Compile Master's Source Code  Error
> ---
>
> Key: HUDI-404
> URL: https://issues.apache.org/jira/browse/HUDI-404
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Xurenhe
>Priority: Major
> Attachments: hudi-compile-error.png
>
>
> Hi, I downloaded the source code of Hudi. When I used the command 'mvn clean 
> package -DskipTests -DskipITs' to compile the project, some errors occurred.
> Checking the Maven dependencies, I found that the dependency 
> 'com.google.code.findbugs:jsr305' is missing.





[jira] [Created] (HUDI-404) Compile Master's Source Code Error

2019-12-12 Thread Xurenhe (Jira)
Xurenhe created HUDI-404:


 Summary: Compile Master's Source Code  Error
 Key: HUDI-404
 URL: https://issues.apache.org/jira/browse/HUDI-404
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Xurenhe


Hi, I downloaded the source code of Hudi. When I used the command 'mvn clean 
package -DskipTests -DskipITs' to compile the project, some errors occurred.
Checking the Maven dependencies, I found that the dependency 
'com.google.code.findbugs:jsr305' is missing.





Build failed in Jenkins: hudi-snapshot-deployment-0.5 #127

2019-12-12 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.13 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[GitHub] [incubator-hudi] Akshay2Agarwal commented on issue #143: Tracking ticket for folks to be added to slack group

2019-12-12 Thread GitBox
Akshay2Agarwal commented on issue #143: Tracking ticket for folks to be added 
to slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-565288931
 
 
   Please add me to the group: akshay.agar...@grofers.com. Thanks in advance!




[GitHub] [incubator-hudi] yanghua commented on issue #1102: [HUDI-391] Rename module name from hudi-bench to hudi-test-suite

2019-12-12 Thread GitBox
yanghua commented on issue #1102: [HUDI-391] Rename module name from hudi-bench 
to hudi-test-suite
URL: https://github.com/apache/incubator-hudi/pull/1102#issuecomment-565288187
 
 
   @n3nash 
   
   > LGTM to me, couple of things to do before merging :
   > 
   > 1. Can you do a search of `hudi-bench` string in the module and make sure 
no comments are lingering that have this.
   
   Yes, I have searched for the `hudi-bench` keyword and replaced all 
occurrences; nothing is left.
   
   > 2. The diff somehow doesn't show the module name change, I'm guessing 
that's just github and you've already made that ?
   
   Yes. I have renamed all of them; after merging, you will not find anything 
if you search for the `bench` keyword.
   




[GitHub] [incubator-hudi] hddong commented on issue #1096: [HUDI-398]Add set env for spark launcher

2019-12-12 Thread GitBox
hddong commented on issue #1096: [HUDI-398]Add set env for spark launcher
URL: https://github.com/apache/incubator-hudi/pull/1096#issuecomment-565282623
 
 
   @leesf thanks, all have been updated.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
n3nash commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357463167
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/index/TestHbaseIndex.java
 ##
 @@ -168,13 +169,19 @@ public void testTagLocationAndDuplicateUpdate() throws Exception {
  HoodieWriteConfig config = getConfig();
  HBaseIndex index = new HBaseIndex(config);
  HoodieWriteClient writeClient = new HoodieWriteClient(jsc, config);
-writeClient.startCommit();
+writeClient.startCommitWithTime(newCommitTime);
  metaClient = HoodieTableMetaClient.reload(metaClient);
  HoodieTable hoodieTable = HoodieTable.getHoodieTable(metaClient, config, jsc);
 
  JavaRDD<WriteStatus> writeStatues = writeClient.upsert(writeRecords, newCommitTime);
  JavaRDD<HoodieRecord> javaRDD1 = index.tagLocation(writeRecords, jsc, hoodieTable);
+
 // Duplicate upsert and ensure correctness is maintained
+// We are trying to approximately imitate the case when the RDD is recomputed. For RDD creation, driver code is not
 
 Review comment:
   Discussed this f2f and resolved my doubts




[GitHub] [incubator-hudi] n3nash commented on issue #1102: [HUDI-391] Rename module name from hudi-bench to hudi-test-suite

2019-12-12 Thread GitBox
n3nash commented on issue #1102: [HUDI-391] Rename module name from hudi-bench 
to hudi-test-suite
URL: https://github.com/apache/incubator-hudi/pull/1102#issuecomment-565275253
 
 
   LGTM to me, couple of things to do before merging : 
   
   1) Can you do a search of `hudi-bench` string in the module and make sure no 
comments are lingering that have this.
   2) The diff somehow doesn't show the module name change, I'm guessing that's 
just github and you've already made that ?
   
   




[GitHub] [incubator-hudi] leesf commented on a change in pull request #1096: [HUDI-398]Add set env for spark launcher

2019-12-12 Thread GitBox
leesf commented on a change in pull request #1096: [HUDI-398]Add set env for 
spark launcher
URL: https://github.com/apache/incubator-hudi/pull/1096#discussion_r357401784
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/SetSparkEnvCommand.java
 ##
 @@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.cli.HoodiePrintHelper;
+
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * CLI command to set env.
 
 Review comment:
   to set and get env?




[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing

2019-12-12 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995153#comment-16995153
 ] 

Yanjia Gary Li commented on HUDI-259:
-

Hello, I recently started using Hadoop 3 and Spark 2.4. 
[https://github.com/apache/incubator-hudi/commit/7bc08cbfdce337ad980bb544ec9fc3dbdf9c#diff-832156391e3edd5b0ceb86007ce6ae41]
 enabled me to compile Hudi with Hadoop 3, but some tests fail. 

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Sample issues
>  
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] 
> [https://github.com/apache/incubator-hudi/issues/898]
>  





[GitHub] [incubator-hudi] leesf commented on a change in pull request #1096: [HUDI-398]Add set env for spark launcher

2019-12-12 Thread GitBox
leesf commented on a change in pull request #1096: [HUDI-398]Add set env for 
spark launcher
URL: https://github.com/apache/incubator-hudi/pull/1096#discussion_r357399619
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/SetSparkEnvCommand.java
 ##
 @@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.cli.HoodiePrintHelper;
+
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * CLI command to set env.
+ */
+@Component
+public class SetSparkEnvCommand implements CommandMarker {
 
 Review comment:
   How about renaming it to SparkEnvCommand, as it also supports the show command?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-242) Support Efficient bootstrap of large parquet datasets to Hudi

2019-12-12 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-242:

Fix Version/s: 0.5.1

> Support Efficient bootstrap of large parquet datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.5.1
>
>
>  Support Efficient bootstrap of large parquet tables





[jira] [Assigned] (HUDI-242) Support Seamless bootstrap of legacy datasets to Hudi

2019-12-12 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-242:
---

Assignee: Balaji Varadarajan

> Support Seamless bootstrap of legacy datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
>  Support Efficient bootstrap of large parquet tables





[jira] [Updated] (HUDI-242) Support Efficient bootstrap of large parquet datasets to Hudi

2019-12-12 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-242:

Summary: Support Efficient bootstrap of large parquet datasets to Hudi  
(was: Support Seamless bootstrap of legacy datasets to Hudi)

> Support Efficient bootstrap of large parquet datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
>  Support Efficient bootstrap of large parquet tables





[jira] [Updated] (HUDI-242) Support Seamless bootstrap of legacy datasets to Hudi

2019-12-12 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-242:

Description:  Support Efficient bootstrap of large parquet tables  (was: 
Build support to allow lazy bootstrap of exiting datasets to Hudi. Older 
partitions left untouched till any mutations arrive. When mutations arrive, the 
files are then migrated transparently to Hudi format. This is a sort of lazy 
bootstrapping.

 )

> Support Seamless bootstrap of legacy datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Priority: Major
>
>  Support Efficient bootstrap of large parquet tables





[GitHub] [incubator-hudi] bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357361788
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -85,6 +86,7 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
  JavaRDD<Tuple2<String, HoodieKey>> explodeRecordRDDWithFileComparisons(
      final Map<String, List<BloomIndexFileInfo>> partitionToFileIndexInfo,
      JavaPairRDD<String, String> partitionRecordKeyPairRDD) {
+
    Map<String, String> indexToPartitionMap = new HashMap<>();
    for (Entry<String, List<BloomIndexFileInfo>> entry : partitionToFileIndexInfo.entrySet()) {
      entry.getValue().forEach(indexFile -> indexToPartitionMap.put(indexFile.getFileId(), entry.getKey()));
 
 Review comment:
   This whole "indexToPartitionMap" can be deleted now, right?




[GitHub] [incubator-hudi] bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357366421
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestHoodieClientOnCopyOnWriteStorage.java
 ##
 @@ -336,7 +335,54 @@ public void testDeletes() throws Exception {
   }
 
   /**
-   * Test scenario of new file-group getting added during upsert().
+   * Test update of a record to different partition with Global Index
+   */
+  @Test
+  public void testUpsertToDiffPartitionGlobaIndex() throws Exception {
 
 Review comment:
   [nit] testUpsertToDiffPartitionGlobaIndex -> 
testUpsertToDiffPartitionGlobalIndex




[GitHub] [incubator-hudi] bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357365653
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/ListBasedGlobalIndexFileFilter.java
 ##
 @@ -35,17 +36,14 @@
   }
 
   @Override
-  public Set<String> getMatchingFiles(String partitionPath, String recordKey) {
-    Set<String> toReturn = new HashSet<>();
-    partitionToFileIndexInfo.values().forEach(indexInfos -> {
-      if (indexInfos != null) { // could be null, if there are no files in a given partition yet.
-        // for each candidate file in partition, that needs to be compared.
-        for (BloomIndexFileInfo indexInfo : indexInfos) {
-          if (shouldCompareWithFile(indexInfo, recordKey)) {
-            toReturn.add(indexInfo.getFileId());
-          }
+  public Set<Pair<String, String>> getMatchingFilesAndPartition(String partitionPath, String recordKey) {
+    Set<Pair<String, String>> toReturn = new HashSet<>();
+    partitionToFileIndexInfo.forEach((parition, bloomIndexFileInfoList) -> {
 
 Review comment:
   [nit] parition -> partition




[GitHub] [incubator-hudi] bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357363234
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
      String recordKey = partitionRecordKeyPair._2();
      String partitionPath = partitionRecordKeyPair._1();

-      return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-          .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+      return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+          .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+              new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
          .collect(Collectors.toList());
    }).flatMap(List::iterator);
  }

-
  /**
   * Tagging for global index should only consider the record key.
   */
  @Override
  protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
      JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-    JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-        recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));

-    // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-    // so we do left outer join.
-    return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-        .values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+    Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
+        .mapToPair(e -> new Tuple2<>(e._1.getRecordKey(), Pair.of(e._1, e._2))).collectAsMap();
+
+    JavaRDD<HoodieRecord<T>> toReturn = recordRDD.map(rec -> {
+      if (keyMap.containsKey(rec.getRecordKey())) {
+        Pair<HoodieKey, HoodieRecordLocation> value = keyMap.get(rec.getRecordKey());
+        return getTaggedRecord(new HoodieRecord(value.getLeft(), rec.getData()), Option.of(value
 
 Review comment:
   That is what is happening here, right? This logic only runs when the record key already matches, so replacing the HoodieKey effectively replaces just the partitionPath.




[GitHub] [incubator-hudi] bschell commented on issue #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
bschell commented on issue #1091: [HUDI-389] Fixing Index look up to return 
right partitions for a given key along with fileId with Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#issuecomment-565170220
 
 
   Added a workaround for collectAsMap. Will continue review today.




[GitHub] [incubator-hudi] bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
bschell commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357343158
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
      String recordKey = partitionRecordKeyPair._2();
      String partitionPath = partitionRecordKeyPair._1();

-      return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-          .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+      return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+          .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+              new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
          .collect(Collectors.toList());
    }).flatMap(List::iterator);
  }

-
  /**
   * Tagging for global index should only consider the record key.
   */
  @Override
  protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
      JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-    JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-        recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));

-    // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-    // so we do left outer join.
-    return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-        .values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+    Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
 
 Review comment:
   What about something like this instead? It avoids collecting to the driver and keeps the same operations as the existing code.
   
   
[taglocation.txt](https://github.com/apache/incubator-hudi/files/3957588/taglocation.txt)
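The contents of the attached taglocation.txt are not shown in this thread, so the following is only a plain-collections sketch of the left-outer-join semantics being suggested, not the actual patch. The idea: every record key is kept, and a location is attached only when the index lookup found one, which with RDDs avoids collecting all keys to the driver.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Sketch of left-outer-join tagging: every record key survives,
// locations are attached only where the index found a match.
public class LeftOuterJoinSketch {

  public static Map<String, Optional<String>> tagLocations(
      List<String> recordKeys, Map<String, String> keyToFileId) {
    return recordKeys.stream().collect(Collectors.toMap(
        key -> key,
        key -> Optional.ofNullable(keyToFileId.get(key))));
  }

  public static void main(String[] args) {
    Map<String, String> index = new HashMap<>();
    index.put("key-1", "file-1");
    // key-2 is an insert: no location in the index, but it must be kept.
    System.out.println(tagLocations(Arrays.asList("key-1", "key-2"), index));
  }
}
```

With Spark, the same shape falls out of `recordRDD` keyed by record key `leftOuterJoin`ed against the key-to-location pairs, so no `collectAsMap()` is needed.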
   




[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1103: [MINOR] Unify Lists import

2019-12-12 Thread GitBox
lamber-ken opened a new pull request #1103: [MINOR] Unify Lists import
URL: https://github.com/apache/incubator-hudi/pull/1103
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   While debugging issue https://github.com/apache/incubator-hudi/issues/828, I hit a `java.lang.NoClassDefFoundError`.
   
   ```
   java.lang.NoClassDefFoundError: com/beust/jcommander/internal/Lists
 at 
org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:61)
 at 
org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:232)
 at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:169)
 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:113)
 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:69)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:321)
   ```
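The error above comes from accidentally importing `com.beust.jcommander.internal.Lists`, a shaded helper that is not on the runtime classpath; the JDK equivalents need no extra dependency. A minimal sketch of the swap (the helper method below is illustrative, not the actual extractor code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: replace the jcommander Lists helper with plain JDK collections,
// which removes the runtime dependency entirely.
public class ListsImportSketch {

  // e.g. "2019/12/12" -> ["2019", "12", "12"]
  public static List<String> splitPartitionPath(String path) {
    // Instead of: List<String> parts = Lists.newArrayList(path.split("/"));
    return new ArrayList<>(Arrays.asList(path.split("/")));
  }
}
```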
   
   ## Brief change log
   
 - Unify Lists import
   
   ## Verify this pull request
   
   This pull request is code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357202270
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestHoodieClientOnCopyOnWriteStorage.java
 ##
 @@ -336,7 +335,54 @@ public void testDeletes() throws Exception {
  }

  /**
-   * Test scenario of new file-group getting added during upsert().
+   * Test update of a record to different partition with Global Index
+   */
+  @Test
+  public void testUpsertToDiffPartitionGlobaIndex() throws Exception {
+    HoodieWriteClient client = getHoodieWriteClient(getConfig(IndexType.GLOBAL_BLOOM), false);
+    /**
+     * Write 1 (inserts and deletes) Write actual 200 insert records and ignore 100 delete records
+     */
+    String initCommitTime = "000";
+    String newCommitTime = "001";
+    List<HoodieRecord> inserts1 = dataGen.generateInserts(newCommitTime, 10);
+
+    // Write 1 (only inserts)
+    client.startCommitWithTime(newCommitTime);
+    JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(inserts1, 1);
+    System.out.println("Records to be inserted " + writeRecords.collect());
+
+    JavaRDD<WriteStatus> result = client.insert(writeRecords, newCommitTime);
+    List<WriteStatus> statuses = result.collect();
+    assertNoWriteErrors(statuses);
+
+    // check the partition metadata is written out
+    assertPartitionMetadata(HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS, fs);
+
+    /**
+     * Write 2 updates
+     */
+    String prevCommitTime = newCommitTime;
+    newCommitTime = "004";
+
+    // Write 1 (only inserts)
+    client.startCommitWithTime(newCommitTime);
+
+    List<HoodieRecord> updates1 = dataGen.generateUpdatesWithDiffPartition(newCommitTime, inserts1);
+    JavaRDD<HoodieRecord> updateRecords = jsc.parallelize(updates1, 1);
+
+    System.out.println("Records to be updated " + updateRecords.collect());
 
 Review comment:
   log?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357212983
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/IntervalTreeBasedGlobalIndexFileFilter.java
 ##
 @@ -58,10 +67,12 @@
   }
 
   @Override
-  public Set<String> getMatchingFiles(String partitionPath, String recordKey) {
-    Set<String> toReturn = new HashSet<>();
-    toReturn.addAll(indexLookUpTree.getMatchingIndexFiles(recordKey));
-    toReturn.addAll(filesWithNoRanges);
+  public Set<Pair<String, String>> getMatchingFilesAndPartition(String partitionPath, String recordKey) {
+    Set<String> matchingFiles = new HashSet<>();
 
 Review comment:
   Can we try to use the Java 8 collect() APIs to implement these?
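One way the Java 8 suggestion could look: concatenate the two sources (tree matches and the files without ranges) and collect the (partition, fileId) pairs in a single stream pipeline, instead of building a HashSet with two addAll() calls. The types below are plain JDK stand-ins for the Hudi Pair, purely illustrative:

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: one stream pipeline replacing two addAll() calls plus a loop.
public class CollectSketch {

  public static Set<Map.Entry<String, String>> matchingFilesAndPartition(
      String partitionPath, Set<String> treeMatches, Set<String> filesWithNoRanges) {
    return Stream.concat(treeMatches.stream(), filesWithNoRanges.stream())
        .map(fileId -> new AbstractMap.SimpleImmutableEntry<>(partitionPath, fileId))
        .collect(Collectors.toSet());
  }
}
```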




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357201740
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -304,6 +304,24 @@ public HoodieRecord generateUpdateRecord(HoodieKey key, String commitTime) throw
     return updates;
   }
 
+  public List<HoodieRecord> generateUpdatesWithDiffPartition(String commitTime, List<HoodieRecord> baseRecords)
 
 Review comment:
   Let's keep this with the actual test that uses it? The only thing needed here seems to be the `partitionPaths`.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357202347
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestHoodieClientOnCopyOnWriteStorage.java
 ##
 @@ -336,7 +335,54 @@ public void testDeletes() throws Exception {
  }

  /**
-   * Test scenario of new file-group getting added during upsert().
+   * Test update of a record to different partition with Global Index
+   */
+  @Test
+  public void testUpsertToDiffPartitionGlobaIndex() throws Exception {
+    HoodieWriteClient client = getHoodieWriteClient(getConfig(IndexType.GLOBAL_BLOOM), false);
+    /**
+     * Write 1 (inserts and deletes) Write actual 200 insert records and ignore 100 delete records
+     */
+    String initCommitTime = "000";
+    String newCommitTime = "001";
+    List<HoodieRecord> inserts1 = dataGen.generateInserts(newCommitTime, 10);
+
+    // Write 1 (only inserts)
+    client.startCommitWithTime(newCommitTime);
+    JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(inserts1, 1);
+    System.out.println("Records to be inserted " + writeRecords.collect());
 
 Review comment:
   log?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357210650
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestHoodieClientOnCopyOnWriteStorage.java
 ##
 @@ -336,7 +335,54 @@ public void testDeletes() throws Exception {
   }
 
   /**
-   * Test scenario of new file-group getting added during upsert().
+   * Test update of a record to different partition with Global Index
+   */
+  @Test
+  public void testUpsertToDiffPartitionGlobaIndex() throws Exception {
 
 Review comment:
   We should be able to test this functionality at the index level, right? More test cases here keep adding to the test runtime... either way, can we test this at the global index level?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357207910
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
      String recordKey = partitionRecordKeyPair._2();
      String partitionPath = partitionRecordKeyPair._1();

-      return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-          .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+      return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+          .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+              new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
          .collect(Collectors.toList());
    }).flatMap(List::iterator);
  }

-
  /**
   * Tagging for global index should only consider the record key.
   */
  @Override
  protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
      JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-    JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-        recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));

-    // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-    // so we do left outer join.
-    return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-        .values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+    Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
 
 Review comment:
   This will bring all keys into memory and can OOM the driver... we need to rethink this.
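A back-of-the-envelope version of the concern: collectAsMap() materializes one map entry per incoming record key on the driver. The ~200 bytes per entry below is an assumption (key string + HoodieKey + location + HashMap overhead vary by JVM); the point is only the linear scaling.

```java
// Rough driver-heap estimate for collecting the key->location map.
// bytesPerEntry is an assumed average, not a measured figure.
public class DriverMemorySketch {

  public static long approxDriverBytes(long numKeys, long bytesPerEntry) {
    return numKeys * bytesPerEntry;
  }

  public static void main(String[] args) {
    // 1 billion keys at ~200 bytes each is on the order of 200 GB of heap.
    System.out.println(approxDriverBytes(1_000_000_000L, 200L));
  }
}
```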




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1091: [HUDI-389] Fixing 
Index look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r357208499
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +100,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
      String recordKey = partitionRecordKeyPair._2();
      String partitionPath = partitionRecordKeyPair._1();

-      return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-          .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+      return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+          .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+              new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
          .collect(Collectors.toList());
    }).flatMap(List::iterator);
  }

-
  /**
   * Tagging for global index should only consider the record key.
   */
  @Override
  protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
      JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-    JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
-        recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));

-    // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-    // so we do left outer join.
-    return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-        .values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+    Map<String, Pair<HoodieKey, HoodieRecordLocation>> keyMap = keyFilenamePairRDD
+        .mapToPair(e -> new Tuple2<>(e._1.getRecordKey(), Pair.of(e._1, e._2))).collectAsMap();
+
+    JavaRDD<HoodieRecord<T>> toReturn = recordRDD.map(rec -> {
+      if (keyMap.containsKey(rec.getRecordKey())) {
+        Pair<HoodieKey, HoodieRecordLocation> value = keyMap.get(rec.getRecordKey());
+        return getTaggedRecord(new HoodieRecord(value.getLeft(), rec.getData()), Option.of(value
 
 Review comment:
   All we need to do is change the partitionPath in the tagged record, right?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r357194820
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/BloomFilter.java
 ##
 @@ -16,34 +16,35 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.common;
-
-import org.junit.Test;
-
-import java.io.IOException;
+package org.apache.hudi.common.bloom.filter;
 
 /**
- * Tests bloom filter {@link BloomFilter}.
+ * A Bloom filter interface
  */
-public class TestBloomFilter {
+public interface BloomFilter {
 
 Review comment:
   It's funny how git thinks this is a file move...




[GitHub] [incubator-hudi] vinothchandar commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-12 Thread GitBox
vinothchandar commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-565048509
 
 
   We can merge once these are addressed... Let's get this across the line! :)




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r357197955
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java
 ##
 @@ -29,6 +29,7 @@
 import org.apache.hudi.common.versioning.clean.CleanV2MigrationHandler;
 
 public class CleanerUtils {
+
 
 Review comment:
   nit: extra line




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r357197035
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/HoodieDynamicBoundedBloomFilter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import javax.xml.bind.DatatypeConverter;
+import org.apache.hadoop.util.bloom.Key;
+import org.apache.hudi.exception.HoodieIndexException;
+
+/**
+ * Hoodie's dynamic bloom bounded bloom filter
+ */
+public class HoodieDynamicBoundedBloomFilter extends LocalDynamicBloomFilter 
implements BloomFilter {
+
+  public static final String TYPE_CODE_PREFIX = "DYNAMIC";
 
 Review comment:
   Do we still need this field?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r357195468
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/BloomFilterTypeCode.java
 ##
 @@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+/**
+ * Bloom filter type codes
+ */
+public enum BloomFilterTypeCode {
+  SIMPLE,
 
 Review comment:
   add a comment here, to warn people not to change this order.. (it will mess 
up the type decoding)?
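   To make the concern concrete, here is a minimal self-contained sketch (a hypothetical `TypeCode` enum and `decode` helper standing in for the real serialization code, not Hudi's actual implementation) of why ordinal-based decoding depends on declaration order:

```java
public class EnumOrdinalDemo {
    // Hypothetical type codes, standing in for BloomFilterTypeCode.
    public enum TypeCode { SIMPLE, DYNAMIC_V0 }

    // If the serialized form persists ordinal(), decoding maps the stored
    // int back through declaration order -- so reordering or inserting
    // constants silently changes what previously written data decodes to.
    public static TypeCode decode(int storedOrdinal) {
        return TypeCode.values()[storedOrdinal];
    }

    public static void main(String[] args) {
        int stored = TypeCode.SIMPLE.ordinal(); // persisted today as 0
        System.out.println(decode(stored));     // prints SIMPLE
    }
}
```

   A comment on the enum warning against reordering makes this invariant visible at the point where someone would otherwise casually insert a new constant.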


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357177698
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -87,11 +90,14 @@ public HoodieTableMetaClient(Configuration conf, String 
basePath) throws Dataset
   }
 
   public HoodieTableMetaClient(Configuration conf, String basePath, boolean 
loadActiveTimelineOnLoad) {
-this(conf, basePath, loadActiveTimelineOnLoad, 
ConsistencyGuardConfig.newBuilder().build());
+this(conf, basePath, loadActiveTimelineOnLoad, 
ConsistencyGuardConfig.newBuilder().build(),
+// Readers will use latest version
 
 Review comment:
   Even writers use `HoodieTableMetaClient` right? can you clarify this comment?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357158027
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
 ##
 @@ -74,6 +75,27 @@
  "name":"version",
  "type":["int", "null"],
  "default": 1
+  },
+  {
+ "name":"hoodieCompactionPlan",
+ "type":[
+"null",
+"HoodieCompactionPlan"
+ ],
+ "default": null
+  },
+  {
+ "name":"hoodieCleanerPlan",
+ "type":[
+"null",
+"HoodieCleanerPlan"
+ ],
+ "default": null
+  },
+  {
+ "name":"actionState",
 
 Review comment:
   is the instantTime generally available some place? (can also add 
`instantTime`, `actionType`) later


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357130439
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -61,11 +62,11 @@
 
  public static final SimpleDateFormat COMMIT_FORMATTER = new SimpleDateFormat("yyyyMMddHHmmss");
 
-  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE =
-  new HashSet<>(Arrays.asList(new String[] {COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, DELTA_COMMIT_EXTENSION,
-  INFLIGHT_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION, INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION,
-  INFLIGHT_CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, INFLIGHT_COMPACTION_EXTENSION,
-  REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION}));
+  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new HashSet<>(Arrays.asList(
+  new String[]{COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, REQUESTED_COMMIT_EXTENTSION, DELTA_COMMIT_EXTENSION,
+  INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENTSION, SAVEPOINT_EXTENSION,
 
 Review comment:
   typo: REQUESTED_DELTA_COMMIT_EXTENSION


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357157168
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
 ##
 @@ -37,6 +37,7 @@
  "default": null
   },
   {
+ /** DEPRECATED **/
 
 Review comment:
   why is this deprecated? Only compactionPlan is sufficient?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357131724
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -48,6 +49,7 @@
 public class HoodieWriteConfig extends DefaultHoodieConfig {
 
   public static final String TABLE_NAME = "hoodie.table.name";
+  private static final String OVERRIDDEN_TIMELINE_LAYOUT_VERSION = 
"hoodie.timeline.layout.version";
 
 Review comment:
   simply `TIMELINE_LAYOUT_VERSION`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357130581
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -61,11 +62,11 @@
 
  public static final SimpleDateFormat COMMIT_FORMATTER = new SimpleDateFormat("yyyyMMddHHmmss");
 
-  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE =
-  new HashSet<>(Arrays.asList(new String[] {COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, DELTA_COMMIT_EXTENSION,
-  INFLIGHT_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION, INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION,
-  INFLIGHT_CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, INFLIGHT_COMPACTION_EXTENSION,
-  REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION}));
+  public static final Set<String> VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new HashSet<>(Arrays.asList(
+  new String[]{COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, REQUESTED_COMMIT_EXTENTSION, DELTA_COMMIT_EXTENSION,
 
 Review comment:
   REQUESTED_COMMIT_EXTENSION


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357160890
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/TimelineLayoutVersion.java
 ##
 @@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.google.common.base.Preconditions;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Metadata Layout Version. Add new version when timeline format changes
+ */
+public class TimelineLayoutVersion implements Serializable, Comparable<TimelineLayoutVersion> {
+
+  public static final Integer VERSION_0 = 0; // pre 0.5.1  version format
+  public static final Integer VERSION_1 = 1; // current version with no renames
+
+  public static final Integer CURR_VERSION = VERSION_1;
+  public static final TimelineLayoutVersion LATEST_TIMELINE_LAYOUT_VERSION = 
new TimelineLayoutVersion(CURR_VERSION);
 
 Review comment:
   rename to `CURR_LAYOUT_VERSION`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357125859
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -792,23 +798,25 @@ public void restoreToInstant(final String instantTime) 
throws HoodieRollbackExce
 .filter(instant -> 
HoodieActiveTimeline.GREATER.test(instant.getTimestamp(), instantTime))
 .collect(Collectors.toList());
 // Start a rollback instant for all commits to be rolled back
-String startRollbackInstant = startInstant();
+String startRollbackInstant = HoodieActiveTimeline.createNewCommitTime();
 
 Review comment:
   rename: HoodieActiveTimeline.createNewInstantTime() ? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357181681
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/TimelineLayout.java
 ##
 @@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table;
+
+import org.apache.hudi.common.model.TimelineLayoutVersion;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+
+/**
+ * Timeline Layout responsible for applying specific filters when generating 
timeline instants.
+ */
+public abstract class TimelineLayout implements Serializable {
+
+  private static final Map<TimelineLayoutVersion, TimelineLayout> LAYOUT_MAP = new HashMap<>();
+
+  static {
+LAYOUT_MAP.put(new TimelineLayoutVersion(TimelineLayoutVersion.VERSION_0), 
new TimelineLayoutV0());
+LAYOUT_MAP.put(new TimelineLayoutVersion(TimelineLayoutVersion.VERSION_1), 
new TimelineLayoutV1());
+  }
+
+  public static TimelineLayout getLayout(TimelineLayoutVersion version) {
+return LAYOUT_MAP.get(version);
+  }
+
+  public abstract Stream<HoodieInstant> filterHoodieInstants(Stream<HoodieInstant> instantStream);
+
+  /**
+   * Table Layout where state transitions are managed by renaming files.
+   */
+  private static class TimelineLayoutV0 extends TimelineLayout {
 
 Review comment:
   lets add a small unit test for this, if it does not exist already
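   As a sketch of what such a unit test could assert (using a simplified `Instant` stand-in rather than Hudi's `HoodieInstant`, so the grouping logic itself is the thing under test): a V1-style layout, where all state files for an action coexist on the filesystem, should surface exactly one instant per (timestamp, action) group, at its highest state.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class LayoutFilterSketch {
    // Simplified stand-in for HoodieInstant: state ordinal grows
    // REQUESTED(0) < INFLIGHT(1) < COMPLETED(2).
    public record Instant(String timestamp, String action, int state) {}

    // V1-style filter: for each (timestamp, action) keep only the instant
    // in the highest state, mirroring the no-rename layout where requested,
    // inflight, and completed files all remain on disk.
    public static List<Instant> filterV1(List<Instant> instants) {
        return instants.stream()
                .collect(Collectors.groupingBy(i -> i.timestamp() + "|" + i.action(),
                        Collectors.maxBy(Comparator.comparingInt(Instant::state))))
                .values().stream()
                .flatMap(Optional::stream)
                .sorted(Comparator.comparing(Instant::timestamp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Instant> all = List.of(
                new Instant("001", "commit", 0),
                new Instant("001", "commit", 1),
                new Instant("001", "commit", 2),
                new Instant("002", "clean", 0));
        List<Instant> filtered = filterV1(all);
        // One instant per group, each at its highest observed state.
        System.out.println(filtered.size());         // 2
        System.out.println(filtered.get(0).state()); // 2
    }
}
```

   A real test against `TimelineLayout` would build the equivalent `HoodieInstant` fixtures and assert the same shape of output for both V0 (pass-through) and V1.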


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357160119
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
 ##
 @@ -112,6 +116,10 @@ public static void createHoodieProperties(FileSystem fs, 
Path metadataFolder, Pr
   if (!properties.containsKey(HOODIE_ARCHIVELOG_FOLDER_PROP_NAME)) {
 properties.setProperty(HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
DEFAULT_ARCHIVELOG_FOLDER);
   }
+  if (!properties.containsKey(HOODIE_TIMELINE_LAYOUT_VERSION)) {
 
 Review comment:
   Does this mean that only new datasets will get version 1? What happens to 
existing datasets? They stick to version 0?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357181383
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -414,23 +436,41 @@ public String getCommitActionType() {
   /**
* Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects.
*
-   * @param fs FileSystem
-   * @param metaPath Meta Path where hoodie instants are present
* @param includedExtensions Included hoodie extensions
+   * @param applyLayoutVersionFilters Depending on Timeline layout version, if 
there are multiple states for the same
+   * action instant, only include the highest state
* @return List of Hoodie Instants generated
* @throws IOException in case of failure
*/
-  public static List<HoodieInstant> scanHoodieInstantsFromFileSystem(FileSystem fs, Path metaPath,
-  Set<String> includedExtensions) throws IOException {
-return Arrays.stream(HoodieTableMetaClient.scanFiles(fs, metaPath, path -> 
{
-  // Include only the meta files with extensions that needs to be included
-  String extension = FSUtils.getFileExtension(path.getName());
-  return includedExtensions.contains(extension);
-})).sorted(Comparator.comparing(
-// Sort the meta-data by the instant time (first part of the file name)
-fileStatus -> FSUtils.getInstantTime(fileStatus.getPath().getName(
-// create HoodieInstantMarkers from FileStatus, which extracts 
properties
-.map(HoodieInstant::new).collect(Collectors.toList());
+  public List<HoodieInstant> scanHoodieInstantsFromFileSystem(Set<String> includedExtensions,
+  boolean applyLayoutVersionFilters) throws IOException {
+return scanHoodieInstantsFromFileSystem(new Path(metaPath), 
includedExtensions, applyLayoutVersionFilters);
+  }
+
+  /**
+   * Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects.
+   *
+   * @param timelinePath MetaPath where instant files are stored
+   * @param includedExtensions Included hoodie extensions
+   * @param applyLayoutVersionFilters Depending on Timeline layout version, if 
there are multiple states for the same
+   * action instant, only include the highest state
+   * @return List of Hoodie Instants generated
+   * @throws IOException in case of failure
+   */
+  public List<HoodieInstant> scanHoodieInstantsFromFileSystem(Path timelinePath, Set<String> includedExtensions,
+  boolean applyLayoutVersionFilters) throws IOException {
+Stream<HoodieInstant> instantStream = Arrays.stream(
+HoodieTableMetaClient
+.scanFiles(getFs(), timelinePath, path -> {
+  // Include only the meta files with extensions that needs to be 
included
+  String extension = FSUtils.getFileExtension(path.getName());
+  return includedExtensions.contains(extension);
+})).map(HoodieInstant::new);
+
+if (applyLayoutVersionFilters) {
+  instantStream = 
TimelineLayout.getLayout(getTimelineLayoutVersion()).filterHoodieInstants(instantStream);
 
 Review comment:
   Seems the `applyLayoutVersionFilters` is set selectively depending on which HoodieActiveTimeline constructor is invoked? Would this be fragile? Thinking out loud: applying the filters on V0 has no effect, since there is nothing to get rid of. The only thing that could go wrong is not filtering V1.. hmmm


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357162132
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -274,25 +286,35 @@ public synchronized HoodieArchivedTimeline 
getArchivedTimeline() {
*/
   public static HoodieTableMetaClient initTableType(Configuration hadoopConf, 
String basePath, String tableType,
   String tableName, String archiveLogFolder) throws IOException {
-HoodieTableType type = HoodieTableType.valueOf(tableType);
-Properties properties = new Properties();
-properties.put(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, tableName);
-properties.put(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, type.name());
-properties.put(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
archiveLogFolder);
-return HoodieTableMetaClient.initDatasetAndGetMetaClient(hadoopConf, 
basePath, properties);
+return initTableType(hadoopConf, basePath, 
HoodieTableType.valueOf(tableType), tableName,
+archiveLogFolder, null, null);
   }
 
   /**
* Helper method to initialize a given path, as a given storage type and 
table name.
*/
   public static HoodieTableMetaClient initTableType(Configuration hadoopConf, 
String basePath,
   HoodieTableType tableType, String tableName, String payloadClassName) 
throws IOException {
+return initTableType(hadoopConf, basePath, tableType, tableName, null, 
payloadClassName, null);
+  }
+
+  public static HoodieTableMetaClient initTableType(Configuration hadoopConf, 
String basePath,
+  HoodieTableType tableType, String tableName, String archiveLogFolder, 
String payloadClassName,
+  Integer timelineLayoutVersion) throws IOException {
 Properties properties = new Properties();
 properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
tableName);
 properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
tableType.name());
-if (tableType == HoodieTableType.MERGE_ON_READ) {
+if ((tableType == HoodieTableType.MERGE_ON_READ) && (payloadClassName != 
null)) {
 
 Review comment:
   Redundant braces?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357117901
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/DatasetsCommand.java
 ##
 @@ -85,6 +86,8 @@ public String createTable(
   @CliOption(key = {"tableName"}, mandatory = true, help = "Hoodie Table 
Name") final String name,
   @CliOption(key = {"tableType"}, unspecifiedDefaultValue = 
"COPY_ON_WRITE",
   help = "Hoodie Table Type. Must be one of : COPY_ON_WRITE or 
MERGE_ON_READ") final String tableTypeStr,
+  @CliOption(key = {"archiveLogFolder"}, help = "Folder Name for storing 
archived timeline") String archiveFolder,
+  @CliOption(key = {"layoutVersion"}, help = "Specific Layouyt Version to 
use") Integer layoutVersion,
 
 Review comment:
   typo: layout version to use


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357127905
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
 ##
 @@ -32,7 +35,24 @@
  *
  * @see HoodieTimeline
  */
-public class HoodieInstant implements Serializable {
+public class HoodieInstant implements Serializable, Comparable<HoodieInstant> {
+
+  /**
+   * A COMPACTION action eventually becomes COMMIT when completed. So, when grouping instants
+   * for state transitions, this needs to be taken into account
+   */
+  private static final Map<String, String> COMPARABLE_ACTIONS = new ImmutableMap.Builder<String, String>()
+  .put(HoodieTimeline.COMPACTION_ACTION, HoodieTimeline.COMMIT_ACTION).build();
+
+  public static final Comparator<HoodieInstant> ACTION_COMPARATOR =
+  Comparator.comparing(instant -> getCompatibleAction(instant.getAction()));
+
+  public static final Comparator<HoodieInstant> COMPARATOR = Comparator.comparing(HoodieInstant::getTimestamp)
+  .thenComparing(ACTION_COMPARATOR).thenComparing(HoodieInstant::getState);
+
+  public static final String getCompatibleAction(String action) {
 
 Review comment:
   consistent naming: compatible or comparable? lets pick one?
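   The intent of the comparator chain above can be illustrated with a self-contained sketch (simplified `Instant` record, not Hudi's `HoodieInstant`): mapping `compaction` to `commit` before comparing makes the two actions sort as equals at the same timestamp, which is exactly what the state-transition grouping needs.

```java
import java.util.Comparator;
import java.util.Map;

public class ActionComparatorSketch {
    // Simplified stand-in for HoodieInstant.
    public record Instant(String timestamp, String action, String state) {}

    // compaction eventually becomes commit, so compare the two as equal.
    static final Map<String, String> COMPARABLE_ACTIONS = Map.of("compaction", "commit");

    static String comparableAction(String action) {
        return COMPARABLE_ACTIONS.getOrDefault(action, action);
    }

    // Timestamp first, then the normalized action, then state.
    public static final Comparator<Instant> COMPARATOR =
            Comparator.comparing(Instant::timestamp)
                    .thenComparing(i -> comparableAction(i.action()))
                    .thenComparing(Instant::state);

    public static void main(String[] args) {
        Instant compaction = new Instant("001", "compaction", "REQUESTED");
        Instant commit = new Instant("001", "commit", "REQUESTED");
        // Same timestamp and state: compaction compares equal to commit.
        System.out.println(COMPARATOR.compare(compaction, commit)); // 0
    }
}
```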


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-12 Thread GitBox
vinothchandar commented on a change in pull request #1009:  [HUDI-308] Avoid 
Renames for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r357132086
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -733,6 +746,11 @@ public HoodieWriteConfig build() {
   setDefaultOnCondition(props, !isConsistencyGuardSet,
   ConsistencyGuardConfig.newBuilder().fromProperties(props).build());
 
+  String layoutVersion = 
props.getProperty(OVERRIDDEN_TIMELINE_LAYOUT_VERSION);
 
 Review comment:
   lets handle this consistently as with other defaults?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1102: [HUDI-391] Rename module name from hudi-bench to hudi-test-suite

2019-12-12 Thread GitBox
yanghua commented on issue #1102: [HUDI-391] Rename module name from hudi-bench 
to hudi-test-suite
URL: https://github.com/apache/incubator-hudi/pull/1102#issuecomment-565020884
 
 
   @n3nash Can you review this PR?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-391) Rename module name from hudi-bench to hudi-test-suite

2019-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-391:

Labels: pull-request-available  (was: )

> Rename module name from hudi-bench to hudi-test-suite
> -
>
> Key: HUDI-391
> URL: https://issues.apache.org/jira/browse/HUDI-391
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua opened a new pull request #1102: [HUDI-391] Rename module name from hudi-bench to hudi-test-suite

2019-12-12 Thread GitBox
yanghua opened a new pull request #1102: [HUDI-391] Rename module name from 
hudi-bench to hudi-test-suite
URL: https://github.com/apache/incubator-hudi/pull/1102
 
 
   
   
   ## What is the purpose of the pull request
   
   *Rename module name from hudi-bench to hudi-test-suite*
   
   ## Brief change log
   
 - *Rename module name from hudi-bench to hudi-test-suite*
   
   ## Verify this pull request
   
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-233) Redo log statements using SLF4J

2019-12-12 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994668#comment-16994668
 ] 

lamber-ken edited comment on HUDI-233 at 12/12/19 1:50 PM:
---

Hi, [~xleesf] [~vinoth], I think we can use `maven-shade-plugin` to relocate 
slf4j and log4j; that way, it will work well with other frameworks. :D

For example
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.0</version>
  <configuration>
    <!-- element name reconstructed; original tags were lost in archiving -->
    <createDependencyReducedPom>false</createDependencyReducedPom>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.log4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.log4j</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.slf4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.slf4j</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}


was (Author: lamber-ken):
Hi, [~xleesf] [~vinoth], I think we can use `maven-shade-plugin` to relocate 
slf4j and log4j; that way, it will work well with other frameworks. :D

For example
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.log4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.log4j</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.slf4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.slf4j</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}

> Redo log statements using SLF4J 
> 
>
> Key: HUDI-233
> URL: https://issues.apache.org/jira/browse/HUDI-233
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> Currently we are not employing variable substitution aggressively in the 
> project, à la 
> {code:java}
> LogManager.getLogger(SomeName.class.getName()).info("Message: {}, Detail: 
> {}", message, detail);
> {code}
> This can improve performance, since the string concatenation can be deferred 
> until the logging is actually in effect.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-233) Redo log statements using SLF4J

2019-12-12 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994668#comment-16994668
 ] 

lamber-ken edited comment on HUDI-233 at 12/12/19 1:35 PM:
---

Hi, [~xleesf] [~vinoth], I think we can use `maven-shade-plugin` to relocate 
slf4j and log4j; that way, it will work well with other frameworks. :D

For example
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.log4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.log4j</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.slf4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.slf4j</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}


was (Author: lamber-ken):
Hi, [~xleesf] [~vinoth], I think we can use `maven-shade-plugin` to relocate 
slf4j and log4j; that way, it will work well with other frameworks. :D

For example
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.log4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.log4j</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.slf4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.slf4j</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}

> Redo log statements using SLF4J 
> 
>
> Key: HUDI-233
> URL: https://issues.apache.org/jira/browse/HUDI-233
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> Currently we are not employing variable substitution aggressively in the 
> project, à la 
> {code:java}
> LogManager.getLogger(SomeName.class.getName()).info("Message: {}, Detail: 
> {}", message, detail);
> {code}
> This can improve performance, since the string concatenation can be deferred 
> until the logging is actually in effect.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-233) Redo log statements using SLF4J

2019-12-12 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994668#comment-16994668
 ] 

lamber-ken commented on HUDI-233:
-

Hi, [~xleesf] [~vinoth], I think we can use `maven-shade-plugin` to relocate 
slf4j and log4j; that way, it will work well with other frameworks. :D

For example
{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.log4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.log4j</shadedPattern>
          </relocation>
          <relocation>
            <pattern>org.slf4j</pattern>
            <shadedPattern>org.apache.hudi.shade.log.slf4j</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
{code}

> Redo log statements using SLF4J 
> 
>
> Key: HUDI-233
> URL: https://issues.apache.org/jira/browse/HUDI-233
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> Currently we are not employing variable substitution aggressively in the 
> project, à la 
> {code:java}
> LogManager.getLogger(SomeName.class.getName()).info("Message: {}, Detail: 
> {}", message, detail);
> {code}
> This can improve performance, since the string concatenation can be deferred 
> until the logging is actually in effect.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] Akshay2Agarwal opened a new issue #1101: [SUPPORT] Does Hudi support incremental changes view in HoodieReadClient

2019-12-12 Thread GitBox
Akshay2Agarwal opened a new issue #1101: [SUPPORT] Does Hudi support 
incremental changes view in HoodieReadClient
URL: https://github.com/apache/incubator-hudi/issues/1101
 
 
   **I do not see any support for a begin time for the incremental view in 
HoodieReadClient. Does it support the incremental view?**
   
   I was trying out Hudi's read client 
(hudi-client/src/main/java/org/apache/hudi/HoodieReadClient.java) but found no 
method to perform time-based queries with it.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Go to hudi-client/src/main/java/org/apache/hudi/HoodieReadClient.java
   2. There is only the readROView method; there is no support for creating an 
incremental view
   
   **Environment Description**
   
   * Hudi version : 0.5.0-incubating
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2019-12-12 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-565001918
 
 
   I have rebased the old `hudi_test_suite_refactor` branch onto master (fixed 
conflicts and checkstyle errors). From now on, let's track the work in this 
PR. FYI @n3nash @vinothchandar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (HUDI-399) Fix conflicts of test suite basic implementation based on master branch and fix checkstyle issues

2019-12-12 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-399.
-
Resolution: Won't Fix

I have rebased HUDI-394 and fixed the checkstyle errors in that stage. 

> Fix conflicts of test suite basic implementation based on master branch and 
> fix checkstyle issues
> -
>
> Key: HUDI-399
> URL: https://issues.apache.org/jira/browse/HUDI-399
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> There are many conflicts when rebasing on the master branch; we need to fix 
> them:
>  
> {code:java}
> hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
> hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> hudi-hive/src/test/java/org/apache/hudi/hive/util/HiveTestService.java
> hudi-spark/src/main/java/org/apache/hudi/ComplexKeyGenerator.java
> hudi-spark/src/main/java/org/apache/hudi/EmptyHoodieRecordPayload.java
> hudi-spark/src/main/java/org/apache/hudi/KeyGenerator.java
> hudi-spark/src/main/java/org/apache/hudi/NonpartitionedKeyGenerator.java
> hudi-spark/src/main/java/org/apache/hudi/SimpleKeyGenerator.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerMetrics.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/keygen/TimestampBasedKeyGenerator.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-322) DeltaSteamer should pick checkpoints off only deltacommits for MOR tables

2019-12-12 Thread Shahida Khan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994635#comment-16994635
 ] 

Shahida Khan commented on HUDI-322:
---

Thank you both, will keep you posted regarding the progress. :)

> DeltaSteamer should pick checkpoints off only deltacommits for MOR tables
> -
>
> Key: HUDI-322
> URL: https://issues.apache.org/jira/browse/HUDI-322
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: Shahida Khan
>Priority: Major
> Fix For: 0.5.1
>
>
> When using DeltaStreamer with MOR, the checkpoints would be written out to 
> .deltacommit files (and not .commit files). We need to confirm the behavior 
> and change the code so that it reads from the correct metadata file.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-322) DeltaSteamer should pick checkpoints off only deltacommits for MOR tables

2019-12-12 Thread Shahida Khan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahida Khan updated HUDI-322:
--
Status: In Progress  (was: Open)

> DeltaSteamer should pick checkpoints off only deltacommits for MOR tables
> -
>
> Key: HUDI-322
> URL: https://issues.apache.org/jira/browse/HUDI-322
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: Shahida Khan
>Priority: Major
> Fix For: 0.5.1
>
>
> When using DeltaStreamer with MOR, the checkpoints would be written out to 
> .deltacommit files (and not .commit files). We need to confirm the behavior 
> and change the code so that it reads from the correct metadata file.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-394) Provide a basic implementation of test suite

2019-12-12 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993472#comment-16993472
 ] 

vinoyang edited comment on HUDI-394 at 12/12/19 1:05 PM:
-

Implemented via hudi_test_suite_refactor branch: 
eaaf3f623a452e6b8f82532231c48798d97cab6c


was (Author: yanghua):
Implemented via hudi_test_suite_refactor branch: 
c2c9347a918f7f17bdb6736cb63ba617849716ce

> Provide a basic implementation of test suite
> 
>
> Key: HUDI-394
> URL: https://issues.apache.org/jira/browse/HUDI-394
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It provides:
>  * Flexible schema payload generation
>  * Different types of workload generation such as inserts, upserts etc
>  * Post process actions to perform validations
>  * Interoperability of test suite to use HoodieWriteClient and 
> HoodieDeltaStreamer so both code paths can be tested
>  * Custom workload sequence generator
>  * Ability to perform parallel operations, such as upsert and compaction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-394) Provide a basic implementation of test suite

2019-12-12 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reopened HUDI-394:
---

> Provide a basic implementation of test suite
> 
>
> Key: HUDI-394
> URL: https://issues.apache.org/jira/browse/HUDI-394
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It provides:
>  * Flexible schema payload generation
>  * Different types of workload generation such as inserts, upserts etc
>  * Post process actions to perform validations
>  * Interoperability of test suite to use HoodieWriteClient and 
> HoodieDeltaStreamer so both code paths can be tested
>  * Custom workload sequence generator
>  * Ability to perform parallel operations, such as upsert and compaction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua opened a new pull request #1100: [HUDI-394] Provide a basic implementation of test suite

2019-12-12 Thread GitBox
yanghua opened a new pull request #1100: [HUDI-394] Provide a basic 
implementation of test suite
URL: https://github.com/apache/incubator-hudi/pull/1100
 
 
   - Flexible schema payload generation
   - Different types of workload generation such as inserts, upserts etc
   - Post process actions to perform validations
   - Interoperability of test suite to use HoodieWriteClient and 
HoodieDeltaStreamer so both code paths can be tested
   - Custom workload sequence generator
   - Ability to perform parallel operations, such as upsert and compaction


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-394) Provide a basic implementation of test suite

2019-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-394:

Labels: pull-request-available  (was: )

> Provide a basic implementation of test suite
> 
>
> Key: HUDI-394
> URL: https://issues.apache.org/jira/browse/HUDI-394
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>
> It provides:
>  * Flexible schema payload generation
>  * Different types of workload generation such as inserts, upserts etc
>  * Post process actions to perform validations
>  * Interoperability of test suite to use HoodieWriteClient and 
> HoodieDeltaStreamer so both code paths can be tested
>  * Custom workload sequence generator
>  * Ability to perform parallel operations, such as upsert and compaction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch hudi_test_suite_refactor created (now eaaf3f6)

2019-12-12 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


  at eaaf3f6  [HUDI-394] Provide a basic implementation of test suite

No new revisions were added by this update.



[incubator-hudi] branch feature_hudi_test_suite created (now eaaf3f6)

2019-12-12 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch feature_hudi_test_suite
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


  at eaaf3f6  [HUDI-394] Provide a basic implementation of test suite

This branch includes the following new commits:

 new eaaf3f6  [HUDI-394] Provide a basic implementation of test suite

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[jira] [Updated] (HUDI-386) Refactor hudi scala checkstyle rules

2019-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-386:

Labels: pull-request-available  (was: )

> Refactor hudi scala checkstyle rules
> 
>
> Key: HUDI-386
> URL: https://issues.apache.org/jira/browse/HUDI-386
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>
> Refactor hudi scala checkstyle rules



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1099: [HUDI-386] Refactor hudi scala checkstyle rules

2019-12-12 Thread GitBox
lamber-ken opened a new pull request #1099: [HUDI-386] Refactor hudi scala 
checkstyle rules
URL: https://github.com/apache/incubator-hudi/pull/1099
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Check scala style rules one by one, after that Refactor hudi scala 
checkstyle rules
   
   ## Brief change log
   
 - Refactor hudi scala checkstyle rules, keep the level error which rule 
can not pass.
   
   ## Verify this pull request
   
   This pull request is code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua closed pull request #1057: Hudi Test Suite

2019-12-12 Thread GitBox
yanghua closed pull request #1057: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/1057
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite created (now 4ac72f4)

2019-12-12 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


  at 4ac72f4  Rename module name from hudi-bench to hudi-end-to-end-tests

This branch includes the following new commits:

 new 4ac72f4  Rename module name from hudi-bench to hudi-end-to-end-tests

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[incubator-hudi] 01/01: Rename module name from hudi-bench to hudi-end-to-end-tests

2019-12-12 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch hudi_test_suite
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

commit 4ac72f4f62800725a469b9a8bc8110acb0b18fb1
Author: yanghua 
AuthorDate: Tue Dec 3 19:44:15 2019 +0800

Rename module name from hudi-bench to hudi-end-to-end-tests
---
 docker/hoodie/hadoop/hive_base/Dockerfile  |  2 +-
 docker/hoodie/hadoop/hive_base/pom.xml |  4 ++--
 {hudi-bench => hudi-end-to-end-tests}/pom.xml  |  2 +-
 .../prepare_integration_suite.sh   |  4 ++--
 .../apache/hudi/e2e}/DFSDeltaWriterAdapter.java|  6 +++---
 .../apache/hudi/e2e}/DFSSparkAvroDeltaWriter.java  |  8 
 .../org/apache/hudi/e2e}/DeltaInputFormat.java |  2 +-
 .../java/org/apache/hudi/e2e}/DeltaOutputType.java |  2 +-
 .../org/apache/hudi/e2e}/DeltaWriterAdapter.java   |  4 ++--
 .../org/apache/hudi/e2e}/DeltaWriterFactory.java   | 10 -
 .../hudi/e2e}/configuration/DFSDeltaConfig.java|  6 +++---
 .../hudi/e2e}/configuration/DeltaConfig.java   |  6 +++---
 .../hudi/e2e}/converter/UpdateConverter.java   |  6 +++---
 .../java/org/apache/hudi/e2e}/dag/DagUtils.java|  9 
 .../org/apache/hudi/e2e}/dag/ExecutionContext.java |  9 
 .../java/org/apache/hudi/e2e}/dag/WorkflowDag.java |  4 ++--
 .../apache/hudi/e2e}/dag/WorkflowDagGenerator.java | 12 +--
 .../apache/hudi/e2e}/dag/nodes/BulkInsertNode.java |  6 +++---
 .../org/apache/hudi/e2e}/dag/nodes/CleanNode.java  |  4 ++--
 .../apache/hudi/e2e}/dag/nodes/CompactNode.java|  6 +++---
 .../org/apache/hudi/e2e}/dag/nodes/DagNode.java|  6 +++---
 .../apache/hudi/e2e}/dag/nodes/HiveQueryNode.java  |  8 
 .../apache/hudi/e2e}/dag/nodes/HiveSyncNode.java   |  8 
 .../org/apache/hudi/e2e}/dag/nodes/InsertNode.java | 10 -
 .../apache/hudi/e2e}/dag/nodes/RollbackNode.java   |  6 +++---
 .../hudi/e2e}/dag/nodes/ScheduleCompactNode.java   |  6 +++---
 .../hudi/e2e}/dag/nodes/SparkSQLQueryNode.java |  8 
 .../org/apache/hudi/e2e}/dag/nodes/UpsertNode.java |  8 
 .../apache/hudi/e2e}/dag/nodes/ValidateNode.java   |  6 +++---
 .../hudi/e2e}/dag/scheduler/DagScheduler.java  | 12 +--
 .../apache/hudi/e2e}/generator/DeltaGenerator.java | 24 +++---
 .../FlexibleSchemaRecordGenerationIterator.java|  2 +-
 .../GenericRecordFullPayloadGenerator.java |  2 +-
 .../GenericRecordFullPayloadSizeEstimator.java |  2 +-
 .../GenericRecordPartialPayloadGenerator.java  |  2 +-
 .../generator/LazyRecordGeneratorIterator.java |  2 +-
 .../e2e}/generator/UpdateGeneratorIterator.java|  2 +-
 .../e2e}/helpers/DFSTestSuitePathSelector.java |  2 +-
 .../hudi/e2e}/helpers/HiveServiceProvider.java |  6 +++---
 .../hudi/e2e}/job/HoodieDeltaStreamerWrapper.java  |  2 +-
 .../apache/hudi/e2e}/job/HoodieTestSuiteJob.java   | 24 +++---
 .../hudi/e2e}/reader/DFSAvroDeltaInputReader.java  |  8 
 .../hudi/e2e}/reader/DFSDeltaInputReader.java  |  2 +-
 .../e2e}/reader/DFSHoodieDatasetInputReader.java   |  2 +-
 .../e2e}/reader/DFSParquetDeltaInputReader.java|  6 +++---
 .../apache/hudi/e2e}/reader/DeltaInputReader.java  |  2 +-
 .../apache/hudi/e2e}/reader/SparkBasedReader.java  |  2 +-
 .../hudi/e2e}/writer/AvroDeltaInputWriter.java |  2 +-
 .../apache/hudi/e2e}/writer/DeltaInputWriter.java  |  2 +-
 .../org/apache/hudi/e2e}/writer/DeltaWriter.java   |  6 +++---
 .../hudi/e2e}/writer/FileDeltaInputWriter.java |  2 +-
 .../e2e}/writer/SparkAvroDeltaInputWriter.java |  2 +-
 .../org/apache/hudi/e2e}/writer/WriteStats.java|  2 +-
 .../hudi/e2e}/TestDFSDeltaWriterAdapter.java   | 16 +++
 .../apache/hudi/e2e}/TestFileDeltaInputWriter.java | 14 ++---
 .../e2e}/configuration/TestWorkflowBuilder.java| 12 +--
 .../hudi/e2e}/converter/TestUpdateConverter.java   |  4 ++--
 .../org/apache/hudi/e2e}/dag/TestComplexDag.java   | 14 +++--
 .../org/apache/hudi/e2e}/dag/TestDagUtils.java | 13 +++-
 .../org/apache/hudi/e2e}/dag/TestHiveSyncDag.java  | 14 +++--
 .../apache/hudi/e2e}/dag/TestInsertOnlyDag.java| 10 +
 .../apache/hudi/e2e}/dag/TestInsertUpsertDag.java  | 12 ++-
 .../TestGenericRecordPayloadEstimator.java |  7 ---
 .../TestGenericRecordPayloadGenerator.java | 16 ---
 .../hudi/e2e}/generator/TestWorkloadGenerator.java | 22 ++--
 .../hudi/e2e}/job/TestHoodieTestSuiteJob.java  | 22 ++--
 .../e2e}/reader/TestDFSAvroDeltaInputReader.java   |  4 ++--
 .../reader/TestDFSHoodieDatasetInputReader.java|  5 +++--
 .../java/org/apache/hudi/e2e}/utils/TestUtils.java |  2 +-
 .../apache/hudi/e2e}/writer/TestDeltaWriter.java   |  7 ---
 .../resources/hudi-bench-config/base.properties|  0
 .../hudi-bench-config/complex-source.avsc 

[incubator-hudi] branch hudi_test_suite_refactor created (now c2c9347)

2019-12-12 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


  at c2c9347  [HUDI-394] Provide a basic implementation of test suite

No new revisions were added by this update.



[jira] [Updated] (HUDI-233) Redo log statements using SLF4J

2019-12-12 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-233:

Component/s: (was: Performance)

> Redo log statements using SLF4J 
> 
>
> Key: HUDI-233
> URL: https://issues.apache.org/jira/browse/HUDI-233
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> Currently we are not employing variable substitution aggressively in the 
> project, à la 
> {code:java}
> LogManager.getLogger(SomeName.class.getName()).info("Message: {}, Detail: 
> {}", message, detail);
> {code}
> This can improve performance, since the string concatenation can be deferred 
> until the logging is actually in effect.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-399) Fix conflicts of test suite basic implementation based on master branch and fix checkstyle issues

2019-12-12 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-399:
--
Summary: Fix conflicts of test suite basic implementation based on master 
branch and fix checkstyle issues  (was: Fix conflicts of test suite basic 
implementation based on master branch)

> Fix conflicts of test suite basic implementation based on master branch and 
> fix checkstyle issues
> -
>
> Key: HUDI-399
> URL: https://issues.apache.org/jira/browse/HUDI-399
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> There are many conflicts when rebasing on the master branch; we need to fix 
> them:
>  
> {code:java}
> hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
> hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
> hudi-hive/src/test/java/org/apache/hudi/hive/util/HiveTestService.java
> hudi-spark/src/main/java/org/apache/hudi/ComplexKeyGenerator.java
> hudi-spark/src/main/java/org/apache/hudi/EmptyHoodieRecordPayload.java
> hudi-spark/src/main/java/org/apache/hudi/KeyGenerator.java
> hudi-spark/src/main/java/org/apache/hudi/NonpartitionedKeyGenerator.java
> hudi-spark/src/main/java/org/apache/hudi/SimpleKeyGenerator.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerMetrics.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/keygen/TimestampBasedKeyGenerator.java
> hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2019-12-12 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-403:
---

 Summary: Publish a deployment guide talking about deployment 
options, upgrading etc
 Key: HUDI-403
 URL: https://issues.apache.org/jira/browse/HUDI-403
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: Docs
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.5.1


Things to cover 
 # Upgrade readers first, upgrade writers next; principles of compatibility 
followed
 # DeltaStreamer Deployment models
 # Scheduling Compactions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1098: [HUDI-401]Remove unnecessary use of spark in savepoint timeline

2019-12-12 Thread GitBox
lamber-ken commented on issue #1098: [HUDI-401]Remove unnecessary use of spark 
in savepoint timeline
URL: https://github.com/apache/incubator-hudi/pull/1098#issuecomment-564895376
 
 
   +1, from my side.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services