[jira] [Created] (HUDI-378) Refactor the rest codes based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)
lamber-ken created HUDI-378:
---

 Summary: Refactor the rest codes based on new ImportOrder code 
style rule
 Key: HUDI-378
 URL: https://issues.apache.org/jira/browse/HUDI-378
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: lamber-ken


Refactor the rest of the code based on the new ImportOrder code style rule and
set the severity level to error.
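
For context, the grouping the new ImportOrder rule enforces can be inferred
from the hudi-cli diffs later in this digest: org.apache.hudi imports first,
third-party imports next, and java/javax imports last, with a blank line
between groups. A minimal, illustrative Java sketch of that ordering (not an
excerpt from the codebase):

```java
package org.apache.hudi.example; // illustrative file, not in the repo

import org.apache.hudi.common.util.Option;   // 1) org.apache.hudi.* first

import org.apache.hadoop.conf.Configuration; // 2) third-party next

import java.io.IOException;                  // 3) java.* last
import java.util.List;

public class ImportOrderExample {
  Option<List<Configuration>> load() throws IOException {
    return Option.empty();
  }
}
```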





[jira] [Assigned] (HUDI-378) Refactor the rest codes based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-378:
---

Assignee: lamber-ken

> Refactor the rest codes based on new ImportOrder code style rule
> 
>
> Key: HUDI-378
> URL: https://issues.apache.org/jira/browse/HUDI-378
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Refactor the rest of the code based on the new ImportOrder code style rule
> and set the severity level to error.





[incubator-hudi] branch master updated: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule (#1076)

2019-12-03 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b2d9638  [HUDI-365] Refactor hudi-cli based on new ImportOrder code 
style rule (#1076)
b2d9638 is described below

commit b2d9638bea43f6b0a9c3412362dc4cb08e214957
Author: Gurudatt Kulkarni 
AuthorDate: Wed Dec 4 12:40:40 2019 +0530

[HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule 
(#1076)
---
 .../main/java/org/apache/hudi/cli/HoodieCLI.java   |  8 +++---
 .../org/apache/hudi/cli/HoodiePrintHelper.java |  5 +++-
 .../src/main/java/org/apache/hudi/cli/Main.java|  3 ++-
 .../src/main/java/org/apache/hudi/cli/Table.java   |  3 ++-
 .../hudi/cli/commands/ArchivedCommitsCommand.java  | 22 ---
 .../apache/hudi/cli/commands/CleansCommand.java| 14 +-
 .../apache/hudi/cli/commands/CommitsCommand.java   | 16 ++-
 .../hudi/cli/commands/CompactionCommand.java   | 28 ++-
 .../apache/hudi/cli/commands/DatasetsCommand.java  | 10 ---
 .../hudi/cli/commands/FileSystemViewCommand.java   | 28 ++-
 .../cli/commands/HDFSParquetImportCommand.java |  2 ++
 .../hudi/cli/commands/HoodieLogFileCommand.java| 31 --
 .../hudi/cli/commands/HoodieSyncCommand.java   |  6 +++--
 .../apache/hudi/cli/commands/RepairsCommand.java   |  8 +++---
 .../apache/hudi/cli/commands/RollbacksCommand.java | 18 +++--
 .../hudi/cli/commands/SavepointsCommand.java   |  8 +++---
 .../org/apache/hudi/cli/commands/SparkMain.java|  1 +
 .../org/apache/hudi/cli/commands/StatsCommand.java | 30 +++--
 .../java/org/apache/hudi/cli/utils/CommitUtil.java |  5 ++--
 .../java/org/apache/hudi/cli/utils/HiveUtil.java   |  6 +++--
 .../java/org/apache/hudi/cli/utils/SparkUtil.java  |  6 +++--
 21 files changed, 149 insertions(+), 109 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieCLI.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieCLI.java
index 0dafdc4..1b2dd86 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieCLI.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieCLI.java
@@ -18,13 +18,15 @@
 
 package org.apache.hudi.cli;
 
-import java.io.IOException;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.FileSystem;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.util.ConsistencyGuardConfig;
 import org.apache.hudi.common.util.FSUtils;
 
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+
+import java.io.IOException;
+
 /**
  * This class is responsible to load table metadata and hoodie related configs.
  */
diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
index 0e48911..5325432 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
@@ -18,11 +18,14 @@
 
 package org.apache.hudi.cli;
 
+import org.apache.hudi.common.util.Option;
+
 import com.jakewharton.fliptables.FlipTable;
+
 import java.util.List;
 import java.util.Map;
 import java.util.function.Function;
-import org.apache.hudi.common.util.Option;
+
 
 /**
  * Helper class to render table for hoodie-cli.
diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/Main.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/Main.java
index 99627b0..e924be9 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/Main.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/Main.java
@@ -18,9 +18,10 @@
 
 package org.apache.hudi.cli;
 
-import java.io.IOException;
 import org.springframework.shell.Bootstrap;
 
+import java.io.IOException;
+
 /**
  * Main class that delegates to Spring Shell's Bootstrap class in order to 
simplify debugging inside an IDE.
  */
diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/Table.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/Table.java
index 5a446e7..2efad37 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/Table.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/Table.java
@@ -18,6 +18,8 @@
 
 package org.apache.hudi.cli;
 
+import org.apache.hudi.common.util.Option;
+
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Comparator;
@@ -28,7 +30,6 @@ import java.util.function.Consumer;
 import java.util.function.Function;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
-import org.apache.hudi.common.util.Option;
 
 /**
  * Table to be rendered. This class takes care of ordering rows and limiting 
before renderer renders it.
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 

[GitHub] [incubator-hudi] yanghua merged pull request #1076: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread GitBox
yanghua merged pull request #1076: [HUDI-365] Refactor hudi-cli based on new 
ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1076
 
 
   




[GitHub] [incubator-hudi] bvaradar merged pull request #1049: [HUDI-355] Refactor hudi-common based on new comment and code style rules

2019-12-03 Thread GitBox
bvaradar merged pull request #1049: [HUDI-355] Refactor hudi-common based on 
new comment and code style rules
URL: https://github.com/apache/incubator-hudi/pull/1049
 
 
   




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #118

2019-12-03 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.22 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing:
Detected current version as: 'HUDI_home= 0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #1036: [HUDI-353] Add hive style partitioning path

2019-12-03 Thread GitBox
zhedoubushishi commented on a change in pull request #1036: [HUDI-353] Add hive 
style partitioning path
URL: https://github.com/apache/incubator-hudi/pull/1036#discussion_r353510140
 
 

 ##
 File path: hudi-hive/src/main/java/org/apache/hudi/hive/HiveSyncTool.java
 ##
 @@ -163,6 +163,11 @@ private void syncSchema(boolean tableExists, boolean 
isRealTime, MessageType sch
*/
   private void syncPartitions(List writtenPartitionsSince) {
 try {
+  if (cfg.useHiveStylePartitioning) {
+LOG.info("Sync partitions through MSCK Repair");
+hoodieHiveClient.syncPartitionsByMSCK();
 
 Review comment:
   I see. Then it seems we should not use "MSCK" here.




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1076: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread GitBox
lamber-ken edited a comment on issue #1076: [HUDI-365] Refactor hudi-cli based 
on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1076#issuecomment-561157290
 
 
    




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1077: Improvements to DiskbasedMap

2019-12-03 Thread GitBox
bvaradar commented on a change in pull request #1077: Improvements to 
DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#discussion_r353458484
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,344 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.common.util;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.util.Arrays;
+
+import org.apache.log4j.Logger;
+
+/**
+ * A BufferedRandomAccessFile is like a
+ * RandomAccessFile, but it uses a private buffer so that most
+ * operations do not require a disk access.
+ * 
+ *
+ * Note: The operations on this class are unmonitored. Also, the correct
 
 Review comment:
   @nbalajee : Can you update the LICENSE file in the repository and add a
section similar to
   
   """
   This product includes code from Apache SystemML.
   
   * org.apache.hudi.func.LazyIterableIterator  adapted from 
org/apache/sysml/runtime/instructions/spark/data/LazyIterableIterator
   
   Copyright: 2015-2018 The Apache Software Foundation
   Home page: https://systemml.apache.org/
   License: http://www.apache.org/licenses/LICENSE-2.0
   """




[GitHub] [incubator-hudi] nbalajee opened a new pull request #1077: Improvements to DiskbasedMap

2019-12-03 Thread GitBox
nbalajee opened a new pull request #1077: Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077
 
 
   ## What is the purpose of the pull request
   DiskBasedMap is used by ExternalSpillableMap for writing (K,V) pairs to a
file while keeping only (K, fileMetadata) in memory, to reduce the in-memory
footprint of each record.

   This change improves the performance of record get/read (random and
sequential reads) and put/write (to/from disk) operations by introducing a
data buffer/cache.
   
 Before the performance improvement:
   RecordsHandled: 1    totalTestTime: 3145    writeTime: 1176    readTime: 255
   RecordsHandled: 5    totalTestTime: 5775    writeTime: 4187    readTime: 1175
   RecordsHandled: 10   totalTestTime: 10570   writeTime: 7718    readTime: 2203
   RecordsHandled: 50   totalTestTime: 59723   writeTime: 45618   readTime: 11093
   RecordsHandled: 100  totalTestTime: 120022  writeTime: 87918   readTime: 22355
   RecordsHandled: 200  totalTestTime: 258627  writeTime: 187185  readTime: 56431

 After the improvement:
   RecordsHandled: 1    totalTestTime: 1551   writeTime: 531   seqReadTime: 122   randReadTime: 125
   RecordsHandled: 5    totalTestTime: 1371   writeTime: 420   seqReadTime: 179   randReadTime: 250
   RecordsHandled: 10   totalTestTime: 1895   writeTime: 535   seqReadTime: 181   randReadTime: 512
   RecordsHandled: 50   totalTestTime: 8838   writeTime: 2031  seqReadTime: 1128  randReadTime: 2580
   RecordsHandled: 100  totalTestTime: 16147  writeTime: 4059  seqReadTime: 1634  randReadTime: 5293
   RecordsHandled: 200  totalTestTime: 34090  writeTime: 8337  seqReadTime: 3163  randReadTime: 10694
   
   
   ## Brief change log

   - Using BufferedRandomAccessFile instead of RandomAccessFile in the read path.
   - Using BufferedOutputStream in the write path (see the sketch below).
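
   A minimal sketch of the buffering idea described above, with illustrative
names (this is not the PR's BufferedRandomAccessFile; it only shows how
buffering both paths avoids a disk access per operation):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch: buffer the write path so each put() lands in memory and
// is flushed in bulk; the read path seeks into the same file by offset.
public class BufferedFileStoreSketch implements AutoCloseable {
  private final BufferedOutputStream out; // write path (the PR uses BufferedOutputStream)
  private final RandomAccessFile in;      // read path (the PR wraps this in BufferedRandomAccessFile)
  private long writeOffset = 0;

  public BufferedFileStoreSketch(String path) throws IOException {
    this.out = new BufferedOutputStream(new FileOutputStream(path), 64 * 1024);
    this.in = new RandomAccessFile(path, "r");
  }

  // Returns the offset of the value; the caller keeps (key -> offset, length)
  // in memory, which is the DiskBasedMap idea.
  public long put(byte[] value) throws IOException {
    long offset = writeOffset;
    out.write(value);            // buffered: no disk access per call
    writeOffset += value.length;
    return offset;
  }

  public byte[] get(long offset, int length) throws IOException {
    out.flush();                 // make pending buffered writes visible
    byte[] buf = new byte[length];
    in.seek(offset);
    in.readFully(buf);
    return buf;
  }

  @Override
  public void close() throws IOException {
    out.close();
    in.close();
  }
}
```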
   
   ## Verify this pull request

   This pull request is already covered by existing tests, such as
   TestDiskBasedMap:testSimpleInsert.

   ## Committer checklist

 - [x] Has a corresponding JIRA in PR title & commit:
   https://issues.apache.org/jira/browse/HUDI-335

 - [x] Commit message is descriptive of the change

 - [x] CI is green

 - [ ] Necessary doc changes done or have another open PR

 - [ ] For large changes, please consider breaking it into sub-tasks under
   an umbrella JIRA.




[jira] [Closed] (HUDI-294) Delete Paths written in Cleaner plan needs to be relative to partition-path

2019-12-03 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-294.
--
Resolution: Fixed

Fixed via master: 98ab33bb6e1637d18c8fab7a9ddd50daeaf56962

> Delete Paths written in Cleaner plan needs to be relative to partition-path
> ---
>
> Key: HUDI-294
> URL: https://issues.apache.org/jira/browse/HUDI-294
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The deleted file paths stored in Clean metadata are all absolute. They need
> to be changed to relative paths.
> The challenge would be to handle cases when both versions of cleaner metadata
> are present and need to be processed (backwards compatibility).





[GitHub] [incubator-hudi] pushpavanthar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-12-03 Thread GitBox
pushpavanthar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-561331620
 
 
   @vinothchandar Thanks.
   It would be great if we document this in a design doc and proceed from there.
I've been using the JDBC incremental puller approach as one of the sources to
Apache HUDI at work. I'm very excited about this feature.
   In my opinion, the user shouldn't need to be aware of the query except in
special cases (very rare). All incremental data pulls follow the same query
template, into which checkpointed values are substituted.
   However, I would like to understand where this processor of HUDI maintains
its checkpointing/state data. If it is the filesystem, are we going to provide
this filesystem path as config? Or is it an external state store?
   If you can redirect me to the doc for this feature, I would like to add my
thoughts to it.






[GitHub] [incubator-hudi] pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-12-03 Thread GitBox
pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load 
to HUDI DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-559230742
 
 
   Hi @vinothchandar and @taherk77
   I would like to add 2 points to make this feature more generic.

   - [x] We might need support for a combination of incrementing columns.
Incrementing columns can be of the types below:
   1. Timestamp columns
   2. Auto-incrementing column
   3. Timestamp + auto-incrementing.
   Instead of the code figuring out the incremental pull strategy, it would be
better if the user provides it as config for each table.
   Considering timestamp incrementing columns, there can be more than one
column contributing to this strategy. E.g. when a row is created, only the
`created_at` column is set and `updated_at` is null by default. When the same
row is updated, `updated_at` gets assigned some timestamp. In such cases it
is wise to consider both columns in the query formation.

   - [x] We need to sort rows according to the above-mentioned incrementing
columns to fetch rows in chunks (you can make use of `defaultFetchSize` in
MySQL). I'm aware that sorting adds load on the database, but it helps in
tracking the last pulled timestamp or auto-incrementing id and helps
retry/resume from the point last recorded. This will be a saviour during
failures.

   A sample MySQL query for incrementing timestamp columns (`created_at` and
`updated_at`) might look like:
   `SELECT * FROM inventory.customers WHERE
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) >
$last_recorded_time AND
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) <
$current_time ORDER BY
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) ASC`




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
nsivabalan commented on a change in pull request #1073: [WIP] [HUDI-377] Adding 
Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353349031
 
 

 ##
 File path: 
hudi-spark/src/main/java/org/apache/hudi/OverwriteWithLatestAvroPayload.java
 ##
 @@ -60,7 +60,18 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
   @Override
   public Option combineAndGetUpdateValue(IndexedRecord 
currentValue, Schema schema) throws IOException {
 // combining strategy here trivially ignores currentValue on disk and 
writes this record
-return getInsertValue(schema);
+Object deleteMarker = null;
 
 Review comment:
   Yes. OverwriteWithLatestAvroPayload is configurable, right? In case some
user changes it to something else, deletes in DeltaStreamer may not work as
expected.
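
   For readers skimming the thread, a hedged sketch of the delete-marker
pattern the quoted hunk hints at (the marker field name `_hoodie_is_deleted`
is an assumption here, not confirmed by the diff):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.util.Option;

// Illustrative sketch: a payload signals a delete by carrying a boolean marker
// field; returning Option.empty() tells the writer to drop the key.
final class DeleteMarkerSketch {
  static Option<IndexedRecord> combine(GenericRecord incoming) {
    Object deleteMarker = incoming.get("_hoodie_is_deleted"); // assumed field name
    if (deleteMarker instanceof Boolean && (Boolean) deleteMarker) {
      return Option.empty(); // delete: no value written for this key
    }
    return Option.of(incoming); // normal upsert path
  }
}
```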




[incubator-hudi] branch master updated (845e261 -> 98ab33b)

2019-12-03 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 845e261  [MINOR] Update some urls from http to https in the README 
file (#1074)
 add 98ab33b  [HUDI-294] Delete Paths written in Cleaner plan needs to be 
relative to partition-path (#1062)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/HoodieCleanClient.java|   5 +-
 .../src/test/java/org/apache/hudi/TestCleaner.java | 102 +
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |  12 +--
 .../IncrementalTimelineSyncFileSystemView.java |   9 +-
 .../org/apache/hudi/common/util/AvroUtils.java |  22 -
 .../org/apache/hudi/common/util/CleanerUtils.java  |  60 
 .../CleanMetadataMigrator.java}|  18 ++--
 .../versioning/clean/CleanV1MigrationHandler.java  | 102 +
 .../versioning/clean/CleanV2MigrationHandler.java  |  98 
 .../apache/hudi/common/model/HoodieTestUtils.java  |  12 ++-
 .../table/view/TestIncrementalFSViewSync.java  |   4 +-
 11 files changed, 399 insertions(+), 45 deletions(-)
 create mode 100644 
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java
 copy 
hudi-common/src/main/java/org/apache/hudi/common/versioning/{compaction/CompactionPlanMigrator.java
 => clean/CleanMetadataMigrator.java} (68%)
 create mode 100644 
hudi-common/src/main/java/org/apache/hudi/common/versioning/clean/CleanV1MigrationHandler.java
 create mode 100644 
hudi-common/src/main/java/org/apache/hudi/common/versioning/clean/CleanV2MigrationHandler.java



[GitHub] [incubator-hudi] bvaradar merged pull request #1062: [HUDI-294] Delete Paths written in Cleaner plan needs to be relative to partition-path

2019-12-03 Thread GitBox
bvaradar merged pull request #1062: [HUDI-294] Delete Paths written in Cleaner 
plan needs to be relative to partition-path
URL: https://github.com/apache/incubator-hudi/pull/1062


[GitHub] [incubator-hudi] vinothchandar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-561213523
 
 
   @pushpavanthar Great suggestion.

   Let me see if we can structure this solution more. Just supporting raw SQL
as input for extracting the data, with the hoodie checkpoint simply being a
list of string replaces in a template SQL, could provide a lot of flexibility.

   Taking the same example from above, the user specifies the following SQL
(we can blog and document this well):
   
   ```
   hoodie.datasource.jdbc.sql=SELECT 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) as 
created_updated_at, inventory.customers.user_id as user_id, * FROM 
inventory.customers WHERE created_updated_at > ${1} AND created_updated_at < 
${1} AND user_id  > ${2}  ORDER BY created_updated_at ASC
   hoodie.datasource.jdbc.incremental.column.names=created_updated_at, user_id
   hoodie.datasource.jdbc.incremental.column.funcs=max, min
   hoodie.datasource.jdbc.bulkload.sql= 0 etc >
   ```
   
   The hoodie checkpoint is a list of string values, one for each of the
incremental column names, e.g. `2019113048384, 1001` (a timestamp and a
user_id). We simply replace `${1}` with 2019113048384 and `${2}` with the
user_id (the second checkpoint value), execute the SQL, and then use the
column funcs to derive the next checkpoint values off the fetched data set.
I would prefer to keep this computation out of the database and in Spark
(for the same reasons of avoiding more load on the database).
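
   A tiny sketch of that substitution step, with illustrative names (not an
actual Hudi API):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: bind stored checkpoint values into the template SQL by
// replacing ${1}, ${2}, ... positionally, as described above.
final class SqlTemplateSketch {
  static String bind(String templateSql, List<String> checkpointValues) {
    String sql = templateSql;
    for (int i = 0; i < checkpointValues.size(); i++) {
      sql = sql.replace("${" + (i + 1) + "}", checkpointValues.get(i));
    }
    return sql;
  }

  public static void main(String[] args) {
    String template = "SELECT * FROM t WHERE updated_at > ${1} AND user_id > ${2}";
    // e.g. checkpoint "2019113048384, 1001" from the example above
    System.out.println(bind(template, Arrays.asList("2019113048384", "1001")));
  }
}
```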
   
   All this said, I want to get a basic version working and checked in :) 
first. 
   @taherk77 where are we at for this PR atm? Are you actively working on this? 
   
   
   
   
   
   
   
   




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1068: [WIP] Parquet/HFile Log writer and reader

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1068: [WIP] Parquet/HFile 
Log writer and reader
URL: https://github.com/apache/incubator-hudi/pull/1068#discussion_r353220774
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/logger/InLineReader.java
 ##
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.logger;
+
+import java.io.IOException;
+import java.util.List;
+
+public interface InLineReader {
 
 Review comment:
   Not sure I follow why we need to define this interface for both
reading/writing.. there is nothing special here, right?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1068: [WIP] Parquet/HFile Log writer and reader

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1068: [WIP] Parquet/HFile 
Log writer and reader
URL: https://github.com/apache/incubator-hudi/pull/1068#discussion_r353220921
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/logger/InLineReader.java
 ##
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.logger;
+
+import java.io.IOException;
+import java.util.List;
+
+public interface InLineReader {
 
 Review comment:
   is it more for your test code? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1068: [WIP] Parquet/HFile Log writer and reader

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1068: [WIP] Parquet/HFile 
Log writer and reader
URL: https://github.com/apache/incubator-hudi/pull/1068#discussion_r353222499
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/logger/InLineHFileReader.java
 ##
 @@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.Pair;
+
+public class InLineHFileReader implements InLineReader> {
+
+  HFileScanner scanner;
+
+  InLineHFileReader(Path path, Configuration conf) throws IOException {
+FileSystem fs = path.getFileSystem(conf);
+HFile.Reader reader = HFile.createReader(fs, path, new CacheConfig(conf), 
conf);
+scanner = reader.getScanner(true, true);
+scanner.seekTo();
+  }
+
+  @Override
+  public Pair read() throws IOException {
+if (scanner.next()) {
 
 Review comment:
   can you also try a point lookup and benchmark it locally? (write a large 1GB 
HFile and read out some keys).. 
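
   A hedged sketch of such a point-lookup benchmark, reusing the HBase 1.x
HFile calls from the quoted reader (the family/qualifier bytes are placeholders
and the exact scanner methods are assumed for that HBase version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative benchmark: open an existing HFile and time a single key lookup.
public class HFilePointLookupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);                        // path to an existing HFile
    FileSystem fs = path.getFileSystem(conf);
    HFile.Reader reader = HFile.createReader(fs, path, new CacheConfig(conf), conf);
    HFileScanner scanner = reader.getScanner(true, true); // cacheBlocks, positional read

    byte[] row = Bytes.toBytes(args[1]);                  // key to look up
    long start = System.nanoTime();
    // seekTo returns 0 on an exact match, non-zero otherwise
    int res = scanner.seekTo(new KeyValue(row, Bytes.toBytes("f"), Bytes.toBytes("q")));
    long micros = (System.nanoTime() - start) / 1_000;
    if (res == 0) {
      System.out.println("hit in " + micros + "us, value size: " + scanner.getValue().remaining());
    } else {
      System.out.println("miss (seekTo returned " + res + ") in " + micros + "us");
    }
    reader.close();
  }
}
```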




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353202994
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/LocalDynamicBloomFilter.java
 ##
 @@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import org.apache.hadoop.util.bloom.BloomFilter;
+import org.apache.hadoop.util.bloom.Key;
+
+/**
+ * Hoodie's Local dynamic Bloom Filter
+ */
+class LocalDynamicBloomFilter extends LocalFilter {
+
+  /**
+   * Threshold for the maximum number of key to record in a dynamic Bloom 
filter row.
+   */
+  private int nr;
+
+  /**
+   * The number of keys recorded in the current standard active Bloom filter.
+   */
+  private int currentNbRecord;
+  private int maxNr;
+  private boolean reachedMax = false;
+  private int curMatrixIndex = 0;
+
+  /**
+   * The matrix of Bloom filter.
+   */
+  private org.apache.hadoop.util.bloom.BloomFilter[] matrix;
+
+  /**
+   * Zero-args constructor for the serialization.
+   */
+  public LocalDynamicBloomFilter() {
+  }
+
+  /**
+   * Constructor.
+   * 
+   * Builds an empty Dynamic Bloom filter.
+   *
+   * @param vectorSize The number of bits in the vector.
+   * @param nbHash The number of hash function to consider.
+   * @param hashType type of the hashing function (see {@link 
org.apache.hadoop.util.hash.Hash}).
+   * @param nr The threshold for the maximum number of keys to record in a 
dynamic Bloom filter row.
+   */
+  public LocalDynamicBloomFilter(int vectorSize, int nbHash, int hashType, int 
nr, int maxNr) {
+super(vectorSize, nbHash, hashType);
+
+this.nr = nr;
+this.currentNbRecord = 0;
+this.maxNr = maxNr;
+
+matrix = new org.apache.hadoop.util.bloom.BloomFilter[1];
+matrix[0] = new org.apache.hadoop.util.bloom.BloomFilter(this.vectorSize, 
this.nbHash, this.hashType);
+  }
+
+  @Override
+  public void add(Key key) {
+if (key == null) {
+  throw new NullPointerException("Key can not be null");
+}
+
+org.apache.hadoop.util.bloom.BloomFilter bf = getActiveStandardBF();
+
+if (bf == null) {
+  addRow();
+  bf = matrix[matrix.length - 1];
+  currentNbRecord = 0;
+}
+
+bf.add(key);
+
+currentNbRecord++;
+  }
+
+  @Override
+  public void and(LocalFilter filter) {
+if (filter == null
+|| !(filter instanceof LocalDynamicBloomFilter)
+|| filter.vectorSize != this.vectorSize
+|| filter.nbHash != this.nbHash) {
+  throw new IllegalArgumentException("filters cannot be and-ed");
+}
+
+LocalDynamicBloomFilter dbf = (LocalDynamicBloomFilter) filter;
+
+if (dbf.matrix.length != this.matrix.length || dbf.nr != this.nr) {
+  throw new IllegalArgumentException("filters cannot be and-ed");
+}
+
+for (int i = 0; i < matrix.length; i++) {
+  matrix[i].and(dbf.matrix[i]);
+}
+  }
+
+  @Override
+  public boolean membershipTest(Key key) {
+if (key == null) {
+  return true;
+}
+
+for (int i = 0; i < matrix.length; i++) {
+  if (matrix[i].membershipTest(key)) {
+return true;
+  }
+}
+
+return false;
+  }
+
+  @Override
+  public void not() {
+for (int i = 0; i < matrix.length; i++) {
+  matrix[i].not();
+}
+  }
+
+  @Override
+  public void or(LocalFilter filter) {
+if (filter == null
+|| !(filter instanceof LocalDynamicBloomFilter)
+|| filter.vectorSize != this.vectorSize
+|| filter.nbHash != this.nbHash) {
+  throw new IllegalArgumentException("filters cannot be or-ed");
+}
+
+LocalDynamicBloomFilter dbf = (LocalDynamicBloomFilter) filter;
+
+if (dbf.matrix.length != this.matrix.length || dbf.nr != this.nr) {
+  throw new IllegalArgumentException("filters cannot be or-ed");
+}
+for (int i = 0; i < matrix.length; i++) {
+  matrix[i].or(dbf.matrix[i]);
+}
+  }
+
+  @Override
+  public void xor(LocalFilter filter) {
+if (filter == null
+|| !(filter 

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353201825
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
 ##
 @@ -141,15 +143,26 @@ public static Schema readAvroSchema(Configuration 
configuration, Path parquetFil
* Read out the bloom filter from the parquet file meta data.
*/
   public static BloomFilter readBloomFilterFromParquetMetadata(Configuration 
configuration, Path parquetFilePath) {
-Map footerVals = readParquetFooter(configuration, false, 
parquetFilePath,
-HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
-HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
+Map footerVals =
+readParquetFooter(configuration, false, parquetFilePath,
+HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
+HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
+HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE);
 String footerVal = 
footerVals.get(HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
 if (null == footerVal) {
   // We use old style key "com.uber.hoodie.bloomfilter"
   footerVal = 
footerVals.get(HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
 }
-return footerVal != null ? new BloomFilter(footerVal) : null;
+BloomFilter toReturn = null;
+if (footerVal != null) {
+  if 
(footerVals.containsKey(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE)) {
 
 Review comment:
   add a test around this? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353197790
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/HoodieDynamicBoundedBloomFilter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import javax.xml.bind.DatatypeConverter;
+import org.apache.hadoop.util.bloom.Key;
+import org.apache.hudi.exception.HoodieIndexException;
+
+/**
+ * Hoodie's dynamic bloom bounded bloom filter
 
 Review comment:
   good to credit the source always... Add a line such as `based largely on 
Hadoop's DynamicBloomFilter, but with a bound on amount of entries to 
dynamically expand to` 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353194946
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##
 @@ -51,6 +52,11 @@
   // TODO: On by default. Once stable, we will remove the other mode.
   public static final String BLOOM_INDEX_BUCKETIZED_CHECKING_PROP = 
"hoodie.bloom.index.bucketized.checking";
   public static final String DEFAULT_BLOOM_INDEX_BUCKETIZED_CHECKING = "true";
+  public static final String BLOOM_INDEX_FILTER_TYPE_PROP = 
"hoodie.bloom.index.filter.type.prop";
 
 Review comment:
   remove `.prop` from property name 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353196810
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -346,6 +347,18 @@ public int getHBaseIndexDesiredPutsTime() {
 return 
Integer.valueOf(props.getProperty(HoodieHBaseIndexConfig.HOODIE_INDEX_DESIRED_PUTS_TIME_IN_SECS));
   }
 
+  public boolean enableAutoTuneBloomFilter() {
 
 Review comment:
   the filter type is not a boolean, right?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353198779
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/HoodieDynamicBoundedBloomFilter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import javax.xml.bind.DatatypeConverter;
+import org.apache.hadoop.util.bloom.Key;
+import org.apache.hudi.exception.HoodieIndexException;
+
+/**
+ * Hoodie's dynamic bloom bounded bloom filter
+ */
+public class HoodieDynamicBoundedBloomFilter extends LocalDynamicBloomFilter 
implements BloomFilter {
+
+  public static final String TYPE_CODE_PREFIX = "DYNAMIC";
+  public static final String TYPE_CODE = TYPE_CODE_PREFIX + "_V0";
+  private LocalDynamicBloomFilter localDynamicBloomFilter;
 
 Review comment:
   why do we have to both extend and wrap `LocalDynamicBloomFilter` ? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353200854
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/LocalFilter.java
 ##
 @@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.List;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.util.bloom.HashFunction;
+import org.apache.hadoop.util.bloom.Key;
+import org.apache.hadoop.util.hash.Hash;
+
+abstract class LocalFilter implements Writable {
 
 Review comment:
   do we really need this abstract class as well? Did we make any changes to 
the Hadoop class? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353198094
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/HoodieDynamicBoundedBloomFilter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import javax.xml.bind.DatatypeConverter;
+import org.apache.hadoop.util.bloom.Key;
+import org.apache.hudi.exception.HoodieIndexException;
+
+/**
+ * Hoodie's dynamic bounded bloom filter.
+ */
+public class HoodieDynamicBoundedBloomFilter extends LocalDynamicBloomFilter 
implements BloomFilter {
+
+  public static final String TYPE_CODE_PREFIX = "DYNAMIC";
 
 Review comment:
   why not an ENUM for the type code? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353196338
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##
 @@ -51,6 +52,11 @@
   // TODO: On by default. Once stable, we will remove the other mode.
   public static final String BLOOM_INDEX_BUCKETIZED_CHECKING_PROP = 
"hoodie.bloom.index.bucketized.checking";
   public static final String DEFAULT_BLOOM_INDEX_BUCKETIZED_CHECKING = "true";
+  public static final String BLOOM_INDEX_FILTER_TYPE_PROP = 
"hoodie.bloom.index.filter.type.prop";
+  public static final String DEFAULT_BLOOM_INDEX_FILTER_TYPE_PROP = 
SimpleBloomFilter.TYPE_CODE;
 
 Review comment:
   we have a certain convention for naming a property and its default. Please 
follow the same to keep it consistent.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353195684
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##
 @@ -51,6 +52,11 @@
   // TODO: On by default. Once stable, we will remove the other mode.
   public static final String BLOOM_INDEX_BUCKETIZED_CHECKING_PROP = 
"hoodie.bloom.index.bucketized.checking";
   public static final String DEFAULT_BLOOM_INDEX_BUCKETIZED_CHECKING = "true";
+  public static final String BLOOM_INDEX_FILTER_TYPE_PROP = 
"hoodie.bloom.index.filter.type.prop";
+  public static final String DEFAULT_BLOOM_INDEX_FILTER_TYPE_PROP = 
SimpleBloomFilter.TYPE_CODE;
+  public static final String DYNAMIC_BLOOM_FILTER_MAX_ENTRIES = 
"hoodie.index.dynamic.bloom.max.entries";
 
 Review comment:
   consistent naming as `hoodie.bloom.index.filter.dynamic.max.entries`? (it's 
already too long :/) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r353205531
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bloom/filter/BloomFilterUtils.java
 ##
 @@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+/**
+ * Bloom filter utils
+ */
+class BloomFilterUtils {
+
+  /**
+   * Used in computing the optimal Bloom filter size. This approximately 
equals 0.480453.
+   */
+  private static final double LOG2_SQUARED = Math.log(2) * Math.log(2);
+
+  /**
+   * @return the bitsize given the total number of entries and error rate
+   */
+  static int getBitSize(int numEntries, double errorRate) {
 
 Review comment:
   orthogonal comment.. while we are this deep, we should also understand how 
correct this bitSizing is.. 
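   For reference when auditing that sizing: the textbook formulas are `m = -n * ln(p) / (ln 2)^2` bits and `k = (m / n) * ln 2` hash functions, with `(ln 2)^2 ≈ 0.480453` matching `LOG2_SQUARED` in the snippet above. A direct Java rendering of the textbook math (not necessarily what the PR computes):

```java
// Optimal bit size m for n entries at false-positive rate p:
//   m = -n * ln(p) / (ln 2)^2
static int optimalBitSize(int numEntries, double errorRate) {
  return (int) Math.ceil(-numEntries * Math.log(errorRate) / (Math.log(2) * Math.log(2)));
}

// Optimal number of hash functions: k = (m / n) * ln 2
static int optimalNumHashes(int bitSize, int numEntries) {
  return (int) Math.round(((double) bitSize / numEntries) * Math.log(2));
}
```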


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1076: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread GitBox
lamber-ken edited a comment on issue #1076: [HUDI-365] Refactor hudi-cli based 
on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1076#issuecomment-561157290
 
 
   Checked `mvn -pl hudi-cli checkstyle:check`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353190815
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/SourceFormatAdapter.java
 ##
 @@ -50,10 +50,6 @@ public SourceFormatAdapter(Source source) {
 
   /**
* Fetch new data in avro format. If the source provides data in different 
format, they are translated to Avro format
-   * 
-   * @param lastCkptStr
 
 Review comment:
   why these changes? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353187310
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -136,6 +151,9 @@ public static GenericRecord generateGenericRecord(String 
rowKey, String riderNam
 rec.put("end_lat", rand.nextDouble());
 rec.put("end_lon", rand.nextDouble());
 rec.put("fare", rand.nextDouble() * 100);
+if (isDeleteRecord) {
 
 Review comment:
   can we test with all other fields null, in case of deletes.. (ofc except 
_row_key and what you need to generate partitionpath) 
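   A hedged sketch of such a test record (the helper name is hypothetical, and nulling the non-key fields assumes the test schema makes them nullable):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Hypothetical variant of the generator for deletes: only the key fields
// carry values; the rest stay null.
static GenericRecord generateDeleteRecord(Schema avroSchema, String rowKey) {
  GenericRecord rec = new GenericData.Record(avroSchema);
  rec.put("_row_key", rowKey);               // keep the key, per the comment above
  rec.put("_hoodie_delete_marker", "true");  // typed ["null","string"] in this PR
  rec.put("rider", null);                    // serializing nulls needs nullable unions
  rec.put("driver", null);
  rec.put("fare", null);
  return rec;
}
```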


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353188392
 
 

 ##
 File path: 
hudi-spark/src/main/java/org/apache/hudi/OverwriteWithLatestAvroPayload.java
 ##
 @@ -60,7 +60,18 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
   @Override
   public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord 
currentValue, Schema schema) throws IOException {
 // combining strategy here trivially ignores currentValue on disk and 
writes this record
-return getInsertValue(schema);
+Object deleteMarker = null;
+try {
+  deleteMarker = DataSourceUtils.getNestedFieldVal(
+  genericRecord, "_hoodie_delete_marker");
+} catch (HoodieException e) {
 
 Review comment:
   is there a better way to detect "not found" than throwing an exception? This 
happens on the fast path, so it may not be ideal to do it this way.
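   One exception-free alternative, sketched for a top-level field (the PR's `DataSourceUtils.getNestedFieldVal` also handles nested paths, which this sketch does not):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// Probe the schema before reading: Schema.getField returns null when the
// field is absent, so the hot path never pays for a thrown-and-caught exception.
static boolean isDeleteRecord(GenericRecord record) {
  Schema.Field field = record.getSchema().getField("_hoodie_delete_marker");
  if (field == null) {
    return false;  // schema carries no marker field at all
  }
  Object marker = record.get(field.pos());
  return marker != null && Boolean.parseBoolean(marker.toString());
}
```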


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353190196
 
 

 ##
 File path: 
hudi-spark/src/main/java/org/apache/hudi/OverwriteWithLatestAvroPayload.java
 ##
 @@ -60,7 +60,18 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
   @Override
   public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord 
currentValue, Schema schema) throws IOException {
 // combining strategy here trivially ignores currentValue on disk and 
writes this record
-return getInsertValue(schema);
+Object deleteMarker = null;
 
 Review comment:
   as a default method? Do you mean having a base method there which you can 
override as needed? 
   I prefer to leave it to the payload implementation as it is now.. the API 
supports the ability to delete anyway


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353190622
 
 

 ##
 File path: 
hudi-spark/src/main/java/org/apache/hudi/OverwriteWithLatestAvroPayload.java
 ##
 @@ -60,7 +60,18 @@ public OverwriteWithLatestAvroPayload 
preCombine(OverwriteWithLatestAvroPayload
   @Override
   public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord 
currentValue, Schema schema) throws IOException {
 // combining strategy here trivially ignores currentValue on disk and 
writes this record
-return getInsertValue(schema);
+Object deleteMarker = null;
+try {
+  deleteMarker = DataSourceUtils.getNestedFieldVal(
+  genericRecord, "_hoodie_delete_marker");
+} catch (HoodieException e) {
+  // ignore if not found
+}
+if (deleteMarker != null && (boolean) deleteMarker == true) {
 
 Review comment:
   just `... (boolean) deleteMarker)` ? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353186790
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -77,9 +77,10 @@
   + "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
   + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
   + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
-  + "{\"name\":\"fare\",\"type\": \"double\"}]}";
+  + "{\"name\":\"fare\",\"type\": \"double\"},"
+  + "{\"name\": \"_hoodie_delete_marker\", \"type\": 
[\"null\",\"string\"], \"default\": null} ]}";
 
 Review comment:
   why string and not boolean? 
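   For comparison, the boolean-typed variant being asked about might read as follows, in the same inline-schema style (an illustration, not the PR's code):

```java
// Hypothetical boolean-typed marker field:
String deleteMarkerField =
    "{\"name\": \"_hoodie_delete_marker\", \"type\": [\"null\",\"boolean\"], \"default\": null}";
```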


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] Adding Delete() support to DeltaStreamer

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1073: [WIP] [HUDI-377] 
Adding Delete() support to DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1073#discussion_r353187976
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/BaseAvroPayload.java
 ##
 @@ -30,6 +30,8 @@
  */
 public abstract class BaseAvroPayload implements Serializable {
 
+  protected final GenericRecord genericRecord;
 
 Review comment:
   this is problematic.. Shuffling avro has its own issues.. that's why we only 
shuffle bytes and deserialize lazily.. Let's keep it that way and remove the 
member variable
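   The bytes-only pattern being referenced can be sketched with plain Avro APIs (a minimal sketch, not Hudi's actual serialization code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Serialize once, ship only byte[] across the shuffle...
static byte[] toBytes(GenericRecord record, Schema schema) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
  new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
  encoder.flush();
  return out.toByteArray();
}

// ...and rebuild the record lazily, only when a reader actually needs it.
static GenericRecord fromBytes(byte[] bytes, Schema schema) throws IOException {
  return new GenericDatumReader<GenericRecord>(schema)
      .read(null, DecoderFactory.get().binaryDecoder(bytes, null));
}
```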


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1075: [HUDI-114]: added option to overwrite payload implementation in hoodie.properties file

2019-12-03 Thread GitBox
pratyakshsharma commented on a change in pull request #1075: [HUDI-114]: added 
option to overwrite payload implementation in hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1075#discussion_r353191882
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -228,6 +228,10 @@ public Operation convert(String value) throws 
ParameterException {
 + " source-fetch -> Transform -> Hudi Write in loop")
 public Boolean continuousMode = false;
 
+@Parameter(names = {"--update-payload-class"}, description = "Update 
payload class in hoodie.properties file if needed, "
 
 Review comment:
   @vinothchandar someone might give a different payload class by mistake as 
well. Just to be sure he/she really wants to update it in the hoodie.properties 
file, I added this flag here. The same was discussed with @n3nash.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1075: [HUDI-114]: added option to overwrite payload implementation in hoodie.properties file

2019-12-03 Thread GitBox
vinothchandar commented on a change in pull request #1075: [HUDI-114]: added 
option to overwrite payload implementation in hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1075#discussion_r353185291
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -228,6 +228,10 @@ public Operation convert(String value) throws 
ParameterException {
 + " source-fetch -> Transform -> Hudi Write in loop")
 public Boolean continuousMode = false;
 
+@Parameter(names = {"--update-payload-class"}, description = "Update 
payload class in hoodie.properties file if needed, "
 
 Review comment:
   can we do this check automatically by reading the `hoodie.properties` file 
value and only overwrite if the class name is different?  I am not sure adding 
an option is the way to go here.. 
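   A rough sketch of that automatic check (the property key and helper shape are assumptions, not the actual Hudi API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: read hoodie.properties and report whether the
// configured payload class differs from the stored one.
static boolean payloadClassDiffers(FileSystem fs, Path metaPath, String configuredClass)
    throws IOException {
  Properties props = new Properties();
  try (InputStream in = fs.open(new Path(metaPath, "hoodie.properties"))) {
    props.load(in);
  }
  String current = props.getProperty("hoodie.compaction.payload.class");  // assumed key
  return !configuredClass.equals(current);  // overwrite only when they differ
}
```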


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1076: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread GitBox
vinothchandar commented on issue #1076: [HUDI-365] Refactor hudi-cli based on 
new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1076#issuecomment-561169370
 
 
   @yanghua could you review this one please


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1076: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread GitBox
lamber-ken commented on issue #1076: [HUDI-365] Refactor hudi-cli based on new 
ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1076#issuecomment-561157290
 
 
    


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986879#comment-16986879
 ] 

lamber-ken commented on HUDI-365:
-

(y)(y)

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-12-03-20-58-20-503.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986873#comment-16986873
 ] 

lamber-ken edited comment on HUDI-365 at 12/3/19 1:00 PM:
--

hi, [~gurudatt], some tips may help you to work skillfully. The new ImportOrder 
rule splits import statements into groups, and the groups are separated by one 
blank line. These groups are 1) org.apache.hudi   2) third party imports   
3) javax   4) java   5) static

!image-2019-12-03-20-58-20-503.png!

 

 


was (Author: lamber-ken):
hi, [~gurudatt], some tips may help you to work quickly. The new ImportOrder 
rule splits import statements into groups, and the groups are separated by one 
blank line. These groups are 1) org.apache.hudi   2) third party imports   
3) javax   4) java   5) static

!image-2019-12-03-20-58-20-503.png!




 

 

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-12-03-20-58-20-503.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread Gurudatt Kulkarni (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986878#comment-16986878
 ] 

Gurudatt Kulkarni commented on HUDI-365:


[~lamber-ken] I have just raised a PR. Check it out. It wasn't much, so I did it 
just now :)

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-12-03-20-58-20-503.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-365:

Attachment: image-2019-12-03-20-58-20-503.png

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-12-03-20-58-20-503.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986873#comment-16986873
 ] 

lamber-ken commented on HUDI-365:
-

hi, [~gurudatt], some tips may help you to work quickly. The new ImportOrder 
rule splits import statements into groups, and the groups are separated by one 
blank line. These groups are 1) org.apache.hudi   2) third party imports   
3) javax   4) java   5) static

!image-2019-12-03-20-58-20-503.png!
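Concretely, an import block laid out under that rule would look like this (class names are illustrative):

```java
import org.apache.hudi.cli.HoodieCLI;           // 1) org.apache.hudi

import org.apache.hadoop.fs.Path;               // 2) third party imports

import javax.annotation.Nullable;               // 3) javax

import java.util.List;                          // 4) java

import static org.junit.Assert.assertEquals;    // 5) static
```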




 

 

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2019-12-03-20-58-20-503.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-365:

Labels: pull-request-available  (was: )

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] Guru107 opened a new pull request #1076: [HUDI-365] Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread GitBox
Guru107 opened a new pull request #1076: [HUDI-365] Refactor hudi-cli based on 
new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1076
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *To refactor hudi-cli based on new ImportOrder code style rule*
   
   ## Brief change log
   
 - *Modify ImportOrder based on checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1057: Hudi Test Suite

2019-12-03 Thread GitBox
yanghua commented on a change in pull request #1057: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/1057#discussion_r353159688
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/converter/Converter.java
 ##
 @@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.converter;
+
+import java.io.Serializable;
+import org.apache.spark.api.java.JavaRDD;
+
+/**
+ * Implementations of {@link Converter} will convert data from one format to 
another
+ *
+ * @param <I> Input Data Type
+ * @param <O> Output Data Type
+ */
+public interface Converter<I, O> extends Serializable {
 
 Review comment:
   @vinothchandar Still want to get an answer to this question.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1057: Hudi Test Suite

2019-12-03 Thread GitBox
yanghua commented on a change in pull request #1057: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/1057#discussion_r353159688
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/converter/Converter.java
 ##
 @@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.converter;
+
+import java.io.Serializable;
+import org.apache.spark.api.java.JavaRDD;
+
+/**
+ * Implementations of {@link Converter} will convert data from one format to 
another
+ *
+ * @param <I> Input Data Type
+ * @param <O> Output Data Type
+ */
+public interface Converter<I, O> extends Serializable {
 
 Review comment:
   @vinothchandar Still want to get an answer to this question.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-233) Redo log statements using SLF4J

2019-12-03 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986863#comment-16986863
 ] 

leesf commented on HUDI-233:


Is it time to start the work? [~vinoth]

> Redo log statements using SLF4J 
> 
>
> Key: HUDI-233
> URL: https://issues.apache.org/jira/browse/HUDI-233
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie, Performance
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> Currently we are not employing variable substitution aggressively in the 
> project, à la
> {code:java}
> LogManager.getLogger(SomeName.class.getName()).info("Message: {}, Detail: 
> {}", message, detail);
> {code}
> This can improve performance since the string concatenation can be deferred 
> to when the logging is actually in effect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986854#comment-16986854
 ] 

lamber-ken commented on HUDI-365:
-

(y)(y)(y)

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread Gurudatt Kulkarni (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986838#comment-16986838
 ] 

Gurudatt Kulkarni commented on HUDI-365:


[~lamber-ken] I will push it by tomorrow. I was caught up with some other 
stuff. 

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch hudi_test_suite updated: Rename module name from hudi-bench to hudi-end-to-end-tests

2019-12-03 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch hudi_test_suite
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/hudi_test_suite by this push:
 new 4ac72f4  Rename module name from hudi-bench to hudi-end-to-end-tests
4ac72f4 is described below

commit 4ac72f4f62800725a469b9a8bc8110acb0b18fb1
Author: yanghua 
AuthorDate: Tue Dec 3 19:44:15 2019 +0800

Rename module name from hudi-bench to hudi-end-to-end-tests
---
 docker/hoodie/hadoop/hive_base/Dockerfile  |  2 +-
 docker/hoodie/hadoop/hive_base/pom.xml |  4 ++--
 {hudi-bench => hudi-end-to-end-tests}/pom.xml  |  2 +-
 .../prepare_integration_suite.sh   |  4 ++--
 .../apache/hudi/e2e}/DFSDeltaWriterAdapter.java|  6 +++---
 .../apache/hudi/e2e}/DFSSparkAvroDeltaWriter.java  |  8 
 .../org/apache/hudi/e2e}/DeltaInputFormat.java |  2 +-
 .../java/org/apache/hudi/e2e}/DeltaOutputType.java |  2 +-
 .../org/apache/hudi/e2e}/DeltaWriterAdapter.java   |  4 ++--
 .../org/apache/hudi/e2e}/DeltaWriterFactory.java   | 10 -
 .../hudi/e2e}/configuration/DFSDeltaConfig.java|  6 +++---
 .../hudi/e2e}/configuration/DeltaConfig.java   |  6 +++---
 .../hudi/e2e}/converter/UpdateConverter.java   |  6 +++---
 .../java/org/apache/hudi/e2e}/dag/DagUtils.java|  9 
 .../org/apache/hudi/e2e}/dag/ExecutionContext.java |  9 
 .../java/org/apache/hudi/e2e}/dag/WorkflowDag.java |  4 ++--
 .../apache/hudi/e2e}/dag/WorkflowDagGenerator.java | 12 +--
 .../apache/hudi/e2e}/dag/nodes/BulkInsertNode.java |  6 +++---
 .../org/apache/hudi/e2e}/dag/nodes/CleanNode.java  |  4 ++--
 .../apache/hudi/e2e}/dag/nodes/CompactNode.java|  6 +++---
 .../org/apache/hudi/e2e}/dag/nodes/DagNode.java|  6 +++---
 .../apache/hudi/e2e}/dag/nodes/HiveQueryNode.java  |  8 
 .../apache/hudi/e2e}/dag/nodes/HiveSyncNode.java   |  8 
 .../org/apache/hudi/e2e}/dag/nodes/InsertNode.java | 10 -
 .../apache/hudi/e2e}/dag/nodes/RollbackNode.java   |  6 +++---
 .../hudi/e2e}/dag/nodes/ScheduleCompactNode.java   |  6 +++---
 .../hudi/e2e}/dag/nodes/SparkSQLQueryNode.java |  8 
 .../org/apache/hudi/e2e}/dag/nodes/UpsertNode.java |  8 
 .../apache/hudi/e2e}/dag/nodes/ValidateNode.java   |  6 +++---
 .../hudi/e2e}/dag/scheduler/DagScheduler.java  | 12 +--
 .../apache/hudi/e2e}/generator/DeltaGenerator.java | 24 +++---
 .../FlexibleSchemaRecordGenerationIterator.java|  2 +-
 .../GenericRecordFullPayloadGenerator.java |  2 +-
 .../GenericRecordFullPayloadSizeEstimator.java |  2 +-
 .../GenericRecordPartialPayloadGenerator.java  |  2 +-
 .../generator/LazyRecordGeneratorIterator.java |  2 +-
 .../e2e}/generator/UpdateGeneratorIterator.java|  2 +-
 .../e2e}/helpers/DFSTestSuitePathSelector.java |  2 +-
 .../hudi/e2e}/helpers/HiveServiceProvider.java |  6 +++---
 .../hudi/e2e}/job/HoodieDeltaStreamerWrapper.java  |  2 +-
 .../apache/hudi/e2e}/job/HoodieTestSuiteJob.java   | 24 +++---
 .../hudi/e2e}/reader/DFSAvroDeltaInputReader.java  |  8 
 .../hudi/e2e}/reader/DFSDeltaInputReader.java  |  2 +-
 .../e2e}/reader/DFSHoodieDatasetInputReader.java   |  2 +-
 .../e2e}/reader/DFSParquetDeltaInputReader.java|  6 +++---
 .../apache/hudi/e2e}/reader/DeltaInputReader.java  |  2 +-
 .../apache/hudi/e2e}/reader/SparkBasedReader.java  |  2 +-
 .../hudi/e2e}/writer/AvroDeltaInputWriter.java |  2 +-
 .../apache/hudi/e2e}/writer/DeltaInputWriter.java  |  2 +-
 .../org/apache/hudi/e2e}/writer/DeltaWriter.java   |  6 +++---
 .../hudi/e2e}/writer/FileDeltaInputWriter.java |  2 +-
 .../e2e}/writer/SparkAvroDeltaInputWriter.java |  2 +-
 .../org/apache/hudi/e2e}/writer/WriteStats.java|  2 +-
 .../hudi/e2e}/TestDFSDeltaWriterAdapter.java   | 16 +++
 .../apache/hudi/e2e}/TestFileDeltaInputWriter.java | 14 ++---
 .../e2e}/configuration/TestWorkflowBuilder.java| 12 +--
 .../hudi/e2e}/converter/TestUpdateConverter.java   |  4 ++--
 .../org/apache/hudi/e2e}/dag/TestComplexDag.java   | 14 +++--
 .../org/apache/hudi/e2e}/dag/TestDagUtils.java | 13 +++-
 .../org/apache/hudi/e2e}/dag/TestHiveSyncDag.java  | 14 +++--
 .../apache/hudi/e2e}/dag/TestInsertOnlyDag.java| 10 +
 .../apache/hudi/e2e}/dag/TestInsertUpsertDag.java  | 12 ++-
 .../TestGenericRecordPayloadEstimator.java |  7 ---
 .../TestGenericRecordPayloadGenerator.java | 16 ---
 .../hudi/e2e}/generator/TestWorkloadGenerator.java | 22 ++--
 .../hudi/e2e}/job/TestHoodieTestSuiteJob.java  | 22 ++--
 .../e2e}/reader/TestDFSAvroDeltaInputReader.java   |  4 ++--
 .../reader/TestDFSHoodieDatasetInputReader.java|  5 +++--
 

[jira] [Comment Edited] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986793#comment-16986793
 ] 

lamber-ken edited comment on HUDI-365 at 12/3/19 11:02 AM:
---

hello, [~gurudatt], are you still working on this? hudi-cli is the last module 
which needs to be refactored for the ImportOrder code style rule. After this 
work, we can set the severity to error. :):):)

If you don't have time to refactor, I'm happy to help you finish this one.


was (Author: lamber-ken):
hello, [~gurudatt], are you still working on this? hudi-cli is the last module 
which needs to be refactored for the ImportOrder code style rule. After this 
work, we can set the severity to error. :):):)

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986793#comment-16986793
 ] 

lamber-ken edited comment on HUDI-365 at 12/3/19 11:00 AM:
---

hello, [~gurudatt], are you still working on this? hudi-cli is the last module 
which needs to be refactored for the ImportOrder code style rule. After this 
work, we can set the severity to error. :):):)


was (Author: lamber-ken):
hello, [~gurudatt], are you still working on this? hudi-cli is the last module 
which needs to be refactored for the ImportOrder code style rule. After this 
work, we can set the severity to error. :):):)

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-365) Refactor hudi-cli based on new ImportOrder code style rule

2019-12-03 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986793#comment-16986793
 ] 

lamber-ken commented on HUDI-365:
-

hello, [~gurudatt], are you still working on this? hudi-cli is the last module 
which needs to be refactored for the ImportOrder code style rule. After this 
work, we can set the severity to error. :):):)

> Refactor hudi-cli based on new ImportOrder code style rule
> --
>
> Key: HUDI-365
> URL: https://issues.apache.org/jira/browse/HUDI-365
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Gurudatt Kulkarni
>Assignee: Gurudatt Kulkarni
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf commented on issue #1062: [HUDI-294] Delete Paths written in Cleaner plan needs to be relative to partition-path

2019-12-03 Thread GitBox
leesf commented on issue #1062: [HUDI-294] Delete Paths written in Cleaner plan 
needs to be relative to partition-path
URL: https://github.com/apache/incubator-hudi/pull/1062#issuecomment-561115499
 
 
   @bvaradar Thanks for the reminder and check. Updated the PR to address your 
comments. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1066: [checkstyle] Add ConstantName java checkstyle rule

2019-12-03 Thread GitBox
lamber-ken commented on issue #1066: [checkstyle] Add ConstantName java 
checkstyle rule
URL: https://github.com/apache/incubator-hudi/pull/1066#issuecomment-561112197
 
 
   hi, @leesf, please help to review, thanks


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2019-12-03 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986775#comment-16986775
 ] 

Pratyaksh Sharma commented on HUDI-114:
---

[~nishith29] Raised a PR for code changes, will be raising one for doc changes 
as well in some time. 

> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set once in hoodie.properties, it cannot 
> be changed. In some cases, if a code refactor is done and the jar updated, 
> one may need to pass the new payload class name.
> Also, fix picking up the payload name for datasource API. By default 
> HoodieAvroPayload is written whereas for datasource API default is 
> OverwriteLatestAvroPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties

2019-12-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-114:

Labels: pull-request-available  (was: )

> Allow for clients to overwrite the payload implementation in hoodie.properties
> --
>
> Key: HUDI-114
> URL: https://issues.apache.org/jira/browse/HUDI-114
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Reporter: Nishith Agarwal
>Assignee: Pratyaksh Sharma
>Priority: Minor
>  Labels: pull-request-available
>
> Right now, once the payload class is set once in hoodie.properties, it cannot 
> be changed. In some cases, if a code refactor is done and the jar updated, 
> one may need to pass the new payload class name.
> Also, fix picking up the payload name for datasource API. By default 
> HoodieAvroPayload is written whereas for datasource API default is 
> OverwriteLatestAvroPayload



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma opened a new pull request #1075: [HUDI-114]: added option to overwrite payload implementation in hoodie.properties file

2019-12-03 Thread GitBox
pratyakshsharma opened a new pull request #1075: [HUDI-114]: added option to 
overwrite payload implementation in hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1075
 
 
   Key changes - 
   
   1. Updated the default payload class to OverwriteWithLatestAvroPayload in 
the HoodieTableConfig and HoodieCompactionConfig classes
   2. Moved BaseAvroPayload and OverwriteWithLatestAvroPayload from the 
hudi-spark to the hudi-common module so that they can be accessed from other 
modules
   3. Added a flag which enables clients to overwrite the payload 
implementation in the hoodie.properties file.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (89f0968 -> 845e261)

2019-12-03 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 89f0968  [DOCS] Update the build source link (#1071)
 add 845e261  [MINOR] Update some urls from http to https in the README 
file (#1074)

No new revisions were added by this update.

Summary of changes:
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)



[GitHub] [incubator-hudi] yanghua merged pull request #1074: [MINOR] Update some url from http to https in the README file

2019-12-03 Thread GitBox
yanghua merged pull request #1074: [MINOR] Update some url from http to https 
in the README file
URL: https://github.com/apache/incubator-hudi/pull/1074
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua edited a comment on issue #1057: Hudi Test Suite

2019-12-03 Thread GitBox
yanghua edited a comment on issue #1057: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/1057#issuecomment-561077900
 
 
   I have pushed a new branch named `hudi_test_suite` and will keep this PR in 
sync with the branch.
   This PR is for discussion. If there is no objection, I will delete the old 
branch `hudi_test_suite_refactor`. WDYT? @vinothchandar @n3nash 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1057: Hudi Test Suite

2019-12-03 Thread GitBox
yanghua commented on issue #1057: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/1057#issuecomment-561077900
 
 
   I have pushed a new branch named `hudi_test_suite` and will keep this PR in 
sync with the branch.
   This PR is for discussion. If there is no objection, I will delete the old 
branch `hudi_test_suite_refactor`. WDYT? @vinothchandar @n3nash 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite created (now afe00ff)

2019-12-03 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


  at afe00ff  Hudi Test Suite
   - Flexible schema payload generation
   - Different types of workload generation such as inserts, upserts etc
   - Post process actions to perform validations
   - Interoperability of test suite to use HoodieWriteClient and HoodieDeltaStreamer so both code paths can be tested
   - Custom workload sequence generator
   - Ability to perform parallel operations, such as upsert and compaction

No new revisions were added by this update.



[GitHub] [incubator-hudi] arw357 commented on issue #812: KryoException: Unable to find class

2019-12-03 Thread GitBox
arw357 commented on issue #812: KryoException: Unable to find class
URL: https://github.com/apache/incubator-hudi/issues/812#issuecomment-561074111
 
 
   I tried it and it does not crash - so it's OK to be closed


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1062: [HUDI-294] Delete Paths written in Cleaner plan needs to be relative to partition-path

2019-12-03 Thread GitBox
bvaradar commented on issue #1062: [HUDI-294] Delete Paths written in Cleaner 
plan needs to be relative to partition-path
URL: https://github.com/apache/incubator-hudi/pull/1062#issuecomment-561057985
 
 
   > > Thanks @leesf : This is a good starting point. You would need to make 
further changes in
   > > 
   > > 1. HoodieCleanHelper.getDeletePaths needs to return only relative paths.
   > > 2. HoodieCopyOnWriteTable.scheduleClean to return only relative paths.
   > > 3. (1) and (2) will ensure HoodieCleanClient.scheduleClean will store 
cleaner plans with relative paths.
   > > 4. IncrementalTimelineSyncFileSystemView.addCleanInstant needs to handle 
relative partition paths
   > 
   > Hi @bvaradar. Currently `HoodieCleanHelper.getDeletePaths` (point 1) and 
`HoodieCopyOnWriteTable.scheduleClean` (point 2) return relative paths already. 
Also, currently `IncrementalTimelineSyncFileSystemView.addCleanInstant` only 
handles relative paths. Please correct me if I am wrong.
   > 
   > And I addressed other comments.
   
   @leesf : Sorry, my bad regarding the scheduleClean comments. I confused 
HoodieCleanerPlan with HoodieCleanMetadata. HoodieCleanerPlan is newly added as 
part of 0.5.1, stores relative paths, and does not require migration. 
HoodieCleanMetadata, though, requires migration. I looked at 
IncrementalTimelineSyncFileSystemView.addCleanInstant and believe this would 
need more handling. This method calls a common method, 
IncrementalTimelineSyncFileSystemView.removeFileSlicesForPartition, which is 
shared by the rollback/restore code and assumes full paths.
   
   ```
   FileStatus[] statuses = paths.stream().map(p -> {
     FileStatus status = new FileStatus();
     status.setPath(new Path(p));
     return status;
   }).toArray(FileStatus[]::new);
   ```
   To make it easy, you might need to create absolute paths in addCleanInstant 
before calling removeFileSlicesForPartition. Can you check this part of the 
code to see if it makes sense.
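   A hedged sketch of that adaptation (all identifiers here are assumptions based on the snippet above, not the actual method signatures):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.Path;

// Inside a hypothetical addCleanInstant: turn the cleaner plan's relative
// file names into absolute paths before reusing the shared removal helper,
// which assumes full paths.
List<String> fullPaths = relativePaths.stream()
    .map(name -> new Path(new Path(basePath, partitionPath), name).toString())
    .collect(Collectors.toList());
removeFileSlicesForPartition(timeline, instant, partitionPath, fullPaths);
```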
   
   Thanks for doing this change.
   
   Balaji.V


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services