[GitHub] [incubator-hudi] vinothchandar commented on issue #1550: Hudi 0.5.2 inability save complex type with nullable = true [SUPPORT]

2020-05-06 Thread GitBox


vinothchandar commented on issue #1550:
URL: https://github.com/apache/incubator-hudi/issues/1550#issuecomment-625044161


   You can follow this here btw 
https://lists.apache.org/thread.html/r1fb5ad5547f55f40b20306dac90a711c9c0e29f6855f63b6b2118987%40%3Cdev.hudi.apache.org%3E
 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nandini57 edited a comment on issue #1582: [SUPPORT] PreCombineAndUpdate in Payload

2020-05-06 Thread GitBox


nandini57 edited a comment on issue #1582:
URL: https://github.com/apache/incubator-hudi/issues/1582#issuecomment-624789029


   Probably switching to plain parquet instead of hudi and doing a spark.read.parquet(partitionPath).dropDuplicates() filtered on commit_time = X is an option? The following works if I want to go back to commit X and get a view of the data as of that commit. However, the same read with the hudi format doesn't give me the right view as of commit X.
   
   def audit(spark: SparkSession, partitionPath: String, tablePath: String, commitTime: String): Unit = {
     val hoodieROViewDF = spark.read.option("inferSchema", true).parquet(tablePath + "/" + partitionPath)
     hoodieROViewDF.createOrReplaceTempView("hoodie_ro")
     spark.sql("select * from hoodie_ro where _hoodie_commit_time = " + commitTime).dropDuplicates().show()
   }
   
   Did a bit of digging, and the following code in HoodieROTablePathFilter seems to take only the latest base file, dropping the other file versions. The impact in my case is that I get an incorrect view as of time X, because it reads only the latest file, which has 2 records as of time X while the third was upserted and got a new commit time. Is that understanding correct?
   
   How do I get around this? Can I use a custom path filter?
   
   HoodieTableMetaClient metaClient = new HoodieTableMetaClient(fs.getConf(), baseDir.toString());
   HoodieTableFileSystemView fsView = new HoodieTableFileSystemView(metaClient,
       metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(), fs.listStatus(folder));
   List<HoodieBaseFile> latestFiles = fsView.getLatestBaseFiles().collect(Collectors.toList());
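   
   A possible workaround sketch over the plain parquet read above (hypothetical, not a Hudi API): keep, per record key, only the newest row whose `_hoodie_commit_time` is at or before the target commit. Column names are the Hudi metadata fields visible in the comment; everything else is illustrative.
   
   ```java
   import static org.apache.spark.sql.functions.col;
   import static org.apache.spark.sql.functions.row_number;
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.expressions.Window;
   import org.apache.spark.sql.expressions.WindowSpec;
   
   public class AsOfCommitSketch {
     // Approximate a "view as of commit X" over raw parquet files: drop rows written after X,
     // then keep the newest remaining version of each record key.
     public static Dataset<Row> viewAsOf(SparkSession spark, String tablePath, String partitionPath, String commitTime) {
       Dataset<Row> all = spark.read().parquet(tablePath + "/" + partitionPath);
       WindowSpec byKey = Window.partitionBy("_hoodie_record_key")
           .orderBy(col("_hoodie_commit_time").desc());
       return all.filter(col("_hoodie_commit_time").leq(commitTime))
           .withColumn("_rank", row_number().over(byKey))
           .filter(col("_rank").equalTo(1))
           .drop("_rank");
     }
   }
   ```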



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch hudi_test_suite_refactor updated (33590b7 -> 6f4547d)

2020-05-06 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 33590b7  [MINOR] Code cleanup for DeltaConfig
 add 6f4547d  [MINOR] Code cleanup for dag package

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/testsuite/dag/DagUtils.java   | 6 +++---
 .../src/main/java/org/apache/hudi/testsuite/dag/nodes/DagNode.java  | 6 +++---
 .../java/org/apache/hudi/testsuite/dag/nodes/SparkSQLQueryNode.java | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)



[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-06 Thread GitBox


yanghua commented on a change in pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r421223899



##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/dag/nodes/InsertNode.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite.dag.nodes;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.testsuite.configuration.DeltaConfig.Config;
+import org.apache.hudi.testsuite.dag.ExecutionContext;
+import org.apache.hudi.testsuite.generator.DeltaGenerator;
+import org.apache.hudi.testsuite.writer.DeltaWriter;
+
+import org.apache.spark.api.java.JavaRDD;
+
+public class InsertNode extends DagNode<JavaRDD<WriteStatus>> {
+
+  public InsertNode(Config config) {
+    this.config = config;
+  }
+
+  @Override
+  public void execute(ExecutionContext executionContext) throws Exception {
+    generate(executionContext.getDeltaGenerator());
+    log.info("Configs => " + this.config);
+    if (!config.isDisableIngest()) {
+      log.info(String.format("- inserting input data %s --", this.getName()));
+      Option<String> commitTime = executionContext.getDeltaWriter().startCommit();
+      JavaRDD<WriteStatus> writeStatus = ingest(executionContext.getDeltaWriter(), commitTime);
+      executionContext.getDeltaWriter().commit(writeStatus, commitTime);
+      this.result = writeStatus;
+    }
+    validate();

Review comment:
   What's the meaning of this invocation? We did not use its result, right?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #270

2020-05-06 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.36 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-06 Thread GitBox


yanghua commented on a change in pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r421219556



##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/converter/UpdateConverter.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite.converter;
+
+import org.apache.hudi.testsuite.generator.LazyRecordGeneratorIterator;
+import org.apache.hudi.testsuite.generator.UpdateGeneratorIterator;
+import org.apache.hudi.utilities.converter.Converter;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.List;
+
+/**
+ * This converter creates an update {@link GenericRecord} from an existing 
{@link GenericRecord}.
+ */
+public class UpdateConverter implements Converter {

Review comment:
   Still have a question about `Converter`: we removed it from the git history. Why reintroduce it for only one use case? IMO, if it is not in our plan, we should remove it. cc @vinothchandar 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nandini57 edited a comment on issue #1582: [SUPPORT] PreCombineAndUpdate in Payload

2020-05-06 Thread GitBox


nandini57 edited a comment on issue #1582:
URL: https://github.com/apache/incubator-hudi/issues/1582#issuecomment-624789029


   Probably switching to plain parquet instead of hudi and doing a spark.read.parquet(partitionPath).dropDuplicates() filtered on commit_time = X is an option? The following works if I want to go back to commit X and get a view of the data as of that commit. However, the same read with the hudi format doesn't give me the right view as of commit X.
   
   def audit(spark: SparkSession, partitionPath: String, tablePath: String, commitTime: String): Unit = {
     val hoodieROViewDF = spark.read.option("inferSchema", true).parquet(tablePath + "/" + partitionPath)
     hoodieROViewDF.createOrReplaceTempView("hoodie_ro")
     spark.sql("select * from hoodie_ro where _hoodie_commit_time = " + commitTime).dropDuplicates().show()
   }
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken edited a comment on pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-05-06 Thread GitBox


lamber-ken edited a comment on pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#issuecomment-622469297


   Hi @vinothchandar @nsivabalan, thanks for your review; all review comments are addressed.
   
   ### SYNC
   
   | task | status |
   | ---- | ---- |
   | fix an implicit bug which causes input record repeat | done and junit covered |
   | implement fetchRecordLocation | done and junit covered |
   | junit tests for TestHoodieBloomIndexV2 | done |
   | junit test for TestHoodieBloomRangeInfoHandle | done |
   | revert HoodieTimer | done |
   | global HoodieBloomIndexV2 | https://issues.apache.org/jira/browse/HUDI-787 |
   
   > We wanted a Global index version for HoodieBloomIndexV2
   
   I'm not familiar with how `Global Index` works, so I split it into a separate issue and will try to implement it there.
   
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101207#comment-17101207
 ] 

Yanjia Gary Li commented on HUDI-494:
-

 

Commit 1:
{code:java}

"partitionToWriteStats" : {
"year=2020/month=5/day=0/hour=0" : [ {
  "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-112-1773_20200504101048.parquet",
  "prevCommit" : "null",
  "numWrites" : 21,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 21,
  "totalWriteBytes" : 14397559,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14397559
}
{code}
Commit 2:
{code:java}
  "partitionToWriteStats" : {
"year=2020/month=5/day=0/hour=0" : [ {
  "fileId" : "4aee295a-4bbd-4c74-ba49-f6d50f489524-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/4aee295a-4bbd-4c74-ba49-f6d50f489524-0_0-248-163129_20200505023830.parquet",
  "prevCommit" : "20200504101048",
  "numWrites" : 12817,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 12796,
  "totalWriteBytes" : 16297335,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 16297335
}, {
  "fileId" : "9d0c9e79-00dd-41d2-a217-0944f8428e1c-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/9d0c9e79-00dd-41d2-a217-0944f8428e1c-0_1-248-163130_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 200,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 200,
  "totalWriteBytes" : 14428883,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14428883
}, {
  "fileId" : "5990beb4-bd0c-40c9-84f1-a4107287971e-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/5990beb4-bd0c-40c9-84f1-a4107287971e-0_2-248-163131_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 198,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 198,
  "totalWriteBytes" : 14428338,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14428338
}, {
  "fileId" : "673c5550-39c3-4611-ac68-bc0c7da065e2-0",
  "path" : 
"year=2020/month=5/day=0/hour=0/673c5550-39c3-4611-ac68-bc0c7da065e2-0_3-248-163132_20200505023830.parquet",
  "prevCommit" : "null",
  "numWrites" : 179,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 179,
  "totalWriteBytes" : 14425571,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "year=2020/month=5/day=0/hour=0",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 14425571
}
{code}
 

 

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> 

[GitHub] [incubator-hudi] codecov-io commented on pull request #1595: [MINOR] Fix travis testing and silence unit tests

2020-05-06 Thread GitBox


codecov-io commented on pull request #1595:
URL: https://github.com/apache/incubator-hudi/pull/1595#issuecomment-624855201


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1595?src=pr=h1) 
Report
   > Merging 
[#1595](https://codecov.io/gh/apache/incubator-hudi/pull/1595?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/366bb10d8c4fe98424f09a6cf6f4aee7716451a4=desc)
 will **decrease** coverage by `55.19%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1595/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1595?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1595       +/-   ##
   =============================================
   - Coverage     71.87%   16.68%   -55.20%     
   + Complexity     1087      795      -292     
   =============================================
     Files           385      340       -45     
     Lines         16553    14996     -1557     
     Branches       1663     1494      -169     
   =============================================
   - Hits          11898     2502     -9396     
   - Misses         3928    12164     +8236     
   + Partials        727      330      -397     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1595?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [305 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1595/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1595?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1595?src=pr=footer).
 Last update 


[GitHub] [incubator-hudi] allenerb opened a new pull request #1597: Added a MultiFormatTimestampBasedKeyGenerator that allows for multipl…

2020-05-06 Thread GitBox


allenerb opened a new pull request #1597:
URL: https://github.com/apache/incubator-hudi/pull/1597


   …e date/time pattern matching. This is an alternative to the TimestampBasedKeyGenerator, which supports only a single format for parsing datetime strings.
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Add in functionality to support multiple date/time format parsing when using 
a timestamp as a partitioner field
   
   ## Brief change log
   * Added MultiFormatTimestampBasedKeyGenerator
   * Added Unit Tests to support MultiFormatTimestampBasedKeyGenerator
   
   ## Verify this pull request
   This change added tests and can be verified as follows:
   * Added TestMultiFormatTimestampBasedKeyGenerator Unit Tests
   * Manually verified that the code actually works when deployed to a cluster
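   
   The core idea is sketched below (illustrative only, not the actual MultiFormatTimestampBasedKeyGenerator from this PR; the class name and pattern handling are assumptions): try each configured pattern in order and use the first one that parses.
   
   ```java
   import java.text.ParseException;
   import java.text.SimpleDateFormat;
   import java.util.ArrayList;
   import java.util.Date;
   import java.util.List;
   
   public class MultiFormatTimestampParser {
     private final List<SimpleDateFormat> candidateFormats = new ArrayList<>();
   
     public MultiFormatTimestampParser(String... patterns) {
       for (String pattern : patterns) {
         candidateFormats.add(new SimpleDateFormat(pattern));
       }
     }
   
     // Returns the first successful parse; throws if no configured pattern matches.
     public Date parse(String timestamp) {
       for (SimpleDateFormat format : candidateFormats) {
         try {
           return format.parse(timestamp);
         } catch (ParseException ignored) {
           // try the next pattern
         }
       }
       throw new IllegalArgumentException("Timestamp '" + timestamp + "' matches none of the configured patterns");
     }
   }
   ```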
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101055#comment-17101055
 ] 

Yanjia Gary Li commented on HUDI-494:
-

Ok, I see what happened here. Root cause is 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L214]

So basically commit 1 wrote a very small file (let's say 200 records) to a new partition day=05. Then, when commit 2 tries to write to day=05, it looks up the affected partition and uses the Bloom index range from the existing files, so it uses 200 here. Commit 2 has far more than 200 records, so it creates tons of files because the Bloom index range is too small.

I am not really familiar with the indexing part of the code. Please let me know 
if I understand this correctly and we can figure out a fix. [~lamber-ken] 
[~vinoth]
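
One way to read the commit metadata pasted above, as a back-of-the-envelope illustration of this hypothesis (the ~120 MB max file size is an assumed default, and this is not the actual Hudi sizing code):
{code:java}
public class RecordSizeEstimateSketch {
  public static void main(String[] args) {
    // Commit 1 wrote 21 records in a ~14 MB file, so a per-record size estimate derived from it
    // is dominated by parquet/bloom-filter overhead rather than real record size.
    long commit1Bytes = 14_397_559L;                              // totalWriteBytes of commit 1
    long commit1Records = 21;                                     // numWrites of commit 1
    long estimatedBytesPerRecord = commit1Bytes / commit1Records; // ~685 KB per record
    long maxFileSizeBytes = 120L * 1024 * 1024;                   // assumed max parquet file size
    long recordsPerNewFile = maxFileSizeBytes / estimatedBytesPerRecord;
    System.out.println("Estimated records per new file in commit 2: " + recordsPerNewFile); // ~183
    // which lines up with the 179-200 record files commit 2 actually wrote in this partition
  }
}
{code}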

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Status: Open  (was: New)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-494:
---

Assignee: Yanjia Gary Li  (was: Vinoth Chandar)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

2020-05-06 Thread Yixue (Andrew) Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101034#comment-17101034
 ] 

Yixue (Andrew) Zhu edited comment on HUDI-603 at 5/6/20, 5:33 PM:
--

I just started working on this and came up with a seemingly reasonable approach:

If a DeltaStreamer configuration is enabled (continuous mode plus a new config "providerSchemaChangeSupported"), then when the provider schema changes, restart the program so it picks up the new schema.
 We can use this option for a couple of reasons:
 # Spark serialization of Avro records and schemas is optimized when schemas are registered before the program is executed, i.e. before executors are spawned by the driver. If we refresh the schema without recreating the SparkConf, which Spark does not support without restarting the program, that serialization optimization would be defeated.
 # Table schema updates are infrequent.

By throwing an exception in DeltaSync::syncOnce(), the following Spark configuration would restart the program:
   --conf spark.yarn.maxAppAttempts
   --conf spark.yarn.am.attemptFailuresValidityInterval
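
A minimal sketch of the restart-on-schema-change idea described above (illustrative only, not the actual DeltaSync code; the class and method names are assumptions):
{code:java}
import org.apache.avro.Schema;

// Remember the schema the writer was started with, compare it against the freshly fetched
// provider schema on each sync round, and fail the attempt so that the YARN re-attempt
// settings above restart the application with the new schema.
public class SchemaChangeGuard {
  private final Schema initialSchema;

  public SchemaChangeGuard(Schema initialSchema) {
    this.initialSchema = initialSchema;
  }

  public void checkForSchemaChange(Schema latestProviderSchema) {
    if (!initialSchema.equals(latestProviderSchema)) {
      // Throwing from the sync loop ends the current attempt; spark.yarn.maxAppAttempts and
      // spark.yarn.am.attemptFailuresValidityInterval let YARN start a fresh attempt.
      throw new IllegalStateException("Provider schema changed; restarting to pick up the new schema");
    }
  }
}
{code}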


was (Author: yx3...@gmail.com):
I just started working on this, and come up with a seemingly reasonable 
approach:

If some delta stream configuration enabled, when provider schema changed, 
restart the program to use the new schema.
We can use this option for a couple of reasons:
 # Spark serialization of Avro record and schema is optimized when schemas are 
registered before program is executed, i.e. executors are spawned by the driver.
If we refresh schema w/o recreating SparkConf, which is not supported by Spark 
without restating the program, the serialization optimization would be defeated.
 # It is not frequent for table schema to be updated.

By throwing exception in the DeltaSync::syncOnce(), the following Spark 
configuration would restart the program:
  --conf spark.yarn.max.maxAppAttempts
  --conf spark.yarn.am.attemptFailuresValidityInterval

> HoodieDeltaStreamer should periodically fetch table schema update
> -
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Yixue Zhu
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: evolution, pull-request-available, schema
>
> HoodieDeltaStreamer create SchemaProvider instance and delegate to DeltaSync 
> for periodical sync. However, default implementation of SchemaProvider does 
> not refresh schema, which can change due to schema evolution. DeltaSync 
> snapshot the schema when it creates writeClient, using the SchemaProvider 
> instance or pick up from source, and the schema for writeClient is not 
> refreshed during the loop of Sync.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nandini57 edited a comment on issue #1582: [SUPPORT] PreCombineAndUpdate in Payload

2020-05-06 Thread GitBox


nandini57 edited a comment on issue #1582:
URL: https://github.com/apache/incubator-hudi/issues/1582#issuecomment-624783699


   Hi Balaji,
   If I rephrase the problem: at commit X, I have 3 records. I do an upsert and change 1 record at commit X+1. As of commit X+1, I have the latest view of the data, which is fine. But as of commit X, I don't get the view of the data as it was at commit X (the original 3 records). When I do a spark read with the parquet format, I get a full view which includes both X and X+1, but how do I filter as of X?
   
   
   | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | name | team | ts | recordKey | batchId | uniqueCd |
   |---|---|---|---|---|---|---|---|---|---|---|---|
   | 20200506122159 | 20200506122159_0_4 | 5 | -464907625 | c7fd919f-0152-438f-b4b6-882464570f67-0_0-34-36_20200506122159.parquet | 5 | sandeep-upserted | hudi | 2 | -464907625 | 18 | 18_5 |
   | 20200506122143 | 20200506122143_0_2 | 1 | -464907625 | c7fd919f-0152-438f-b4b6-882464570f67-0_0-5-5_20200506122143.parquet | 1 | mandeep | hudi | 1 | -464907625 | 15 | 15_1 |
   | 20200506122143 | 20200506122143_0_3 | 2 | -464907625 | c7fd919f-0152-438f-b4b6-882464570f67-0_0-5-5_20200506122143.parquet | 2 | jhandeep | modi | 2 | -464907625 | 15 | 15_2 |
   | 20200506122143 | 20200506122143_0_1 | 5 | -464907625 | c7fd919f-0152-438f-b4b6-882464570f67-0_0-5-5_20200506122143.parquet | 5 | sandeep | hudi | 1 | -464907625 | 15 | 15_5 |
   | 20200506122143 | 20200506122143_0_2 | 1 | -464907625 | c7fd919f-0152-438f-b4b6-882464570f67-0_0-5-5_20200506122143.parquet | 1 | mandeep | hudi | 1 | -464907625 | 15 | 15_1 |
   | 20200506122143 | 20200506122143_0_3 | 2 | -464907625 | c7fd919f-0152-438f-b4b6-882464570f67-0_0-5-5_20200506122143.parquet | 2 | jhandeep | modi | 2 | -464907625 | 15 | 15_2 |
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nandini57 commented on issue #1582: [SUPPORT] PreCombineAndUpdate in Payload

2020-05-06 Thread GitBox


nandini57 commented on issue #1582:
URL: https://github.com/apache/incubator-hudi/issues/1582#issuecomment-624783699


   Hi Balaji,
   If I rephrase the problem: at commit X, I have 3 records. I do an upsert and change 1 record at commit X+1. As of commit X+1, I have the latest view of the data, which is fine. But as of commit X, I don't get the view of the data as it was at commit X (the original 3 records). When I do a spark read with the parquet format, I get a full view which includes both X and X+1, but how do I filter as of X?
   
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100967#comment-17100967
 ] 

Yanjia Gary Li commented on HUDI-494:
-

Hi folks, this issue seems to be coming back again...

!example2_hdfs.png!

!example2_sparkui.png!

A very small (2 GB) upsert job creates 60,000+ files in a single partition and gets stuck for 10+ hours. I believe there might be a bug in the bloom indexing stage.

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Attachment: example2_hdfs.png

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-06 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Attachment: example2_sparkui.png

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2020-05-06 Thread Roland Johann (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100914#comment-17100914
 ] 

Roland Johann commented on HUDI-864:


The issue is in Parquet (PARQUET-952), which has been fixed in version 1.11.

Is it possible to upgrade Hudi and Spark to Parquet 1.11? Otherwise we need to ask the Parquet community for a backport to 1.10.x.


> parquet schema conflict: optional binary  (UTF8) is not a group
> ---
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Priority: Major
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryResults",
>   "type": {
> "type": "array",
> "elementType": {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryId",
>   "type": "string",
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> },
> "containsNull": true
>   },
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] 
> commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at 
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1554: [HUDI-704]Add test for RepairsCommand

2020-05-06 Thread GitBox


yanghua commented on a change in pull request #1554:
URL: https://github.com/apache/incubator-hudi/pull/1554#discussion_r420784920



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##
@@ -64,19 +69,35 @@ public String deduplicate(
   @CliOption(key = {"repairedOutputPath"}, help = "Location to place the 
repaired files",
   mandatory = true) final String repairedOutputPath,
   @CliOption(key = {"sparkProperties"}, help = "Spark Properties File 
Path",
-  mandatory = true) final String sparkPropertiesPath)
+  unspecifiedDefaultValue = "") String sparkPropertiesPath,
+  @CliOption(key = "sparkMaster", unspecifiedDefaultValue = "", help = 
"Spark Master ") String master,

Review comment:
   `"Spark Master "` -> `"Spark Master"`?

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##
@@ -64,19 +69,35 @@ public String deduplicate(
   @CliOption(key = {"repairedOutputPath"}, help = "Location to place the 
repaired files",
   mandatory = true) final String repairedOutputPath,
   @CliOption(key = {"sparkProperties"}, help = "Spark Properties File 
Path",
-  mandatory = true) final String sparkPropertiesPath)
+  unspecifiedDefaultValue = "") String sparkPropertiesPath,
+  @CliOption(key = "sparkMaster", unspecifiedDefaultValue = "", help = 
"Spark Master ") String master,
+  @CliOption(key = "sparkMemory", unspecifiedDefaultValue = "4G",
+  help = "Spark executor memory") final String sparkMemory,
+  @CliOption(key = {"dryrun"},
+  help = "Should we actually remove duplicates or just run and store 
result to repairedOutputPath",
+  unspecifiedDefaultValue = "true") final boolean dryRun)
   throws Exception {
+if (StringUtils.isNullOrEmpty(sparkPropertiesPath)) {
+  sparkPropertiesPath =
+  
Utils.getDefaultPropertiesFile(JavaConverters.mapAsScalaMapConverter(System.getenv()).asScala());
+}
+
 SparkLauncher sparkLauncher = SparkUtil.initLauncher(sparkPropertiesPath);
-sparkLauncher.addAppArgs(SparkMain.SparkCommand.DEDUPLICATE.toString(), 
duplicatedPartitionPath, repairedOutputPath,
-HoodieCLI.getTableMetaClient().getBasePath());
+sparkLauncher.addAppArgs(SparkMain.SparkCommand.DEDUPLICATE.toString(), 
master, sparkMemory,

Review comment:
   The same suggestion as before: should we try to define a data structure for 
these arguments? We can refactor it later.

##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##
@@ -73,8 +73,8 @@ public static void main(String[] args) throws Exception {
 returnCode = rollback(jsc, args[1], args[2]);
 break;
   case DEDUPLICATE:
-assert (args.length == 4);
-returnCode = deduplicatePartitionPath(jsc, args[1], args[2], args[3]);
+assert (args.length == 7);
+returnCode = deduplicatePartitionPath(jsc, args[3], args[4], args[5], 
args[6]);

Review comment:
   IMHO, we also need to refactor the argument parsing, but not in this PR.
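   For illustration only, the "data structure" suggested in these comments could 
look roughly like the sketch below; the class and field names are invented for 
this example and are not part of the PR.

   ```java
   // Hedged sketch of an argument holder for the deduplicate Spark command.
   // All names here are illustrative, not taken from the actual PR.
   public class DeduplicateArgs {
     private final String master;
     private final String sparkMemory;
     private final String duplicatedPartitionPath;
     private final String repairedOutputPath;
     private final String basePath;

     public DeduplicateArgs(String master, String sparkMemory, String duplicatedPartitionPath,
         String repairedOutputPath, String basePath) {
       this.master = master;
       this.sparkMemory = sparkMemory;
       this.duplicatedPartitionPath = duplicatedPartitionPath;
       this.repairedOutputPath = repairedOutputPath;
       this.basePath = basePath;
     }

     // One place that defines the positional order, so RepairsCommand (serializing)
     // and SparkMain (parsing) cannot drift apart.
     public String[] toAppArgs(String command) {
       return new String[] {command, master, sparkMemory,
           duplicatedPartitionPath, repairedOutputPath, basePath};
     }

     public static DeduplicateArgs fromAppArgs(String[] args) {
       // args[0] is the command name written by toAppArgs
       return new DeduplicateArgs(args[1], args[2], args[3], args[4], args[5]);
     }
   }
   ```

   In such a sketch, RepairsCommand would build the launcher arguments via 
toAppArgs(...) and SparkMain would read them back with fromAppArgs(...), instead 
of asserting on args.length at each call site.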

##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestRepairsCommand.java
##
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.integ;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.commands.RepairsCommand;
+import org.apache.hudi.cli.commands.TableCommand;
+import org.apache.hudi.common.HoodieClientTestUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import 

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1574: [HUDI-701]Add unit test for HDFSParquetImportCommand

2020-05-06 Thread GitBox


yanghua commented on a change in pull request #1574:
URL: https://github.com/apache/incubator-hudi/pull/1574#discussion_r420772541



##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestHDFSParquetImportCommand.java
##
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.integ;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.commands.TableCommand;
+import org.apache.hudi.common.HoodieClientTestUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.utilities.HDFSParquetImporter;
+import org.apache.hudi.utilities.TestHDFSParquetImporter;
+import org.apache.hudi.utilities.TestHDFSParquetImporter.HoodieTripModel;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.BeforeEach;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.text.ParseException;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import static org.junit.jupiter.api.Assertions.assertAll;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test class for {@link 
org.apache.hudi.cli.commands.HDFSParquetImportCommand}.
+ */
+public class ITTestHDFSParquetImportCommand extends 
AbstractShellIntegrationTest {
+
+  private Path sourcePath;
+  private Path targetPath;
+  private String tableName;
+  private String schemaFile;
+  private String tablePath;
+
+  private List insertData;
+  private TestHDFSParquetImporter importer;
+
+  @BeforeEach
+  public void init() throws IOException, ParseException {
+tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+sourcePath = new Path(basePath, "source");
+targetPath = new Path(tablePath);
+schemaFile = new Path(basePath, "file.schema").toString();
+
+// create schema file
+try (FSDataOutputStream schemaFileOS = fs.create(new Path(schemaFile))) {
+  
schemaFileOS.write(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA.getBytes());
+}
+
+importer = new TestHDFSParquetImporter();
+insertData = importer.createInsertRecords(sourcePath);
+  }
+
+  /**
+   * Test case for 'hdfsparquetimport' with insert.
+   */
+  @Test
+  public void testConvertWithInsert() throws IOException {
+String command = String.format("hdfsparquetimport --srcPath %s 
--targetPath %s --tableName %s "
++ "--tableType %s --rowKeyField %s" + " --partitionPathField %s 
--parallelism %s "
++ "--schemaFilePath %s --format %s --sparkMemory %s --retry %s 
--sparkMaster %s",
+sourcePath.toString(), targetPath.toString(), tableName, 
HoodieTableType.COPY_ON_WRITE.name(),
+"_row_key", "timestamp", "1", schemaFile, "parquet", "2G", "1", 
"local");
+CommandResult cr = getShell().executeCommand(command);
+
+assertAll("Command run success",
+() -> assertTrue(cr.isSuccess()),
+() -> assertEquals("Table imported to hoodie format", 
cr.getResult().toString()));
+
+// Check hudi table exist
+String metaPath = targetPath + File.separator + 
HoodieTableMetaClient.METAFOLDER_NAME;
+assertTrue(new File(metaPath).exists(), "Hoodie table not exist.");

Review comment:
   use `Files.exists(xx)`?
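   For reference, the suggested java.nio variant could look roughly like this (a 
minimal sketch; the helper name is ours, and `metaPath` is the local variable 
from the test above):

   ```java
   import java.nio.file.Files;
   import java.nio.file.Paths;

   import static org.junit.jupiter.api.Assertions.assertTrue;

   // Same existence check via java.nio instead of java.io.File, as suggested above.
   private static void assertHoodieTableExists(String metaPath) {
     assertTrue(Files.exists(Paths.get(metaPath)), "Hoodie table does not exist.");
   }
   ```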

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/HDFSParquetImportCommand.java
##
@@ -57,6 +56,7 @@ public String convert(
   @CliOption(key = "schemaFilePath", mandatory = true,
   help = "Path for Avro schema file") final String schemaFilePath,
   @CliOption(key = "format", mandatory = true, help = "Format for the 
input data") final String format,
+  @CliOption(key 

[jira] [Created] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2020-05-06 Thread Roland Johann (Jira)
Roland Johann created HUDI-864:
--

 Summary: parquet schema conflict: optional binary  
(UTF8) is not a group
 Key: HUDI-864
 URL: https://issues.apache.org/jira/browse/HUDI-864
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Roland Johann


When dealing with struct types like this

{code:json}
{
  "type": "struct",
  "fields": [
{
  "name": "categoryResults",
  "type": {
"type": "array",
"elementType": {
  "type": "struct",
  "fields": [
{
  "name": "categoryId",
  "type": "string",
  "nullable": true,
  "metadata": {}
}
  ]
},
"containsNull": true
  },
  "nullable": true,
  "metadata": {}
}
  ]
}
{code}
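
For reference, the same struct type can also be built programmatically with Spark's 
Java API. The snippet below is only an illustrative sketch of the schema above 
(class and variable names are ours), not code from the reported job:

{code:java}
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CategorySchemaSketch {
  // categoryResults: array<struct<categoryId: string>>, all nullable, matching the JSON above.
  public static StructType build() {
    StructType element = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("categoryId", DataTypes.StringType, true)});
    return DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("categoryResults",
            DataTypes.createArrayType(element, true), true)});
  }
}
{code}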

The second ingest batch throws that exception:


{code}
ERROR [Executor task launch worker for task 15] commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error upserting bucketType UPDATE for partition :0
org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
at org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:98)
... 34 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has 

[jira] [Updated] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-863:

Labels: pull-request-available  (was: )

> nested structs containing decimal types lead to null pointer exception
> --
>
> Key: HUDI-863
> URL: https://issues.apache.org/jira/browse/HUDI-863
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Priority: Major
>  Labels: pull-request-available
>
> Currently the Avro schema gets passed to
> AvroConversionHelper.createConverterToAvro, which itself processes the passed
> Spark SQL DataTypes recursively to resolve structs, arrays, etc. The Avro
> schema gets passed into the recursion, but without selecting the relevant
> field and therefore the schema of that field. That leads to a null pointer
> exception when decimal types are processed, because in that case the schema
> of the field is retrieved by calling getField on the root schema, which is
> not defined when we deal with nested records.
> [AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]
> The proposed solution is to remove the dependency on the Avro schema and
> derive the particular Avro schema for the decimal converter creator case only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] rolandjohann opened a new pull request #1596: [HUDI-863] get decimal properties from derived spark DataType

2020-05-06 Thread GitBox


rolandjohann opened a new pull request #1596:
URL: https://github.com/apache/incubator-hudi/pull/1596


   ## What is the purpose of the pull request
   This PR fixes a null pointer exception when calling the converter for the 
decimal datatype.
   
   ## Brief change log
   - get decimal properties from derived spark DataType
   
   ## Verify this pull request
   (currently) This pull request is a trivial rework / code cleanup without any 
test coverage.
   
   Later I'll implement tests for this, as soon as the issue described above is 
fixed.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on pull request #1594: [HUDI-862] Migrate HUDI site blogs from Confluence to Jeykll.

2020-05-06 Thread GitBox


lamber-ken commented on pull request #1594:
URL: https://github.com/apache/incubator-hudi/pull/1594#issuecomment-624627165


   hi @prashantwason, you can also build the site in a local env:
   ```
   // build site
   cd docs
   docker-compose build --no-cache && docker-compose up
   
   // open browser
   http://localhost:4000
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-06 Thread Roland Johann (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roland Johann updated HUDI-863:
---
Status: Open  (was: New)

> nested structs containing decimal types lead to null pointer exception
> --
>
> Key: HUDI-863
> URL: https://issues.apache.org/jira/browse/HUDI-863
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Priority: Major
>
> Currently the Avro schema gets passed to
> AvroConversionHelper.createConverterToAvro, which itself processes the passed
> Spark SQL DataTypes recursively to resolve structs, arrays, etc. The Avro
> schema gets passed into the recursion, but without selecting the relevant
> field and therefore the schema of that field. That leads to a null pointer
> exception when decimal types are processed, because in that case the schema
> of the field is retrieved by calling getField on the root schema, which is
> not defined when we deal with nested records.
> [AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]
> The proposed solution is to remove the dependency on the Avro schema and
> derive the particular Avro schema for the decimal converter creator case only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-06 Thread Roland Johann (Jira)
Roland Johann created HUDI-863:
--

 Summary: nested structs containing decimal types lead to null 
pointer exception
 Key: HUDI-863
 URL: https://issues.apache.org/jira/browse/HUDI-863
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Roland Johann


Currently the Avro schema gets passed to AvroConversionHelper.createConverterToAvro,
which itself processes the passed Spark SQL DataTypes recursively to resolve
structs, arrays, etc. The Avro schema gets passed into the recursion, but without
selecting the relevant field and therefore the schema of that field. That leads to
a null pointer exception when decimal types are processed, because in that case the
schema of the field is retrieved by calling getField on the root schema, which is
not defined when we deal with nested records.
[AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]

The proposed solution is to remove the dependency on the Avro schema and derive
the particular Avro schema for the decimal converter creator case only.
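
As an illustration of that idea only (not the actual patch, which lives in
AvroConversionHelper.scala and is tracked in PR #1596), the decimal case could
derive its Avro schema directly from the Spark DecimalType, roughly like this:

{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.spark.sql.types.DecimalType;

public class DecimalSchemaSketch {
  // Build the Avro decimal schema from the precision/scale carried by the Spark type,
  // instead of looking the field up on the root Avro schema (which fails for nested records).
  public static Schema avroDecimalSchemaFor(DecimalType dt) {
    return LogicalTypes.decimal(dt.precision(), dt.scale())
        .addToSchema(Schema.create(Schema.Type.BYTES));
  }
}
{code}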



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on pull request #1594: [HUDI-862] Migrate HUDI site blogs from Confluence to Jeykll.

2020-05-06 Thread GitBox


lamber-ken commented on pull request #1594:
URL: https://github.com/apache/incubator-hudi/pull/1594#issuecomment-624624840


   > @lamber-ken can you please review this and land ?
   
   Sure,  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on pull request #1594: [HUDI-862] Migrate HUDI site blogs from Confluence to Jeykll.

2020-05-06 Thread GitBox


lamber-ken commented on pull request #1594:
URL: https://github.com/apache/incubator-hudi/pull/1594#issuecomment-624624419


   @prashantwason thanks for driving this, good job  
   
   Quick review: https://lamber-ken.github.io/blog.html
   
   As vinoth says, for code blocks, we can use ` ``` ` syntax.
   
   
![image](https://user-images.githubusercontent.com/20113411/81177555-6e559880-8fd9-11ea-8eef-b75cd037b528.png)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch asf-site updated: Travis CI build asf-site

2020-05-06 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new f1d9a80  Travis CI build asf-site
f1d9a80 is described below

commit f1d9a8088b8bc46242325dc1eb849ad96c240a60
Author: CI 
AuthorDate: Wed May 6 11:23:13 2020 +

Travis CI build asf-site
---
 content/docs/use_cases.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/use_cases.html b/content/docs/use_cases.html
index 640c785..01ea086 100644
--- a/content/docs/use_cases.html
+++ b/content/docs/use_cases.html
@@ -382,7 +382,7 @@ Unfortunately, in today’s post-mobile  pre-IoT world, 
late data f
 In such cases, the only remedy to guarantee correctness is to https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data;>reprocess
 the last few hours worth of data,
 over and over again each hour, which can significantly hurt the efficiency 
across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data 
every hour across hundreds of workflows.
 
-Hudi comes to the rescue again, by providing a way to consume new data 
(including late data) from an upsteam Hudi table HU at a record granularity (not 
folders/partitions),
+Hudi comes to the rescue again, by providing a way to consume new data 
(including late data) from an upstream Hudi table HU at a record granularity (not 
folders/partitions),
 apply the processing logic, and efficiently update/reconcile late data with a 
downstream Hudi table HD. Here, HU and HD can be continuously scheduled at a much 
more frequent schedule
 like 15 mins, and providing an end-end latency of 30 mins at HD.
 



[incubator-hudi] branch asf-site updated: fix typo (#1587)

2020-05-06 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e5cb883  fix typo (#1587)
e5cb883 is described below

commit e5cb883a8440d53e97f3837b583831eef7db2ae5
Author: wanglisheng81 <37138788+wanglishen...@users.noreply.github.com>
AuthorDate: Wed May 6 19:16:28 2020 +0800

fix typo (#1587)
---
 docs/_docs/1_3_use_cases.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/1_3_use_cases.md b/docs/_docs/1_3_use_cases.md
index 25b35bb..c6a5623 100644
--- a/docs/_docs/1_3_use_cases.md
+++ b/docs/_docs/1_3_use_cases.md
@@ -48,7 +48,7 @@ Unfortunately, in today's post-mobile & pre-IoT world, __late 
data from intermit
 In such cases, the only remedy to guarantee correctness is to [reprocess the 
last few 
hours](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data)
 worth of data,
 over and over again each hour, which can significantly hurt the efficiency 
across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data 
every hour across hundreds of workflows.
 
-Hudi comes to the rescue again, by providing a way to consume new data 
(including late data) from an upsteam Hudi table `HU` at a record granularity 
(not folders/partitions),
+Hudi comes to the rescue again, by providing a way to consume new data 
(including late data) from an upstream Hudi table `HU` at a record granularity 
(not folders/partitions),
 apply the processing logic, and efficiently update/reconcile late data with a 
downstream Hudi table `HD`. Here, `HU` and `HD` can be continuously scheduled 
at a much more frequent schedule
 like 15 mins, and providing an end-end latency of 30 mins at `HD`.
 



[jira] [Closed] (HUDI-812) Migrate hudi-common tests to JUnit 5

2020-05-06 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-812.
-
Resolution: Done

Done via master branch: 366bb10d8c4fe98424f09a6cf6f4aee7716451a4

> Migrate hudi-common tests to JUnit 5
> 
>
> Key: HUDI-812
> URL: https://issues.apache.org/jira/browse/HUDI-812
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch asf-site updated: Travis CI build asf-site

2020-05-06 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 206e549  Travis CI build asf-site
206e549 is described below

commit 206e549637101c6fe49aaed915c41f21ad884e81
Author: CI 
AuthorDate: Wed May 6 08:42:27 2020 +

Travis CI build asf-site
---
 content/docs/comparison.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/comparison.html b/content/docs/comparison.html
index 3919cbd..223e9f7 100644
--- a/content/docs/comparison.html
+++ b/content/docs/comparison.html
@@ -349,7 +349,7 @@ Consequently, Kudu does not support incremental pulling (as 
of early 2017), some
 
 Kudu diverges from a distributed file system abstraction and HDFS 
altogether, with its own set of storage servers talking to each  other via RAFT.
 Hudi, on the other hand, is designed to work with an underlying Hadoop 
compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of 
storage servers,
-instead relying on Apache Spark to do the heavy-lifting. Thu, Hudi can be 
scaled easily, just like other Spark jobs, while Kudu would require hardware
+instead relying on Apache Spark to do the heavy-lifting. Thus, Hudi can be 
scaled easily, just like other Spark jobs, while Kudu would require hardware
  operational support, typical to datastores like HBase or Vertica. We 
have not at this point, done any head to head benchmarks against Kudu (given 
RTTable is WIP).
 But, if we were to go with results shared by https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines;>CERN
 ,
 we expect Hudi to positioned at something that ingests parquet with superior 
performance.



[GitHub] [incubator-hudi] vinothchandar commented on pull request #1572: [HUDI-836] Implement datadog metrics reporter

2020-05-06 Thread GitBox


vinothchandar commented on pull request #1572:
URL: https://github.com/apache/incubator-hudi/pull/1572#issuecomment-624504748


   @xushiyan absolutely... please go ahead! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch asf-site updated: [MINOR] fix typo in comparison document (#1588)

2020-05-06 Thread lamberken
This is an automated email from the ASF dual-hosted git repository.

lamberken pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 183aac0  [MINOR] fix typo in comparison document (#1588)
183aac0 is described below

commit 183aac0cbc186b14b557ebe3b678c320bd6fef91
Author: wanglisheng81 <37138788+wanglishen...@users.noreply.github.com>
AuthorDate: Wed May 6 16:08:03 2020 +0800

[MINOR] fix typo in comparison document (#1588)
---
 docs/_docs/1_5_comparison.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/1_5_comparison.md b/docs/_docs/1_5_comparison.md
index 78f2be2..32b73c6 100644
--- a/docs/_docs/1_5_comparison.md
+++ b/docs/_docs/1_5_comparison.md
@@ -18,7 +18,7 @@ Consequently, Kudu does not support incremental pulling (as 
of early 2017), some
 
 Kudu diverges from a distributed file system abstraction and HDFS altogether, 
with its own set of storage servers talking to each  other via RAFT.
 Hudi, on the other hand, is designed to work with an underlying Hadoop 
compatible filesystem (HDFS,S3 or Ceph) and does not have its own fleet of 
storage servers,
-instead relying on Apache Spark to do the heavy-lifting. Thu, Hudi can be 
scaled easily, just like other Spark jobs, while Kudu would require hardware
+instead relying on Apache Spark to do the heavy-lifting. Thus, Hudi can be 
scaled easily, just like other Spark jobs, while Kudu would require hardware
 & operational support, typical to datastores like HBase or Vertica. We have 
not at this point, done any head to head benchmarks against Kudu (given RTTable 
is WIP).
 But, if we were to go with results shared by 
[CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines)
 ,
 we expect Hudi to positioned at something that ingests parquet with superior 
performance.



[GitHub] [incubator-hudi] lamber-ken commented on pull request #1588: [MINOR] fix typo in comparison document

2020-05-06 Thread GitBox


lamber-ken commented on pull request #1588:
URL: https://github.com/apache/incubator-hudi/pull/1588#issuecomment-624503516


   Good catch @wanglisheng81 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1563: [SUPPORT] When I package according to the package command in GitHub, I always report an error, such as

2020-05-06 Thread GitBox


lamber-ken commented on issue #1563:
URL: https://github.com/apache/incubator-hudi/issues/1563#issuecomment-624498974


   Closing this due to inactivity.
   
   **Prerequisites for building Apache Hudi:**
   - Unix-like system (like Linux, Mac OS X)
   - Java 8 (Java 9 or 10 may work)
   - Git
   - Maven



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org