Build failed in Jenkins: hudi-snapshot-deployment-0.5 #189

2020-02-14 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.25 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an e

[GitHub] [incubator-hudi] smarthi opened a new pull request #1337: [MINOR] Code Cleanup, remove redundant code

2020-02-14 Thread GitBox
smarthi opened a new pull request #1337: [MINOR] Code Cleanup, remove redundant 
code
URL: https://github.com/apache/incubator-hudi/pull/1337
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Code cleanup - prelude to Guava elimination
   
   ## Brief change log
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-02-14 Thread GitBox
nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi 
Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-586545073
 
 
   Sure, go ahead. I also plan to review it sometime, but will let you be the primary reviewer.




[GitHub] [incubator-hudi] leesf merged pull request #1336: [MINOR] Fix some typos

2020-02-14 Thread GitBox
leesf merged pull request #1336: [MINOR] Fix some typos
URL: https://github.com/apache/incubator-hudi/pull/1336
 
 
   




[incubator-hudi] branch master updated: [MINOR] Fix some typos

2020-02-14 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new aaa6cf9  [MINOR] Fix some typos
aaa6cf9 is described below

commit aaa6cf9a989e88d3273cc0a9fffc1eaa8a6e6b92
Author: wangxianghu 
AuthorDate: Sat Feb 15 08:55:23 2020 +0800

[MINOR] Fix some typos
---
 hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java  | 8 
 .../apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java  | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java 
b/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
index 7e91df1..23055da 100644
--- a/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
+++ b/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
@@ -101,7 +101,7 @@ public class HoodieWriteClient extends AbstractHo
 
 
   /**
-   * Create a wirte client, without cleaning up failed/inflight commits.
+   * Create a write client, without cleaning up failed/inflight commits.
*
* @param jsc Java Spark Context
* @param clientConfig instance of HoodieWriteConfig
@@ -111,7 +111,7 @@ public class HoodieWriteClient extends AbstractHo
   }
 
   /**
-   * Create a wirte client, with new hudi index.
+   * Create a write client, with new hudi index.
*
* @param jsc Java Spark Context
* @param clientConfig instance of HoodieWriteConfig
@@ -127,7 +127,7 @@ public class HoodieWriteClient extends AbstractHo
   }
 
   /**
-   *  Create a wirte client, allows to specify all parameters.
+   *  Create a write client, allows to specify all parameters.
*
* @param jsc Java Spark Context
* @param clientConfig instance of HoodieWriteConfig
@@ -1054,7 +1054,7 @@ public class HoodieWriteClient extends AbstractHo
   /**
* Perform compaction operations as specified in the compaction commit file.
*
-   * @param compactionInstant Compacton Instant time
+   * @param compactionInstant Compaction Instant time
* @param activeTimeline Active Timeline
* @param autoCommit Commit after compaction
* @return RDD of Write Status
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
index 9c05f31..820875c 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
@@ -485,7 +485,7 @@ public class HoodieDeltaStreamer implements Serializable {
   }
 
   /**
-   * Async Compactor Service tha runs in separate thread. Currently, only one 
compactor is allowed to run at any time.
+   * Async Compactor Service that runs in separate thread. Currently, only one 
compactor is allowed to run at any time.
*/
   public static class AsyncCompactService extends AbstractDeltaStreamerService 
{
 



[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1336: [MINOR] Fix some typos

2020-02-14 Thread GitBox
wangxianghu opened a new pull request #1336: [MINOR] Fix some typos
URL: https://github.com/apache/incubator-hudi/pull/1336
 
 
   ## What is the purpose of the pull request
   
   *Fix some typos*
   
   ## Brief change log
   
   *Fix some typos*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset 
reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586525355
 
 
   From the stack trace, the application will read three partitions. Can you check that this range is available in Kafka?
   ```
   topic: 'test-topic'
   partition: 0, [0 -> 6667]
   partition: 1, [0 -> 6667]
   partition: 2, [0 -> ]
   ```
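   A minimal sketch (illustrative, not from the original comment) of one way to verify those partition ranges with the plain Kafka consumer API; the broker address, group id, and topic name are placeholders:
   ```
   import java.util.Properties
   import scala.collection.JavaConverters._
   import org.apache.kafka.clients.consumer.KafkaConsumer
   import org.apache.kafka.common.TopicPartition
   import org.apache.kafka.common.serialization.StringDeserializer

   object CheckTopicOffsets {
     def main(args: Array[String]): Unit = {
       val props = new Properties()
       props.put("bootstrap.servers", "localhost:9092") // placeholder broker
       props.put("group.id", "offset-range-check")      // throwaway group id
       props.put("key.deserializer", classOf[StringDeserializer].getName)
       props.put("value.deserializer", classOf[StringDeserializer].getName)

       val consumer = new KafkaConsumer[String, String](props)
       try {
         // list the topic's partitions, then fetch earliest and latest offsets
         val partitions = consumer.partitionsFor("test-topic").asScala
           .map(p => new TopicPartition(p.topic(), p.partition())).asJava
         val earliest = consumer.beginningOffsets(partitions).asScala
         val latest = consumer.endOffsets(partitions).asScala
         earliest.toSeq.sortBy(_._1.partition()).foreach { case (tp, start) =>
           println(s"partition: ${tp.partition()}, [$start -> ${latest(tp)}]")
         }
       } finally {
         consumer.close()
       }
     }
   }
   ```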




[GitHub] [incubator-hudi] lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset 
reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586520820
 
 
   I tested `0.5.1-incubating` with auto.offset.reset=earliest, and it works well in my local env.
   Spark: spark-2.4.4-bin-hadoop2.7
   Kafka: kafka_2.11-0.10.2.1
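   For illustration, a sketch of the kind of properties file such a test would pass to HoodieDeltaStreamer via `--props`; the broker, group id, and topic are placeholders, and any schema-provider or source-specific keys a real setup needs are omitted:
   ```
   # standard Kafka consumer settings, forwarded to the Kafka source
   bootstrap.servers=localhost:9092
   group.id=hudi-deltastreamer-test
   auto.offset.reset=earliest
   # Hudi DeltaStreamer Kafka source topic
   hoodie.deltastreamer.source.kafka.topic=test-topic
   ```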




[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-14 Thread GitBox
ramachandranms commented on a change in pull request #1332: [HUDI -409] Match 
header and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r379689535
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -362,13 +370,18 @@ public HoodieLogBlock prev() throws IOException {
 // blocksize should read everything about a block including the length as 
well
 try {
   inputStream.seek(reverseLogFilePosition - blockSize);
+  // get the block size from head and match it with the block size from 
tail
 
 Review comment:
   Talked offline. Will create a new JIRA to address the reverse-reading issues.




[GitHub] [incubator-hudi] popart edited a comment on issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-14 Thread GitBox
popart edited a comment on issue #1329: [SUPPORT] Presto cannot query 
non-partitioned table
URL: https://github.com/apache/incubator-hudi/issues/1329#issuecomment-586509050
 
 
   Hi Bhavani! Thank you for taking a look. I filed 
https://issues.apache.org/jira/browse/HUDI-614.
   
   Correct, the Presto version is 0.227.
   
   I tried running the spark-shell with both the Hudi 0.5.0 and the 0.5.1 jars, 
but got the same result. The EMR version has Hudi 0.5.0 installed, and I didn't 
specify anything different when running presto-cli, so I'd assume Presto is 
using the 0.5.0 version.
   
   I do see the .hoodie_partition_metadata file in my S3 table path.




[GitHub] [incubator-hudi] popart commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-14 Thread GitBox
popart commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned 
table
URL: https://github.com/apache/incubator-hudi/issues/1329#issuecomment-586509050
 
 
   Hi Bhavani! Thank you for taking a look. I filed 
https://issues.apache.org/jira/browse/HUDI-614. Correct, the Presto version is 
0.227. I tried running the spark-shell with both the Hudi 0.5.0 and the 0.5.1 
jars, but got the same result. The EMR version has Hudi 0.5.0 installed, and I 
didn't specify anything different when running presto-cli, so I'd assume Presto 
is using the 0.5.0 version.




[jira] [Created] (HUDI-614) .hoodie_partition_metadata created for non-partitioned table

2020-02-14 Thread Andrew Wong (Jira)
Andrew Wong created HUDI-614:


 Summary: .hoodie_partition_metadata created for non-partitioned 
table
 Key: HUDI-614
 URL: https://issues.apache.org/jira/browse/HUDI-614
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Andrew Wong


Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

I made a non-partitioned Hudi table using Spark. I was able to query it with 
Spark & Hive, but when I tried querying it with Presto, I received the error 
{{Could not find partitionDepth in partition metafile}}.

I attempted this task using emr-5.28.0 in AWS. I tried using the built-in 
spark-shell with both Amazon's /usr/lib/hudi/hudi-spark-bundle.jar (following 
[https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/)]
 and the org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating jar (following 
[https://hudi.apache.org/docs/quick-start-guide.html]).

I used NonpartitionedKeyGenerator & NonPartitionedExtractor in my write 
options, according to 
[https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIuseDeltaStreamerorSparkDataSourceAPItowritetoaNon-partitionedHudidataset?].
 You can see my code in the github issue linked above.
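For illustration only (not the reporter's code), a minimal sketch of a non-partitioned write configured that way; the option keys and the keygen/extractor class names are the 0.5.x ones and may differ in other releases, and the table name and S3 path are placeholders:

```
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.{SaveMode, SparkSession}

object NonPartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NonPartitionedWriteSketch").getOrCreate()
    import spark.implicits._

    // Toy data; the table name and S3 path below are placeholders.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    val hudiOptions = Map[String, String](
      HoodieWriteConfig.TABLE_NAME -> "non_partitioned_table",
      DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
      DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id",
      // key generator and Hive partition extractor for a non-partitioned table
      DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY ->
        "org.apache.hudi.NonpartitionedKeyGenerator",
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        "org.apache.hudi.hive.NonPartitionedExtractor")

    df.write.format("org.apache.hudi")
      .options(hudiOptions)
      .mode(SaveMode.Overwrite)
      .save("s3://your-bucket/path/non_partitioned_table")
  }
}
```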

In both cases I see the .hoodie_partition_metadata file was created in the 
table path in S3. Querying the table worked in spark-shell & hive-cli, but 
attempting to query the table in presto-cli resulted in the error, "Could not 
find partitionDepth in partition metafile".

Please look into the bug or check the documentation. If there is a problem with 
the EMR install I can contact the AWS team responsible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-14 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379646458
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark datasource, Spark SQL and Presto.
 
 Review comment:
   It's not released yet though, but we should file a ticket for documenting that nonetheless, when an Impala release does happen.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-14 Thread GitBox
vinothchandar commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379646458
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark datasource, Spark SQL and Presto.
 
 Review comment:
   It's not released yet though, but we should file a ticket for documenting that nonetheless.




[GitHub] [incubator-hudi] bwu2 commented on issue #1328: Hudi upsert hangs

2020-02-14 Thread GitBox
bwu2 commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-586457085
 
 
   See: https://gist.github.com/bwu2/e432a42f51519f27197f4785af3e1abf




[GitHub] [incubator-hudi] bhasudha commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-14 Thread GitBox
bhasudha commented on issue #1325: presto - querying nested object in parquet 
file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-586453479
 
 
   @adamjoneill apologies for the delayed response. Haven't gotten a chance to look at this thread. Let me also try to reproduce this and get back soon.




[GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs

2020-02-14 Thread GitBox
vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-586451527
 
 
   Even #800 is a reasonable workload.. I don't understand what's going on here. It's just a single file being versioned, same as the next two commits, which seem to be faster. Is there a way you can give me a reproducible snippet of code? (I think you mentioned this is synthetic data.)




[GitHub] [incubator-hudi] vinothchandar commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-14 Thread GitBox
vinothchandar commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-586450568
 
 
   Thanks @adamjoneill, let me try to reproduce as well and see what's going on tonight.




[GitHub] [incubator-hudi] n3nash merged pull request #1312: [HUDI-571] Add "compactions show archived" command to CLI

2020-02-14 Thread GitBox
n3nash merged pull request #1312: [HUDI-571] Add "compactions show archived" 
command to CLI
URL: https://github.com/apache/incubator-hudi/pull/1312
 
 
   




[incubator-hudi] branch master updated: [HUDI-571] Add show archived compaction(s) to CLI

2020-02-14 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 20ed251  [HUDI-571] Add show archived compaction(s) to CLI
20ed251 is described below

commit 20ed2516d38b9ce4b3e185bd89b62264b8bd3f25
Author: Satish Kotha 
AuthorDate: Wed Jan 22 13:50:34 2020 -0800

[HUDI-571] Add show archived compaction(s) to CLI
---
 .../apache/hudi/cli/commands/CommitsCommand.java   |  13 +-
 .../hudi/cli/commands/CompactionCommand.java   | 247 -
 .../java/org/apache/hudi/cli/utils/CommitUtil.java |  23 ++
 3 files changed, 216 insertions(+), 67 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
index 1e17c4c..804096b 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
@@ -21,6 +21,7 @@ package org.apache.hudi.cli.commands;
 import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
 import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.utils.CommitUtil;
 import org.apache.hudi.cli.utils.InputStreamConsumer;
 import org.apache.hudi.cli.utils.SparkUtil;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
@@ -41,10 +42,8 @@ import org.springframework.shell.core.annotation.CliOption;
 import org.springframework.stereotype.Component;
 
 import java.io.IOException;
-import java.time.ZonedDateTime;
 import java.util.ArrayList;
 import java.util.Collections;
-import java.util.Date;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
@@ -186,10 +185,10 @@ public class CommitsCommand implements CommandMarker {
   final boolean headerOnly)
   throws IOException {
 if (StringUtils.isNullOrEmpty(startTs)) {
-  startTs = getTimeDaysAgo(10);
+  startTs = CommitUtil.getTimeDaysAgo(10);
 }
 if (StringUtils.isNullOrEmpty(endTs)) {
-  endTs = getTimeDaysAgo(1);
+  endTs = CommitUtil.getTimeDaysAgo(1);
 }
 HoodieArchivedTimeline archivedTimeline = 
HoodieCLI.getTableMetaClient().getArchivedTimeline();
 try {
@@ -362,10 +361,4 @@ public class CommitsCommand implements CommandMarker {
 return "Load sync state between " + 
HoodieCLI.getTableMetaClient().getTableConfig().getTableName() + " and "
 + HoodieCLI.syncTableMetadata.getTableConfig().getTableName();
   }
-
-  private String getTimeDaysAgo(int numberOfDays) {
-Date date = 
Date.from(ZonedDateTime.now().minusDays(numberOfDays).toInstant());
-return HoodieActiveTimeline.COMMIT_FORMATTER.format(date);
-  }
-
 }
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
index 0b57947..2564931 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
@@ -26,16 +26,20 @@ import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
 import org.apache.hudi.cli.TableHeader;
 import org.apache.hudi.cli.commands.SparkMain.SparkCommand;
+import org.apache.hudi.cli.utils.CommitUtil;
 import org.apache.hudi.cli.utils.InputStreamConsumer;
 import org.apache.hudi.cli.utils.SparkUtil;
 import org.apache.hudi.common.model.HoodieTableType;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.HoodieTimeline;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieArchivedTimeline;
+import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
-import org.apache.hudi.common.table.timeline.HoodieInstant.State;
 import org.apache.hudi.common.util.AvroUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.func.OperationResult;
@@ -61,8 +65,10 @@ import java.util.List;
 import java.util.Map;
 import java.util.Set;
 import java.util.UUID;
+import java.util.function.BiFunction;
 import java.util.function.Function;
 import java.util.stream.Collectors;
+import java.util.stream.Stream;
 
 /**
  * CLI command to display compaction related options.
@@ -95,51 +101,9 @@ public class CompactionCommand implements CommandMarker {
   throws IOException {
 HoodieTableMetaClient client = checkAndGetMetaClient();
 HoodieActiveTimeline activeTimeline = client.getActiveTimeline();
-Hoodi

[GitHub] [incubator-hudi] amitsingh-10 commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
amitsingh-10 commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586408479
 
 
   Didn't help. I had already tried it, but I tried again with the `0.5.1-incubating` version. However, I am still getting the same error as mentioned in the original stack trace.
   
Is this a problem because of a mismatch between my Spark version and Hudi's Spark version? Should I try building Hudi with my Spark version?




[GitHub] [incubator-hudi] n3nash edited a comment on issue #1242: [HUDI-544] Adjust the read and write path of archive

2020-02-14 Thread GitBox
n3nash edited a comment on issue #1242: [HUDI-544] Adjust the read and write 
path of archive
URL: https://github.com/apache/incubator-hudi/pull/1242#issuecomment-586394357
 
 
   @hddong please take a look at the last comment and squash all your commits.




[GitHub] [incubator-hudi] n3nash commented on issue #1242: [HUDI-544] Adjust the read and write path of archive

2020-02-14 Thread GitBox
n3nash commented on issue #1242: [HUDI-544] Adjust the read and write path of 
archive
URL: https://github.com/apache/incubator-hudi/pull/1242#issuecomment-586394357
 
 
   @hddong please take a look at the last comment




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1242: [HUDI-544] Adjust the read and write path of archive

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1242: [HUDI-544] Adjust the read 
and write path of archive
URL: https://github.com/apache/incubator-hudi/pull/1242#discussion_r379560656
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java
 ##
 @@ -138,9 +139,11 @@ public String showCommits(
   throws IOException {
 
 System.out.println("===> Showing only " + limit + " archived 
commits <===");
-String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+String basePath = metaClient.getBasePath();
+Path archivePath = new Path(metaClient.getArchivePath() + 
"/.commits_.archive*");
 FileStatus[] fsStatuses =
-FSUtils.getFs(basePath, HoodieCLI.conf).globStatus(new Path(basePath + 
"/.hoodie/.commits_.archive*"));
+FSUtils.getFs(basePath, HoodieCLI.conf).globStatus(archivePath);
 
 Review comment:
   @hmatu I agree. I think this PR is just doing some basic cleanup, so we should just rename this PR and merge it, and fix the adjusting of the read/write path in a different one.
   @hddong please rename this PR to "archived commits command code cleanup" and we can merge it.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379559714
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
   //read the avro blocks
   while (reader.hasNext()) {
 HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-// TODO If we can store additional metadata in datablock, we can 
skip parsing records
-// (such as startTime, endTime of records in the block)
+if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   I added a comment above which I think addresses this as well




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379559545
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
 ##
 @@ -121,7 +121,7 @@ public long getLogBlockLength() {
* new enums at the end.
*/
   public enum HeaderMetadataType {
-INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, 
MIN_INSTANT_TIME, MAX_INSTANT_TIME
 
 Review comment:
   It's not about them being just generic; think of it this way - we are going to start dumping MetadataTypes into that one class. Now, with different log blocks written in different parts of the code (archival, actual data, maybe indexes, maybe consolidated metadata), we will have no idea what header type is used where without looking through the entire code base, so having some abstraction here will be very helpful. I agree with your point about ordinals in nested enums; let's test it out, and if this doesn't work, we should find another abstraction. Let me know if that makes sense.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r379558732
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List 
instants) throws Exception {
+if (!instants.isEmpty()) {
+  Collections.sort(instants, HoodieInstant.COMPARATOR);
+  HoodieInstant minInstant = instants.get(0);
+  HoodieInstant maxInstant = instants.get(instants.size() - 1);
+  Map metadataMap = Maps.newHashMap();
+  metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+  metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, 
minInstant.getTimestamp());
+  metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, 
maxInstant.getTimestamp());
+  this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), 
metadataMap));
+}
+  }
+
   private void writeToFile(Schema wrapperSchema, List records) 
throws Exception {
 
 Review comment:
   You are right that the file is closed after archiving all instants that qualify in that archiving process. But the next time an archival kicks in, it will check whether the archival file has grown to a certain size (say 1 GB); if not, it will append the next archival blocks to the same file.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1332: [HUDI -409] Match header 
and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r379557708
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -362,13 +370,18 @@ public HoodieLogBlock prev() throws IOException {
 // blocksize should read everything about a block including the length as 
well
 try {
   inputStream.seek(reverseLogFilePosition - blockSize);
+  // get the block size from head and match it with the block size from 
tail
 
 Review comment:
   This code is pretty complicated to understand. I see that you removed hasNext() and added some code around this, plus handling of corrupt blocks here. Can this be simplified with hasNext() and checkCorruptBlock() method-level abstractions? We need more logging as well to explain such checks.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-14 Thread GitBox
n3nash commented on a change in pull request #1332: [HUDI -409] Match header 
and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r379556420
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -239,6 +239,15 @@ private boolean isBlockCorrupt(int blocksize) throws 
IOException {
   return true;
 }
 
+// check if the blocksize mentioned in the footer is the same as the 
header; by seeking back the length of a long
+// the backward seek does not incur additional IO as {@link 
org.apache.hadoop.hdfs.DFSInputStream#seek()}
+// only moves the index. actual IO happens on the next read operation
+inputStream.seek(inputStream.getPos() - Long.BYTES);
 
 Review comment:
   Also, we need some kind of abstraction here; it feels like the reading of the block size based on long bytes is leaking here and will open us up to other issues when trying to evolve the code. For now, can you move this to a method and add a test case just for this? Make that method @VisibleForTesting so it's available to be called in the test class - this way, if the footer is evolved, we have a test to catch issues.




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586335056
 
 
   The `KafkaUtils#fixKafkaParams` method is in `spark-streaming-kafka-0-10_2.11-2.4.4`, and KafkaUtils is marked experimental in spark-streaming-kafka 2.4.4. Can you try using a new consumer group and setting auto.offset.reset=earliest?
   
   
![image](https://user-images.githubusercontent.com/20113411/74543766-f00ccb80-4f80-11ea-8ace-6fa08b69069e.png)
   
   




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586335056
 
 
   The `KafkaUtils#fixKafkaParams` method is in `spark-streaming-kafka-0-10_2.11-2.4.4`, and KafkaUtils is marked experimental in spark-streaming-kafka 2.4.4.
   
   
![image](https://user-images.githubusercontent.com/20113411/74543766-f00ccb80-4f80-11ea-8ace-6fa08b69069e.png)
   
   




[GitHub] [incubator-hudi] lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset 
reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586335056
 
 
   The `KafkaUtils#fixKafkaParams` method is in `spark-streaming-kafka-0-10_2.11-2.4.4`, and KafkaUtils is marked experimental in spark-streaming-kafka 2.4.4. Can you try using a new consumer group?
   
   
![image](https://user-images.githubusercontent.com/20113411/74543766-f00ccb80-4f80-11ea-8ace-6fa08b69069e.png)
   
   




[GitHub] [incubator-hudi] amitsingh-10 edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
amitsingh-10 edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer 
Kafka offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586330008
 
 
   The Kafka version is `2.1.0-cp2`. Also, is there any PR/issue where I can better understand why `KafkaUtils#fixKafkaParams` is being called in `0.5.1-incubating` even when I am providing all the necessary parameters?




[GitHub] [incubator-hudi] amitsingh-10 commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
amitsingh-10 commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586330008
 
 
   The Kafka version is `2.1.0-cp2`. Also, is there any PR/issue where I can better understand why fixKafkaParams is being called in `0.5.1-incubating` even when I am providing all the necessary parameters?




[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-14 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-586284798
 
 
   @vinothchandar I've managed to reproduce with a simple spark.parallelize() 
example.
   
   test.scala
   ```
   import org.apache.spark._
   import org.apache.spark.sql._
   import org.apache.spark.sql.functions.{month, year, col, dayofmonth}
   import org.apache.spark.storage.StorageLevel
   import org.apache.spark.streaming.{Milliseconds, StreamingContext}
   import org.apache.spark.streaming.Duration
   import org.apache.spark.streaming.kinesis._
   import org.apache.spark.streaming.kinesis.KinesisInputDStream
   import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.DataFrame
   
   case class Bar(id: Int, name: String)
   
   // choose one of the following
   case class Foo(id: Int, bar: Bar) // with simple
   // case class Foo(bar: Bar) // withOUT simple
   
   case class Root(id: Int, foos: Array[Foo])
   
   object HudiScalaStreamHelloWorld {
   
   def main(args: Array[String]): Unit = {  
   val appName = "ScalaStreamExample"
   val batchInterval = Milliseconds(2000)
   val spark = SparkSession
   .builder()
   .appName(appName)
   .getOrCreate()
   
   val sparkContext = spark.sparkContext
   val streamingContext = new StreamingContext(sparkContext, 
batchInterval)
   
   import spark.implicits._
   val sc = sparkContext
   
   
   // choose one of the following
    val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(1, Bar(1, "OneBar")))))).toDF() // with simple
    // val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(Bar(1, "OneBar")))))).toDF() // withOUT simple
   
   dataFrame.printSchema()
   
   val hudiTableName = "order"
   val hudiTablePath = "s3://xxx/path/" + hudiTableName
   
   // Set up our Hudi Data Source Options
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY ->
   DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   }
   }
   
   ```
   
   deploy.sh
   ```
   sbt clean
   sbt package
   
   aws s3 cp ./target/scala-2.11/simple-project_2.11-1.0.jar 
s3://./simple-project_2.11-1.0.jar
   
   aws emr add-steps --cluster-id j-AZQBZK81NAFT --steps 
Type=spark,Name=SimpleHudiTest,Args=[\
   --deploy-mode,cluster,\
   --master,yarn,\
   
--packages,\'org.apache.hudi:hudi-spark-bundle:0.5.0-incubating,org.apache.spark:spark-avro_2.11:2.4.4\',\
   --conf,spark.yarn.submit.waitAppCompletion=false,\
   --conf,yarn.log-aggregation-enable=true,\
   --conf,spark.dynamicAllocation.enabled=true,\
   --conf,spark.cores.max=4,\
   --conf,spark.network.timeout=300,\
   --conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
   --conf,spark.sql.hive.convertMetastoreParquet=false,\
   --class,HudiScalaStreamHelloWorld,\
   s3://.xxx/simple-project_2.11-1.0.jar\
   ],ActionOnFailure=CONTINUE
   ```
   
   build.sbt
   ```
   name := "Simple Project"
   
   version := "1.0"
   
   scalaVersion := "2.11.12"
   
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"2.4.4"
   libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle" % 
"0.5.0-incubating"
   
   
   scalacOptions := Seq("-unchecked", "-deprecation")
   
   ```
   
   AWS glue job runs over the output s3 directory. 
   
   From the Presto EMR instance, the result when the simple object is included on the array item:
   ```
   presto:schema> select * from default;
_hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | 
_hoodie_partition_path |  _hoodie_file_name 
 | id |   foos
   
-+--+++-++---
20200214112552  | 20200214112552_0_1   | 1  | default   
 | 
90190f46-d064-4c8b-ab6a-89ecc9b3ced4-0_0-5-8_20200214112

[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-14 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-586284798
 
 
   @vinothchandar I've managed to reproduce with a simple spark.parallelize() 
example.
   
   test.scala
   ```
   import org.apache.spark._
   import org.apache.spark.sql._
   import org.apache.spark.sql.functions.{month, year, col, dayofmonth}
   import org.apache.spark.storage.StorageLevel
   import org.apache.spark.streaming.{Milliseconds, StreamingContext}
   import org.apache.spark.streaming.Duration
   import org.apache.spark.streaming.kinesis._
   import org.apache.spark.streaming.kinesis.KinesisInputDStream
   import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.DataFrame
   
   
   case class Bar(id: Int, name: String)
   
   // Uncomment following section based on example
   
   // START - Simple object included in array item
   case class Foo(id: Int, bar: Bar) // foo with simple object
   // END - Simple object included in array item
   
   // START - Simple object not present in array item
   // missing the id: Int property
   // case class Foo(bar: Bar) // foo without simple object
   // END - Simple object not present in array item
   
   
   case class Root(id: Int, foos: Array[Foo])
   
   object HudiScalaStreamHelloWorld {
   
   def main(args: Array[String]): Unit = {  
   val appName = "ScalaStreamExample"
   val batchInterval = Milliseconds(2000)
   val spark = SparkSession
   .builder()
   .appName(appName)
   .getOrCreate()
   
   val sparkContext = spark.sparkContext
   val streamingContext = new StreamingContext(sparkContext, 
batchInterval)
   
   import spark.implicits._
   val sc = sparkContext
   
   case class Bar(id: Int, name: String)
   
   // Uncomment following section based on example
   
   // START - Simple object included in array item
   // with simple item on foo in array
    // val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(1, Bar(1, "OneBar")))))).toDF()
   // END - Simple object included in array item
   
   // START - Simple object not present in array item
   // without simple item on foo in array
    val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(Bar(1, "OneBar")))))).toDF()
   // END - Simple object not present in array item
   
   dataFrame.printSchema()
   
   val hudiTableName = "order"
   val hudiTablePath = "s3://xxx-/path/" + hudiTableName
   
   // Set up our Hudi Data Source Options
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY ->
   DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   }
   }
   ```
   
   deploy.sh
   ```
   sbt clean
   sbt package
   
   aws s3 cp ./target/scala-2.11/simple-project_2.11-1.0.jar 
s3://./simple-project_2.11-1.0.jar
   
   aws emr add-steps --cluster-id j-AZQBZK81NAFT --steps 
Type=spark,Name=SimpleHudiTest,Args=[\
   --deploy-mode,cluster,\
   --master,yarn,\
   
--packages,\'org.apache.hudi:hudi-spark-bundle:0.5.0-incubating,org.apache.spark:spark-avro_2.11:2.4.4\',\
   --conf,spark.yarn.submit.waitAppCompletion=false,\
   --conf,yarn.log-aggregation-enable=true,\
   --conf,spark.dynamicAllocation.enabled=true,\
   --conf,spark.cores.max=4,\
   --conf,spark.network.timeout=300,\
   --conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
   --conf,spark.sql.hive.convertMetastoreParquet=false,\
   --class,HudiScalaStreamHelloWorld,\
   s3://.xxx/simple-project_2.11-1.0.jar\
   ],ActionOnFailure=CONTINUE
   ```
   
   build.sbt
   ```
   name := "Simple Project"
   
   version := "1.0"
   
   scalaVersion := "2.11.12"
   
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"2.4.4"
   libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle" % 
"0.5.0-incubating"
   
   
   scalacOptions := Seq("-unchecked", "-deprecation")
   
   ```
   
   AWS glue job runs over the output s3 directory. 
   
   From the presto EMR instance the result when simple objec

[GitHub] [incubator-hudi] lamber-ken removed a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken removed a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer 
Kafka offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586290053
 
 
   Because using `0.5.0-incubating` can't solve your problem quickly, we need to go back to the `0.5.1-incubating` version and then try to find a valid way to solve it. :)




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586285604
 
 
   That PR is used to check whether the checkpoint offsets are valid or not. Too bad that the `0.5.0-incubating` version didn't work. BTW, what's your Kafka server version?
   
   If using `0.5.0-incubating` can't solve your problem quickly, we may need to go back to the `0.5.1-incubating` version and then try to find a valid way to solve it. :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586290053
 
 
   Because `0.5.0-incubating` can't solve your problem quickly, we need to go back to the `0.5.1-incubating` version and then try to find a valid way to solve it. : )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken edited a comment on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586285604
 
 
   That PR is used to check whether the checkpoint offsets are valid or not. Sorry to see that the `0.5.0-incubating` version didn't work. BTW, what's your Kafka server version?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset 
reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586290053
 
 
   We need to go back to the `0.5.1-incubating` version and then try to find a valid way to solve it. : )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-14 Thread GitBox
leesf commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379428759
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -84,55 +102,53 @@ using the hive session property for incremental queries: 
`set hive.fetch.task.co
 would ensure Map Reduce execution is chosen for a Hive query, which combines 
partitions (comma
 separated) and calls InputFormat.listStatus() only once with all those 
partitions.
 
-## Spark
+## Spark datasource
 
 Review comment:
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill edited a comment on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-14 Thread GitBox
adamjoneill edited a comment on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-586284798
 
 
   @vinothchandar I've managed to reproduce with a simple spark.parallelize() 
example.
   
   test.scala
   ```
   import org.apache.spark._
   import org.apache.spark.sql._
   import org.apache.spark.sql.functions.{month, year, col, dayofmonth}
   import org.apache.spark.storage.StorageLevel
   import org.apache.spark.streaming.{Milliseconds, StreamingContext}
   import org.apache.spark.streaming.Duration
   import org.apache.spark.streaming.kinesis._
   import org.apache.spark.streaming.kinesis.KinesisInputDStream
   import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.DataFrame
   
   object HudiScalaStreamHelloWorld {
   
   def main(args: Array[String]): Unit = {  
   val appName = "ScalaStreamExample"
   val batchInterval = Milliseconds(2000)
   val spark = SparkSession
   .builder()
   .appName(appName)
   .getOrCreate()
   
   val sparkContext = spark.sparkContext
   val streamingContext = new StreamingContext(sparkContext, 
batchInterval)
   
   import spark.implicits._
   val sc = sparkContext
   
   case class Bar(id: Int, name: String)
   
   // Uncomment following section based on example
   
   // START - Simple object included in array item
   // case class Foo(id: Int, bar: Bar) // foo with simple object
   // END - Simple object included in array item
   
   // START - Simple object not present in array item
   // missing the id: Int property
   case class Foo(bar: Bar) // foo without simple object
   // END - Simple object not present in array item
   
   
   case class Root(id: Int, foos: Array[Foo])
   
   // Uncomment following section based on example
   
   // START - Simple object included in array item
   // with simple item on foo in array
   // val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(1, Bar(1, "OneBar")))))).toDF()
   // END - Simple object included in array item
   
   // START - Simple object not present in array item
   // without simple item on foo in array
   val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(Bar(1, "OneBar")))))).toDF()
   // END - Simple object not present in array item
   
   dataFrame.printSchema()
   
   val hudiTableName = "order"
   val hudiTablePath = "s3://xxx-/path/" + hudiTableName
   
   // Set up our Hudi Data Source Options
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY ->
   DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   }
   }
   ```
   
   deploy.sh
   ```
   sbt clean
   sbt package
   
   aws s3 cp ./target/scala-2.11/simple-project_2.11-1.0.jar 
s3://./simple-project_2.11-1.0.jar
   
   aws emr add-steps --cluster-id j-AZQBZK81NAFT --steps 
Type=spark,Name=SimpleHudiTest,Args=[\
   --deploy-mode,cluster,\
   --master,yarn,\
   
--packages,\'org.apache.hudi:hudi-spark-bundle:0.5.0-incubating,org.apache.spark:spark-avro_2.11:2.4.4\',\
   --conf,spark.yarn.submit.waitAppCompletion=false,\
   --conf,yarn.log-aggregation-enable=true,\
   --conf,spark.dynamicAllocation.enabled=true,\
   --conf,spark.cores.max=4,\
   --conf,spark.network.timeout=300,\
   --conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
   --conf,spark.sql.hive.convertMetastoreParquet=false,\
   --class,HudiScalaStreamHelloWorld,\
   s3://.xxx/simple-project_2.11-1.0.jar\
   ],ActionOnFailure=CONTINUE
   ```
   
   build.sbt
   ```
   name := "Simple Project"
   
   version := "1.0"
   
   scalaVersion := "2.11.12"
   
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"2.4.4"
   libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle" % 
"0.5.0-incubating"
   
   
   scalacOptions := Seq("-unchecked", "-deprecation")
   
   ```
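   
   As a sanity check (not part of the original repro), the written files could be read back with plain Spark to see how the nested `foos` array was encoded. This is only a sketch: it assumes the write above succeeded, reuses `hudiTablePath` from the script, and the path glob may need adjusting depending on how the table is partitioned.
   ```
   // Hypothetical verification step, assuming the same SparkSession is in scope.
   import org.apache.spark.sql.functions.{col, explode}
   
   val written = spark.read.format("org.apache.hudi").load(hudiTablePath + "/*")
   written.printSchema() // shows how Foo(bar: Bar) was written inside the array
   
   written
     .select(explode(col("foos")).as("foo")) // one row per array element
     .select(col("foo.bar.id"), col("foo.bar.name"))
     .show(false)
   ```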
   
   An AWS Glue job runs over the output S3 directory. 
   
   From the p

[GitHub] [incubator-hudi] lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset 
reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586285604
 
 
   That PR is used to check whether the checkpoint offsets are valid or not. Sorry to see that the `0.5.0-incubating` version didn't work. BTW, what's your Kafka server version?
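   
   For context, here is a rough, illustrative sketch of the kind of check described above, written against the plain Kafka consumer API. It is not the actual Hudi/PR code; all names are placeholders.
   ```
   // Sketch only: verify that saved checkpoint offsets still lie inside the
   // range Kafka currently retains for each partition.
   import java.util.Properties
   import scala.collection.JavaConverters._
   import org.apache.kafka.clients.consumer.KafkaConsumer
   import org.apache.kafka.common.TopicPartition
   
   def checkpointOffsetsValid(bootstrapServers: String,
                              checkpoint: Map[TopicPartition, Long]): Boolean = {
     val props = new Properties()
     props.put("bootstrap.servers", bootstrapServers)
     props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
     props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
     val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
     try {
       val partitions = checkpoint.keys.toList.asJava
       val earliest = consumer.beginningOffsets(partitions).asScala
       val latest = consumer.endOffsets(partitions).asScala
       // an offset is valid if it has not aged out and does not run past the head
       checkpoint.forall { case (tp, offset) =>
         earliest(tp).longValue() <= offset && offset <= latest(tp).longValue()
       }
     } finally {
       consumer.close()
     }
   }
   ```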


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-14 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-586284798
 
 
   @vinothchandar I've managed to reproduce with a simple spark.parallelize() 
example.
   
   test.scala
   ```
   import org.apache.spark._
   import org.apache.spark.sql._
   import org.apache.spark.sql.functions.{month, year, col, dayofmonth}
   import org.apache.spark.storage.StorageLevel
   import org.apache.spark.streaming.{Milliseconds, StreamingContext}
   import org.apache.spark.streaming.Duration
   import org.apache.spark.streaming.kinesis._
   import org.apache.spark.streaming.kinesis.KinesisInputDStream
   import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.DataFrame
   
   object HudiScalaStreamHelloWorld {
   
   def main(args: Array[String]): Unit = {  
   val appName = "ScalaStreamExample"
   val batchInterval = Milliseconds(2000)
   val spark = SparkSession
   .builder()
   .appName(appName)
   .getOrCreate()
   
   val sparkContext = spark.sparkContext
   val streamingContext = new StreamingContext(sparkContext, 
batchInterval)
   
   import spark.implicits._
   val sc = sparkContext
   
   case class Bar(id: Int, name: String)
   
   // Uncomment following section based on example
   
   // START - Simple object included in array item
   // case class Foo(id: Int, bar: Bar) // foo with simple object
   // END - Simple object included in array item
   
   // START - Simple object not present in array item
   // missing the id: Int property
   case class Foo(bar: Bar) // foo without simple object
   // END - Simple object not present in array item
   
   
   case class Root(id: Int, foos: Array[Foo])
   
   // Uncomment following section based on example
   
   // START - Simple object included in array item
   // with simple item on foo in array
   // val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(1, Bar(1, "OneBar")))))).toDF()
   // END - Simple object included in array item
   
   // START - Simple object not present in array item
   // without simple item on foo in array
   val dataFrame = sc.parallelize(Seq(Root(1, Array(Foo(Bar(1, "OneBar")))))).toDF()
   // END - Simple object not present in array item
   
   dataFrame.printSchema()
   
   val hudiTableName = "order"
   val hudiTablePath = "s3://xxx-/path/" + hudiTableName
   
   // Set up our Hudi Data Source Options
   val hudiOptions = Map[String,String](
   DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
   HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
   DataSourceWriteOptions.OPERATION_OPT_KEY ->
   DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
   DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "id")
   
   
dataFrame.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
   }
   }
   ```
   
   deploy.sh
   ```
   sbt clean
   sbt package
   
   aws s3 cp ./target/scala-2.11/simple-project_2.11-1.0.jar 
s3://./simple-project_2.11-1.0.jar
   
   aws emr add-steps --cluster-id j-AZQBZK81NAFT --steps 
Type=spark,Name=SimpleHudiTest,Args=[\
   --deploy-mode,cluster,\
   --master,yarn,\
   
--packages,\'org.apache.hudi:hudi-spark-bundle:0.5.0-incubating,org.apache.spark:spark-avro_2.11:2.4.4\',\
   --conf,spark.yarn.submit.waitAppCompletion=false,\
   --conf,yarn.log-aggregation-enable=true,\
   --conf,spark.dynamicAllocation.enabled=true,\
   --conf,spark.cores.max=4,\
   --conf,spark.network.timeout=300,\
   --conf,spark.serializer=org.apache.spark.serializer.KryoSerializer,\
   --conf,spark.sql.hive.convertMetastoreParquet=false,\
   --class,HudiScalaStreamHelloWorld,\
   s3://.xxx/simple-project_2.11-1.0.jar\
   ],ActionOnFailure=CONTINUE
   ```
   
   build.sbt
   ```
   name := "Simple Project"
   
   version := "1.0"
   
   scalaVersion := "2.11.12"
   
   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4"
   libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % 
"2.4.4"
   libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle" % 
"0.5.0-incubating"
   
   
   scalacOptions := Seq("-unchecked", "-deprecation")
   
   ```
   
   An AWS Glue job runs over the output S3 directory. 
   
   From the presto E

[jira] [Created] (HUDI-613) Refactor and enhance the Transformer component

2020-02-14 Thread vinoyang (Jira)
vinoyang created HUDI-613:
-

 Summary: Refactor and enhance the Transformer component
 Key: HUDI-613
 URL: https://issues.apache.org/jira/browse/HUDI-613
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: vinoyang


Currently, Hudi has a component that has not been widely used: the Transformer. Before raw data lands in the data lake, a very common operation is data preprocessing and ETL, which is also one of the most common use cases for computing engines such as Flink and Spark. Since Hudi already builds on the power of a computing engine, it can naturally take advantage of that engine's data preprocessing capabilities as well. We can refactor the Transformer to make it more flexible, along the following aspects:

* Decouple Transformer from Spark
* Enrich the Transformer and provide built-in transformers
* Support Transformer-chain

For the first point, the Transformer interface is tightly coupled with Spark in 
design, and it contains a Spark-specific context. This makes it impossible for 
us to take advantage of the transform capabilities provided by other engines 
(such as Flink) after supporting multiple engines. Therefore, we need to 
decouple it from Spark in design.

For the second point, we can enhance the Transformer and provide some out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer, and so on.

For the third point, the most common pattern for data processing is the pipeline model, and a common implementation of the pipeline model is the chain-of-responsibility pattern, comparable to Apache Commons Chain[1]. Combining multiple Transformers would make data processing more flexible and extensible.

If we enhance the capabilities of Transformer components, Hudi will provide 
richer data processing capabilities based on the computing engine.
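
To make the first and third points concrete, the following is a minimal illustrative sketch, not a proposed API; all names here (Transformer, TransformerChain, FilterTransformer) are placeholders. It shows an engine-agnostic transformer plus a transformer chain, with a Spark-specific binding as one example.

```
// Illustrative sketch only, not the actual Hudi Transformer interface.
// An engine-agnostic transformer works on an abstract dataset handle D
// instead of being tied to a Spark Dataset[Row] and Spark context.
trait Transformer[D] {
  def apply(input: D): D
}

// Chain-of-responsibility style composition: apply transformers in order.
final class TransformerChain[D](transformers: Seq[Transformer[D]]) extends Transformer[D] {
  override def apply(input: D): D =
    transformers.foldLeft(input)((data, t) => t.apply(data))
}

// One possible built-in transformer for a Spark binding, where D = DataFrame.
import org.apache.spark.sql.DataFrame

final class FilterTransformer(condition: String) extends Transformer[DataFrame] {
  override def apply(input: DataFrame): DataFrame = input.filter(condition)
}

// Usage: new TransformerChain(Seq(new FilterTransformer("id > 0"))).apply(df)
```

Keeping the trait free of Spark types is what would allow the same chain definition to be reused once other engines (e.g. Flink) are supported.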

The relevant discussion thread is here: 
https://lists.apache.org/thread.html/rfad2e71fc432922ca567432b7b6e1dd9c3bb102822177b73dbff2d90%40%3Cdev.hudi.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf commented on a change in pull request #1333: [HUDI-589][DOCS] Fix querying_data page

2020-02-14 Thread GitBox
leesf commented on a change in pull request #1333: [HUDI-589][DOCS] Fix 
querying_data page
URL: https://github.com/apache/incubator-hudi/pull/1333#discussion_r379418568
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -9,7 +9,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 
 Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
 Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark datasource, Spark SQL and Presto.
 
 Review comment:
   should we also mention Impala?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] amitsingh-10 commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
amitsingh-10 commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka 
offset reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586213714
 
 
   Thanks for the reply @lamber-ken. I tried working with `0.5.0-incubating`; however, I am getting the following error now:
   ```
   User class threw exception: org.apache.spark.SparkException: Offsets not 
available on leader
   ```
   
   The stacktrace for the same is:
   ```
   User class threw exception: org.apache.spark.SparkException: Offsets not 
available on leader: OffsetRange(topic: 'test-topic', partition: 0, range: [0 
-> 6667]),OffsetRange(topic: 'test-topic', partition: 1, range: [0 -> 
6667]),OffsetRange(topic: 'test-topic', partition: 2, range: [0 -> ])
   org.apache.spark.SparkException: Offsets not available on leader: 
OffsetRange(topic: 'test-topic', partition: 0, range: [0 -> 
6667]),OffsetRange(topic: 'test-topic', partition: 1, range: [0 -> 
6667]),OffsetRange(topic: 'test-topic', partition: 2, range: [0 -> ])
at 
org.apache.spark.streaming.kafka.KafkaUtils$.org$apache$spark$streaming$kafka$KafkaUtils$$checkOffsets(KafkaUtils.scala:200)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$1.apply(KafkaUtils.scala:253)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$1.apply(KafkaUtils.scala:249)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createRDD(KafkaUtils.scala:249)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$3.apply(KafkaUtils.scala:338)
at 
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$3.apply(KafkaUtils.scala:333)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createRDD(KafkaUtils.scala:333)
at 
org.apache.spark.streaming.kafka.KafkaUtils.createRDD(KafkaUtils.scala)
at 
org.apache.hudi.utilities.sources.AvroKafkaSource.toRDD(AvroKafkaSource.java:67)
at 
org.apache.hudi.utilities.sources.AvroKafkaSource.fetchNewData(AvroKafkaSource.java:61)
at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:71)
at 
org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInAvroFormat(SourceFormatAdapter.java:61)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:292)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
   ```
   
   This was fixed by the following [pull request](https://github.com/apache/incubator-hudi/pull/650) and is available in the `0.5.0-incubating` version of Hudi. Correct me if I am wrong. I have changed `auto.offset.reset=smallest` in the properties, if this is relevant.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset reset config not being read

2020-02-14 Thread GitBox
lamber-ken commented on issue #1335: [SUPPORT] HoodieDeltaStreamer Kafka offset 
reset config not being read
URL: https://github.com/apache/incubator-hudi/issues/1335#issuecomment-586170746
 
 
   hi @amitsingh-10, `0.5.1-incubating` is built with spark-2.4.4, so as you reported, `KafkaUtils#fixKafkaParams` will override these properties.
   
   IMO, you can use `0.5.0-incubating`, which is built with spark-2.1.0; its `KafkaUtils` does not override these properties.
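   
   For reference, the executor-side overrides show up as the WARN lines quoted in this issue; roughly, the Spark 2.4 `kafka010` helper does something like the following. This is a paraphrased sketch based on those log lines, not the actual Spark source, and the function name is only illustrative.
   ```
   // Paraphrased sketch of the behaviour reported in the WARN logs above;
   // executors get a "locked down" copy of the Kafka params.
   import java.{util => ju}
   
   def fixKafkaParamsSketch(kafkaParams: ju.HashMap[String, Object], groupId: String): Unit = {
     kafkaParams.put("enable.auto.commit", java.lang.Boolean.FALSE)   // no auto-commit on executors
     kafkaParams.put("auto.offset.reset", "none")                     // clobbers a user-set "earliest"
     kafkaParams.put("group.id", "spark-executor-" + groupId)         // executor-specific group id
     kafkaParams.put("receive.buffer.bytes", Integer.valueOf(65536))  // see KAFKA-3135
   }
   ```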


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] jinshuangxian commented on issue #954: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: table not found

2020-02-14 Thread GitBox
jinshuangxian commented on issue #954:  
org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: 
 table not found
URL: https://github.com/apache/incubator-hudi/issues/954#issuecomment-586161952
 
 
   > @gfn9cho you are right, Glue Catalog does not support Primary Key. It's not actually a problem with the Glue service, but rather EMR's glue client implementation, which returns a null because primary key is not supported. Hive is not able to deal with it correctly.
   > 
   > At this point, we cannot just give you something that would make it work. Please note, at this point Hudi is not an officially supported application on EMR. It should be supported by mid/end of November, which is when this issue will be fixed on EMR's side as well.
   > 
   > If you cannot wait until then, here is one way to unblock yourself:
   > 
   > * Check out the glue catalog client package, which is open sourced, and modify this line to return an empty list instead: https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/master/aws-glue-datacatalog-hive2-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java#L1630
   > * SSH to the EMR master and replace the jar under `/usr/share/aws/hmclient/lib/`
   > * Restart hive-server2 to pick up the new library: `sudo stop hive-server2` `sudo start hive-server2`
   
   Hi, is there a new version of 
aws-glue-data-catalog-client-for-apache-hive-metastore?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-02-14 Thread GitBox
pratyakshsharma commented on issue #1165: [HUDI-76] Add CSV Source support for 
Hudi Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-586154037
 
 
   @vinothchandar will review it by EOD today. :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] amitsingh-10 opened a new issue #1335: [SUPPORT] HoodieDeltaStreamer offset reset not working

2020-02-14 Thread GitBox
amitsingh-10 opened a new issue #1335: [SUPPORT] HoodieDeltaStreamer offset 
reset not working
URL: https://github.com/apache/incubator-hudi/issues/1335
 
 
   **Describe the problem you faced**
   I am trying to create a Hoodie table using `HoodieDeltaStreamer` from a Kafka Avro topic. Setting `auto.offset.reset` to `earliest` should read from the first events available in the Kafka topic; however, it is being overridden by `org.apache.spark.streaming.kafka010.KafkaUtils`.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Use an EMR emr-5.23.0 cluster.
   2. Create a custom properties file.
   ```
   # General properties
   hoodie.datasource.write.table.type=MERGE_ON_READ
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.partitionpath.field=timestamp
   
hoodie.datasource.write.keygenerator.class=org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator
   hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
   hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
   
   # Kafka properties
   enable.auto.commit=false
   auto.offset.reset=earliest
   group.id=hudi_test_group
   
schema.registry.url=http://confluent*:8081/subjects/test_table/versions/latest
   bootstrap.servers=confluent.:9092
   hoodie.deltastreamer.source.kafka.topic=test-topic
   
hoodie.deltastreamer.schemaprovider.registry.url=http://confluent***:8081/subjects/test_table/versions/latest
   hoodie.deltastreamer.kafka.source.maxEvents=1
   
   # Hive metastore properties
   hoodie.datasource.hive_sync.database=hudi_test
   hoodie.datasource.hive_sync.table=test_table
   hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver.***:1
   hoodie.datasource.hive_sync.assume_date_partitioning=true
   
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor
   ```
   3. Run `spark-submit` with `HoodieDeltaStreamer`.
   ```
   spark-submit  --queue ingestion --deploy-mode cluster --master yarn \
   --conf 'spark.jars=/home/hadoop/hudi*.jar'  \
   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls 
hudi-utilities*.jar` \ 
   --op UPSERT --source-class org.apache.hudi.utilities.sources.AvroKafkaSource 
\
   --table-type MERGE_ON_READ --target-base-path s3:\/\/hudi-bucket/test_table \
   --target-table hudi_test.test_table --source-limit 200 \ 
   --enable-hive-sync \
   --schemaprovider-class 
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --props s3:\/\/hudi-bucket/hudi_custom.properties
   ```
   
   **Expected behavior**
   The data should be read from the earliest/latest available offset in Kafka 
as per the config.
   
   **Environment Description**
   
   * EMR version : emr-5.23.0
   
   * Hudi version : 0.5.1-incubating
   
   * Spark version : 2.4.0
   
   * Hive version : 2.3.4
   
   * Hadoop version : 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   The issue about Spark Kafka overriding `auto.offset.reset` is being tracked 
by Kafka JIRA issue 
([KAFKA-4396](https://issues.apache.org/jira/browse/KAFKA-4396)) and Spark 
issue ([SPARK-19680](https://issues.apache.org/jira/browse/SPARK-19680)).
   
   From what I can see in the logs, the provided config is being read correctly.
   ```
   auto.commit.interval.ms = 5000
   auto.offset.reset = earliest
   bootstrap.servers = [confluent.*:9092]
   connections.max.idle.ms = 54
   default.api.timeout.ms = 6
   enable.auto.commit = false
   ...
   ```
   
   However, `fixKafkaParams` is resetting the `auto.offset.reset` property.
   ```
   2020-02-14 07:16:18,664 WARN [Driver] 
org.apache.spark.streaming.kafka010.KafkaUtils:overriding enable.auto.commit to 
false for executor
   2020-02-14 07:16:18,665 WARN [Driver] 
org.apache.spark.streaming.kafka010.KafkaUtils:overriding auto.offset.reset to 
none for executor
   2020-02-14 07:16:18,666 WARN [Driver] 
org.apache.spark.streaming.kafka010.KafkaUtils:overriding executor group.id to 
spark-executor-hudi_test_group
   2020-02-14 07:16:18,666 WARN [Driver] 
org.apache.spark.streaming.kafka010.KafkaUtils:overriding receive.buffer.bytes 
to 65536 see KAFKA-3135
   ```
   
   
   **Stacktrace**
   
   ```Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2039)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2027)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2026)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.