[GitHub] [incubator-hudi] yanghua commented on issue #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
yanghua commented on issue #1440: [HUDI-731] Add ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#issuecomment-603045516
 
 
@xushiyan 's explanation is correct. It's just a trade-off. Both approaches can 
achieve the same functionality, but the mechanism of collecting 
`Transformer`s is different. Considering that we have limited this 
`Transformer` to chaining (`ChainedTransformer`), maybe we can hide the 
"collect/assemble" logic from users? From this perspective, perhaps 
@vinothchandar 's suggestion is a little better for users; a sketch of that 
alternative follows. 
   
   After all, if users want more flexibility, they can even implement the 
`Transformer` interface directly and build everything this class currently 
provides, right?
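   
   For illustration, here is a minimal, hypothetical sketch of the CLI-driven alternative being discussed: transformer class names arrive as a comma-separated `--transformer-class` value, and reflection assembles the chain so the "collect/assemble" logic stays hidden from users. This is not the code in this PR, just a sketch.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   import org.apache.hudi.utilities.transform.Transformer;
   
   public class TransformerChainLoader {
   
     // Hypothetical helper: build the chain from a comma-separated list of
     // class names, so users never write the assembly logic themselves.
     public static List<Transformer> loadTransformers(String classNamesCsv) throws Exception {
       List<Transformer> chain = new ArrayList<>();
       for (String className : classNamesCsv.split(",")) {
         chain.add((Transformer) Class.forName(className.trim())
             .getDeclaredConstructor().newInstance());
       }
       return chain;
     }
   }
   ```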


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-732) Generate site to content folder

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-732:

Status: Open  (was: New)

> Generate site to content folder
> ---
>
> Key: HUDI-732
> URL: https://issues.apache.org/jira/browse/HUDI-732
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Remove test-content && Generate site to content



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-732) Generate site to content folder

2020-03-23 Thread lamber-ken (Jira)
lamber-ken created HUDI-732:
---

 Summary: Generate site to content folder
 Key: HUDI-732
 URL: https://issues.apache.org/jira/browse/HUDI-732
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Docs
Reporter: lamber-ken


Remove test-content && Generate site to content





[jira] [Assigned] (HUDI-732) Generate site to content folder

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-732:
---

Assignee: lamber-ken

> Generate site to content folder
> ---
>
> Key: HUDI-732
> URL: https://issues.apache.org/jira/browse/HUDI-732
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Remove test-content && Generate site to content





[jira] [Comment Edited] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065030#comment-17065030
 ] 

lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM:
---

right, this is a nice design, some thoughts:
 * if the input data is large, we need to increase partitions; "candidates" contains all data for each partition
 * if we increase partitions, it will cause duplicate loading of the same partition (e.g. populateFileIDs() and populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
    JavaSparkContext jsc,
    HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s",
          record.getPartitionPath(), record.getRecordKey()),
          true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
cleanup();
this.currentPartitionPath = partitionPath;
populateFileIDs();
populateRangeAndBloomFilters();
  }
}{code}


was (Author: lamber-ken):
right, this is a nice design, some thoughts:
 * if the input data is large, we need to increase partitions; "candidates" contains all partition data
 * if we increase partitions, it will cause duplicate loading of the same partition (e.g. populateFileIDs() and populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
    JavaSparkContext jsc,
    HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s",
          record.getPartitionPath(), record.getRecordKey()),
          true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
cleanup();
this.currentPartitionPath = partitionPath;
populateFileIDs();
populateRangeAndBloomFilters();
  }
}{code}

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but a better 
> out-of-the-box experience for small workloads. 





[jira] [Closed] (HUDI-504) Restructuring and auto-generation of docs

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-504.
---
Resolution: Fixed

> Restructuring and auto-generation of docs
> -
>
> Key: HUDI-504
> URL: https://issues.apache.org/jira/browse/HUDI-504
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Ethan Guo
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> RFC-10: Restructuring and auto-generation of docs
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs]





[jira] [Closed] (HUDI-646) Re-enable TestUpdateSchemaEvolution after triaging weird CI issue

2020-03-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-646.
---
Resolution: Fixed

> Re-enable TestUpdateSchemaEvolution after triaging weird CI issue
> -
>
> Key: HUDI-646
> URL: https://issues.apache.org/jira/browse/HUDI-646
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/incubator-hudi/pull/1346/commits/5b20891619380a66e2a62c9e57fb28c4f5ed948b
>  undo this
> {code}
> Job aborted due to stage failure: Task 7 in stage 1.0 failed 1 times, most 
> recent failure: Lost task 7.0 in stage 1.0 (TID 15, localhost, executor 
> driver): org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> file:/tmp/junit3406952253616234024/2016/01/31/f1-0_7-0-7_100.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readAvroRecords(ParquetUtils.java:190)
>   at 
> org.apache.hudi.client.TestUpdateSchemaEvolution.lambda$testSchemaEvolutionOnUpdate$dfb2f24e$1(TestUpdateSchemaEvolution.java:123)
>   at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsupportedOperationException: Byte-buffer read 
> unsupported by input stream
>   at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:146)
>   at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:143)
>   at 
> org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:81)
>   at 
> org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:90)
>   at 
> org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:75)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>   ... 29 more
> {code}
> Only happens on travis. Locally succeeded over 5000 times individually.. And 
> the entir

[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
codecov-io edited a comment on issue #1440: [HUDI-731] Add ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#issuecomment-602962209
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=h1) 
Report
   > Merging 
[#1440](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/cafc87041baf4055c39244e7cde0187437bb03d4&el=desc)
 will **decrease** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1440/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=tree)
   
   ```diff
    @@              Coverage Diff              @@
    ##             master    #1440      +/-   ##
    ============================================
    - Coverage     67.61%   67.60%   -0.01%
    - Complexity      254      257       +3
    ============================================
      Files           340      341       +1
      Lines         16504    16510       +6
      Branches       1689     1690       +1
    ============================================
    + Hits          11159    11162       +3
    - Misses         4606     4609       +3
      Partials        739      739
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/utilities/transform/ChainedTransformer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1440/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9DaGFpbmVkVHJhbnNmb3JtZXIuamF2YQ==)
 | `100.00% <100.00%> (ø)` | `3.00 <3.00> (?)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1440/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `40.00% <0.00%> (-60.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=footer).
 Last update 
[cafc870...c2f9c32](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   




[incubator-hudi] branch asf-site updated: [HUDI-504] Restructuring and auto-generation of docs (#1412)

2020-03-23 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 9c80aa5  [HUDI-504] Restructuring and auto-generation of docs (#1412)
9c80aa5 is described below

commit 9c80aa53747d3a26709573b8a53caa7d558d13b7
Author: lamber-ken 
AuthorDate: Mon Mar 23 23:47:49 2020 -0500

[HUDI-504] Restructuring and auto-generation of docs (#1412)

* [HUDI-504] Restructuring and auto-generation of docs

* mkdir test-content folder firstly
---
 .travis.yml | 41 +
 docs/_includes/nav_list | 21 ++---
 2 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 000..1f28d97
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,41 @@
+language: ruby
+rvm:
+  - 2.6.3
+
+env:
+  global:
+- GIT_USER="CI BOT"
+- GIT_EMAIL="ci...@hudi.apache.org"
+- GIT_REPO="apache"
+- GIT_PROJECT="incubator-hudi"
+- GIT_BRANCH="asf-site"
+- DOCS_ROOT="`pwd`/docs"
+
+before_install:
+  - if [ "$(git show -s --format=%ae)" = "${GIT_EMAIL}" ]; then echo "avoid 
recursion, ignore ..."; exit 0; fi
+  - git config --global user.name ${GIT_USER}
+  - git config --global user.email ${GIT_EMAIL}
+  - git remote add hudi https://${GIT_TOKEN}@github.com/${GIT_REPO}/${GIT_PROJECT}.git
+  - git checkout -b pr
+  - git pull --rebase hudi asf-site
+
+script:
+  - pushd ${DOCS_ROOT}
+  - gem install bundler:2.0.2
+  - bundle install
+  - bundle update --bundler
+  - bundle exec jekyll build _config.yml --source . --destination _site
+  - popd
+
+after_success:
+  - echo $TRAVIS_PULL_REQUEST
+  - 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then echo "ignore push build 
result for per submit"; exit 0; fi'
+  - 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then echo "pushing build result 
..."; fi'
+  - mkdir test-content && \cp -rf ${DOCS_ROOT}/_site/* test-content
+  - git add -A
+  - git commit -am "Travis CI build asf-site"
+  - git push hudi pr:asf-site
+
+branches:
+  only:
+- asf-site
\ No newline at end of file
diff --git a/docs/_includes/nav_list b/docs/_includes/nav_list
index 59d3f4e..50084d2 100644
--- a/docs/_includes/nav_list
+++ b/docs/_includes/nav_list
@@ -18,20 +18,19 @@
 {% assign menu_label = "文档菜单" %}
 {% endif %}
 {% elsif page.version == "0.5.1" %}
-{% assign navigation = site.data.navigation["0.5.1_docs"] %}
+{% assign navigation = site.data.navigation["0.5.1_docs"] %}
 
-{% if page.language == "cn" %}
-{% assign navigation = site.data.navigation["0.5.1_cn_docs"] %}
-{% assign menu_label = "文档菜单" %}
-{% endif %}
-{% endif %}
+{% if page.language == "cn" %}
+{% assign navigation = site.data.navigation["0.5.1_cn_docs"] %}
+{% assign menu_label = "文档菜单" %}
+{% endif %}
 {% elsif page.version == "0.5.2" %}
-{% assign navigation = site.data.navigation["0.5.2_docs"] %}
+{% assign navigation = site.data.navigation["0.5.2_docs"] %}
 
-{% if page.language == "cn" %}
-{% assign navigation = site.data.navigation["0.5.2_cn_docs"] %}
-{% assign menu_label = "文档菜单" %}
-{% endif %}
+{% if page.language == "cn" %}
+{% assign navigation = site.data.navigation["0.5.2_cn_docs"] %}
+{% assign menu_label = "文档菜单" %}
+{% endif %}
 {% endif %}
 {% endif %}
 



[GitHub] [incubator-hudi] xushiyan commented on issue #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
xushiyan commented on issue #1440: [HUDI-731] Add ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#issuecomment-603012975
 
 
   @vinothchandar The current approach is also compatible with existing CLI 
usage. The user extends this abstract class and plugs that custom class into 
the `--transformer-class` param. So the difference is: the transformer list is 
supplied from the user's code instead of being parsed from CLI args. A 
hypothetical user subclass is sketched below. 
   
   My reasoning for this implementation is: it saves some parsing logic to 
maintain, and users get more flexibility in how they supply transformers 
(via their own code).
   
   That said, I don't mind making the transformers parsed from the CLI. 
It's kind of a user-experience trade-off.
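   
   For illustration, a minimal sketch of such a user subclass; the class name and the transformer choices below are hypothetical examples, not part of this PR:
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.function.Supplier;
   
   import org.apache.hudi.utilities.transform.ChainedTransformer;
   import org.apache.hudi.utilities.transform.FlatteningTransformer;
   import org.apache.hudi.utilities.transform.SqlQueryBasedTransformer;
   import org.apache.hudi.utilities.transform.Transformer;
   
   // Hypothetical user code: supply the chain in the desired order, then pass
   // this class's name via --transformer-class.
   public class MyChainedTransformer extends ChainedTransformer {
   
     @Override
     protected Supplier<List<Transformer>> supplyTransformers() {
       return () -> Arrays.asList(new FlatteningTransformer(), new SqlQueryBasedTransformer());
     }
   }
   ```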




[GitHub] [incubator-hudi] vinothchandar merged pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-23 Thread GitBox
vinothchandar merged pull request #1412: [HUDI-504] Restructuring and 
auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412
 
 
   




[GitHub] [incubator-hudi] vinothchandar commented on issue #1438: How to get the file name corresponding to HoodieKey through the GlobalBloomIndex

2020-03-23 Thread GitBox
vinothchandar commented on issue #1438: How to get the file name corresponding 
to HoodieKey through the GlobalBloomIndex 
URL: https://github.com/apache/incubator-hudi/issues/1438#issuecomment-603008577
 
 
   That does not sound right.. maybe there is a gap here w.r.t the Global index? 
   
   @nsivabalan do you have cycles to check this out.. (indexing related) 




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
xushiyan commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396894929
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.util.TypedProperties;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.List;
+import java.util.function.Supplier;
+
+/**
+ * An abstract {@link Transformer} to chain other {@link Transformer}s and 
apply sequentially.
+ * <p>
+ * A subclass is to supply a {@link List} of {@link Transformer}s in desired 
sequence.
+ */
+public abstract class ChainedTransformer implements Transformer {
 
 Review comment:
   will do the docs change once this merges. thanks.




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
xushiyan commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396894929
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.util.TypedProperties;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.List;
+import java.util.function.Supplier;
+
+/**
+ * An abstract {@link Transformer} to chain other {@link Transformer}s and 
apply sequentially.
+ * <p>
+ * A subclass is to supply a {@link List} of {@link Transformer}s in desired 
sequence.
+ */
+public abstract class ChainedTransformer implements Transformer {
 
 Review comment:
   will do the docs change once this is approved. thanks.




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #226

2020-03-23 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.36 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
or

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
yanghua commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396879048
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
 
 Review comment:
   We can add a new section in the [writing data documentation 
page](http://hudi.apache.org/docs/writing_data.html). The documentation 
branch is `asf-site`, so it's OK to describe it in another PR.




[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-23 Thread GitBox
umehrot2 commented on a change in pull request #1427: [HUDI-727]: Copy default 
values of fields if not present when rewriting incoming record with new schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r396877351
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/TestHoodieAvroUtils.java
 ##
 @@ -57,4 +60,16 @@ public void testPropsPresent() {
 }
 Assert.assertTrue("column pii_col doesn't show up", piiPresent);
   }
+
+  @Test
+  public void testDefaultValue() {
+GenericRecord rec = new GenericData.Record(new 
Schema.Parser().parse(EXAMPLE_SCHEMA));
+rec.put("_row_key", "key1");
+rec.put("non_pii_col", "val1");
+rec.put("pii_col", "val2");
+rec.put("timestamp", 3.5);
 
 Review comment:
   So the issue seems to be that in the original record created this way, the 
default value shows up as `null`. Even though you have specified `default: 
dummy_val`, it still shows up as `null` in the original record.
   
   Do you know why that is the case? When we have specified the default value, 
why doesn't Avro put it in the record when the field is missing?
   
   I tried using the builder, but that expects a default value to be specified 
for each and every field, otherwise it throws an exception:
   ```
   GenericRecord rec = new GenericRecordBuilder(new Schema.Parser().parse(EXAMPLE_SCHEMA)).build();
   ```
   Do you have more research points around why this is the case with Avro?
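   
   For context, a small self-contained sketch of the Avro behavior in question, based on how `GenericData.Record` and `GenericRecordBuilder` generally behave (the schema below is a hypothetical stand-in for `EXAMPLE_SCHEMA`):
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.GenericRecordBuilder;
   
   public class AvroDefaultsSketch {
   
     private static final String SCHEMA_JSON = "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":"
         + "[{\"name\":\"pii_col\",\"type\":\"string\",\"default\":\"dummy_val\"}]}";
   
     public static void main(String[] args) {
       Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
   
       // GenericData.Record is a plain container: declared defaults are NOT
       // applied at construction time, so an unset field reads back as null.
       GenericRecord plain = new GenericData.Record(schema);
       System.out.println(plain.get("pii_col")); // null
   
       // GenericRecordBuilder fills in declared defaults for unset fields, and
       // throws for unset fields that have no default.
       GenericRecord built = new GenericRecordBuilder(schema).build();
       System.out.println(built.get("pii_col")); // dummy_val
     }
   }
   ```
   
   Defaults in Avro are primarily a schema-resolution concept, applied when a reader's schema has a field the writer's data lacks, which is presumably why `GenericData.Record` does not materialize them at construction time.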
   




[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-23 Thread GitBox
umehrot2 commented on a change in pull request #1427: [HUDI-727]: Copy default 
values of fields if not present when rewriting incoming record with new schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r396877767
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java
 ##
 @@ -104,15 +103,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, METADATA_FIELD_SCHEMA, "", null);
 
 Review comment:
   Are you making these changes to avoid use of deprecated APIs ?




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
xushiyan commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396873427
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
 
 Review comment:
   @yanghua Fixing it.




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
xushiyan commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396873391
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.util.TypedProperties;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.List;
+import java.util.function.Supplier;
+
+/**
+ * An abstract {@link Transformer} to chain other {@link Transformer}s and 
apply sequentially.
+ * <p>
+ * A subclass is to supply a {@link List} of {@link Transformer}s in desired 
sequence.
+ */
+public abstract class ChainedTransformer implements Transformer {
 
 Review comment:
   Sure @yanghua . Could you point me to where I can add this?




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
yanghua commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396869571
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
 
 Review comment:
   This class is missing the Apache license header. And it's strange that 
Travis is green; does it not check this rule? cc @vinothchandar 




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
yanghua commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396869939
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.util.TypedProperties;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.List;
+import java.util.function.Supplier;
+
+/**
+ * An abstract {@link Transformer} to chain other {@link Transformer}s and 
apply sequentially.
+ * <p>
+ * A subclass is to supply a {@link List} of {@link Transformer}s in desired 
sequence.
+ */
+public abstract class ChainedTransformer implements Transformer {
+
+  protected abstract Supplier<List<Transformer>> supplyTransformers();
+
+  @Override
+  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession, Dataset<Row> rowDataset, TypedProperties properties) {
 
 Review comment:
   It would be better to have a test case which extends `ChainedTransformer` 
and mocks some transformers to verify this function; see the sketch below.
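   
   For illustration, a rough sketch of such a test, assuming a local `SparkSession` and lambda-friendly `Transformer` mocks; the names and setup here are hypothetical:
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.function.Supplier;
   
   import org.apache.hudi.common.util.TypedProperties;
   import org.apache.hudi.utilities.transform.ChainedTransformer;
   import org.apache.hudi.utilities.transform.Transformer;
   import org.apache.spark.api.java.JavaSparkContext;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.functions;
   import org.junit.After;
   import org.junit.Assert;
   import org.junit.Before;
   import org.junit.Test;
   
   public class TestChainedTransformer {
   
     private SparkSession spark;
     private JavaSparkContext jsc;
   
     @Before
     public void setUp() {
       spark = SparkSession.builder().master("local[1]").appName("test-chained-transformer").getOrCreate();
       jsc = new JavaSparkContext(spark.sparkContext());
     }
   
     @After
     public void tearDown() {
       spark.stop();
     }
   
     // Each mock transformer adds one literal column, so the effect of the
     // chain is observable through the output columns.
     private static Transformer addColumn(String name) {
       return (sc, session, rows, props) -> rows.withColumn(name, functions.lit(1));
     }
   
     @Test
     public void testApplyRunsTransformersInSequence() {
       ChainedTransformer chained = new ChainedTransformer() {
         @Override
         protected Supplier<List<Transformer>> supplyTransformers() {
           return () -> Arrays.asList(addColumn("first"), addColumn("second"));
         }
       };
       Dataset<Row> out = chained.apply(jsc, spark, spark.range(1).toDF(), new TypedProperties());
       Assert.assertTrue(Arrays.asList(out.columns()).containsAll(Arrays.asList("first", "second")));
     }
   }
   ```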




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
yanghua commented on a change in pull request #1440: [HUDI-731] Add 
ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#discussion_r396870236
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java
 ##
 @@ -0,0 +1,30 @@
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.util.TypedProperties;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.List;
+import java.util.function.Supplier;
+
+/**
+ * An abstract {@link Transformer} to chain other {@link Transformer}s and 
apply sequentially.
+ * <p>
+ * A subclass is to supply a {@link List} of {@link Transformer}s in desired 
sequence.
+ */
+public abstract class ChainedTransformer implements Transformer {
 
 Review comment:
   It's a user-facing feature, so we'd better describe it in the documentation. 
WDYT?




[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
codecov-io edited a comment on issue #1440: [HUDI-731] Add ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#issuecomment-602962209
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=h1) 
Report
   > Merging 
[#1440](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/cafc87041baf4055c39244e7cde0187437bb03d4&el=desc)
 will **increase** coverage by `0.00%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1440/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=tree)
   
   ```diff
    @@             Coverage Diff              @@
    ##             master    #1440      +/-   ##
    ============================================
      Coverage     67.61%   67.61%
      Complexity      254      254
    ============================================
      Files           340      341       +1
      Lines         16504    16510       +6
      Branches       1689     1690       +1
    ============================================
    + Hits          11159    11164       +5
    - Misses         4606     4607       +1
      Partials        739      739
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/utilities/transform/ChainedTransformer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1440/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9DaGFpbmVkVHJhbnNmb3JtZXIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1440/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `70.27% <0.00%> (+13.51%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=footer).
 Last update 
[cafc870...81b3b0c](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   




[GitHub] [incubator-hudi] codecov-io commented on issue #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
codecov-io commented on issue #1440: [HUDI-731] Add ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440#issuecomment-602962209
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=h1) 
Report
   > Merging 
[#1440](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/cafc87041baf4055c39244e7cde0187437bb03d4&el=desc)
 will **increase** coverage by `0.00%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1440/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=tree)
   
   ```diff
    @@             Coverage Diff              @@
    ##             master    #1440      +/-   ##
    ============================================
      Coverage     67.61%   67.61%
      Complexity      254      254
    ============================================
      Files           340      341       +1
      Lines         16504    16510       +6
      Branches       1689     1690       +1
    ============================================
    + Hits          11159    11164       +5
    - Misses         4606     4607       +1
      Partials        739      739
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...e/hudi/utilities/transform/ChainedTransformer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1440/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9DaGFpbmVkVHJhbnNmb3JtZXIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1440/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `70.27% <0.00%> (+13.51%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=footer).
 Last update 
[cafc870...81b3b0c](https://codecov.io/gh/apache/incubator-hudi/pull/1440?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   




[jira] [Updated] (HUDI-731) Implement a chained transformer for deltastreamer that can chain other transformer implementations

2020-03-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-731:

Labels: pull-request-available  (was: )

> Implement a chained transformer for deltastreamer that can chain other 
> transformer implementations
> --
>
> Key: HUDI-731
> URL: https://issues.apache.org/jira/browse/HUDI-731
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Utilities
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
>






[GitHub] [incubator-hudi] xushiyan opened a new pull request #1440: [HUDI-731] Add ChainedTransformer

2020-03-23 Thread GitBox
xushiyan opened a new pull request #1440: [HUDI-731] Add ChainedTransformer
URL: https://github.com/apache/incubator-hudi/pull/1440
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-731) Implement a chained transformer for deltastreamer that can chain other transformer implementations

2020-03-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-731:

Status: Open  (was: New)

> Implement a chained transformer for deltastreamer that can chain other 
> transformer implementations
> --
>
> Key: HUDI-731
> URL: https://issues.apache.org/jira/browse/HUDI-731
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Utilities
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>






[jira] [Assigned] (HUDI-731) Implement a chained transformer for deltastreamer that can chain other transformer implementations

2020-03-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-731:
---

Assignee: Raymond Xu

> Implement a chained transformer for deltastreamer that can chain other 
> transformer implementations
> --
>
> Key: HUDI-731
> URL: https://issues.apache.org/jira/browse/HUDI-731
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Utilities
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>






[jira] [Updated] (HUDI-731) Implement a chained transformer for deltastreamer that can chain other transformer implementations

2020-03-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-731:

Status: In Progress  (was: Open)

> Implement a chained transformer for deltastreamer that can chain other 
> transformer implementations
> --
>
> Key: HUDI-731
> URL: https://issues.apache.org/jira/browse/HUDI-731
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Utilities
>Reporter: Vinoth Chandar
>Assignee: Raymond Xu
>Priority: Major
>






[GitHub] [incubator-hudi] umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-03-23 Thread GitBox
umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of 
struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602933007
 
 
   > > > So anyone who has written data using databricks-avro will face issues 
reading.
   > 
   > By this you mean, reading for merging data (i.e during ingestion/writing) 
or querying via Spark/Hive/Presto?
   
   Yeah, I mean writing additional data using `spark-avro` on top of an old 
table written with `databricks-avro`. Querying should not be affected.




[jira] [Created] (HUDI-731) Implement a chained transformer for deltastreamer that can chain other transformer implementations

2020-03-23 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-731:
---

 Summary: Implement a chained transformer for deltastreamer that 
can chain other transformer implementations
 Key: HUDI-731
 URL: https://issues.apache.org/jira/browse/HUDI-731
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: DeltaStreamer, Utilities
Reporter: Vinoth Chandar








[GitHub] [incubator-hudi] vinothchandar commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-03-23 Thread GitBox
vinothchandar commented on issue #1406: [HUDI-713] Fix conversion of Spark 
array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602881385
 
 
   >>So anyone who has written data using databricks-avro will face issues 
reading.
   By this you mean, reading for merging data (i.e during ingestion/writing) or 
querying via Spark/Hive/Presto? 




[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-03-23 Thread GitBox
vinothchandar edited a comment on issue #1406: [HUDI-713] Fix conversion of 
Spark array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602881385
 
 
   >>So anyone who has written data using databricks-avro will face issues 
reading.
   
   By this you mean, reading for merging data (i.e during ingestion/writing) or 
querying via Spark/Hive/Presto? 




[GitHub] [incubator-hudi] umehrot2 edited a comment on issue #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-23 Thread GitBox
umehrot2 edited a comment on issue #1427: [HUDI-727]: Copy default values of 
fields if not present when rewriting incoming record with new schema
URL: https://github.com/apache/incubator-hudi/pull/1427#issuecomment-602847315
 
 
   > @umehrot2 could you please help review?
   
   Will take a look.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-23 Thread GitBox
umehrot2 commented on issue #1427: [HUDI-727]: Copy default values of fields if 
not present when rewriting incoming record with new schema
URL: https://github.com/apache/incubator-hudi/pull/1427#issuecomment-602847315
 
 
   > @umehrot2 could you please help review?
   
   Will take a look.




[GitHub] [incubator-hudi] umehrot2 edited a comment on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-03-23 Thread GitBox
umehrot2 edited a comment on issue #1406: [HUDI-713] Fix conversion of Spark 
array of struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602846762
 
 
   > LGTM overall..
   > 
   > @umehrot2 @zhedoubushishi generally speaking, this schema namespace 
mismatch.. is this a backwards incompatible change.. i.e. if people have 
written data using 0.5.1, could they use master/0.6.0 to read and write without 
pain?
   
   @vinothchandar with 0.5.1 currently you cannot even write some of these 
complex data types like Array of structs etc. So this is actually a fix, and is 
not backwards incompatible with 0.5.1 since it uses `spark-avro`. However, it 
will be backwards incompatible with `databricks-avro`. So anyone who has 
written data using `databricks-avro` will face issues reading.




[GitHub] [incubator-hudi] umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of struct type to Avro schema

2020-03-23 Thread GitBox
umehrot2 commented on issue #1406: [HUDI-713] Fix conversion of Spark array of 
struct type to Avro schema
URL: https://github.com/apache/incubator-hudi/pull/1406#issuecomment-602846762
 
 
   > LGTM overall..
   > 
   > @umehrot2 @zhedoubushishi generally speaking, this schema namespace 
mismatch.. is this a backwards incompatible change.. i.e. if people have 
written data using 0.5.1, could they use master/0.6.0 to read and write without 
pain?
   
   @vinothchandar with 0.5.1 currently you cannot even write some of these 
complex data types like Array of structs etc. So this is actually a fix, and is 
not backwards incompatible with 0.5.1 since it uses `spark-avro`. However, it 
will be backwards incompatible with `databricks-avro`. So anyone who has 
written data using `databricks-avro` will face issues reading.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1439: Hudi class loading problem

2020-03-23 Thread GitBox
lamber-ken commented on issue #1439: Hudi class loading problem
URL: https://github.com/apache/incubator-hudi/issues/1439#issuecomment-602812140
 
 
   I'm not familiar with Apache Tez, but from the above stacktrace, Tez runs 
on a YARN cluster, 
   so I think it may work if we place `hudi-hadoop-mr-xx.jar` on the Hadoop 
classpath.
   
![image](https://user-images.githubusercontent.com/20113411/77355952-5e169080-6d80-11ea-80a7-4fb7ed502c1c.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065030#comment-17065030
 ] 

lamber-ken commented on HUDI-686:
-

Right, this is a nice design. Some thoughts:
 * if the input data is large, we need to increase the number of partitions, 
since "candidates" holds all the data for a partition
 * if we increase partitions, it will cause duplicate loading of the same 
partition (e.g. populateFileIDs() && populateRangeAndBloomFilters())

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
    JavaSparkContext jsc,
    HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s",
      record.getPartitionPath(), record.getRecordKey()),
      true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
cleanup();
this.currentPartitionPath = partitionPath;
populateFileIDs();
populateRangeAndBloomFilters();
  }
}
{code}
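
For illustration, here is a minimal sketch of one way to avoid re-loading the 
same partition's file IDs and bloom filters when the same hoodie partition 
shows up repeatedly. The {{BloomMetadataCache}} class below is hypothetical, 
not part of the Hudi codebase:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * A tiny LRU cache keyed by partition path. The populateFileIDs() /
 * populateRangeAndBloomFilters() style work would only run on a cache miss.
 * Sketch only.
 */
public class BloomMetadataCache<V> {

  private final Map<String, V> cache;

  public BloomMetadataCache(int maxPartitions) {
    // access-ordered LinkedHashMap that evicts the least recently used entry
    this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxPartitions;
      }
    };
  }

  public V get(String partitionPath, Function<String, V> loader) {
    // loader runs only when the partition has not been seen (or was evicted)
    return cache.computeIfAbsent(partitionPath, loader);
  }
}
{code}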

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-23 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r396655762
 
 

 ##
 File path: .travis.yml
 ##
 @@ -31,7 +31,7 @@ after_success:
   - echo $TRAVIS_PULL_REQUEST
   - 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then echo "ignore push build 
result for per submit"; exit 0; fi'
   - 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then echo "pushing build result 
..."; fi'
-  - \cp -rf ${DOCS_ROOT}/_site/* test-content
+  - mkdir test-content && \cp -rf ${DOCS_ROOT}/_site/* test-content
 
 Review comment:
   if `test-content` doesn't exist, the `\cp` command doesn't work on macOS, so I 
think creating it before copying content into it is better, as you mentioned.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-23 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r396657847
 
 

 ##
 File path: .travis.yml
 ##
 @@ -31,7 +31,7 @@ after_success:
   - echo $TRAVIS_PULL_REQUEST
   - 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then echo "ignore push build 
result for per submit"; exit 0; fi'
   - 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then echo "pushing build result 
..."; fi'
-  - \cp -rf ${DOCS_ROOT}/_site/* test-content
+  - mkdir test-content && \cp -rf ${DOCS_ROOT}/_site/* test-content
 
 Review comment:
   it worked well during my previous tests in the Travis CI env.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-23 Thread Feichi Feng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064994#comment-17064994
 ] 

Feichi Feng commented on HUDI-724:
--

Hi [~vbalaji], is there anything else I need to address for the PR? 

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 0.5h
>  Remaining Estimate: 47.5h
>
> When writing data, a gap was observed between Spark stages. Tracking down 
> where the time was spent on the Spark driver showed it was the 
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a plain for-loop to get the list of small files for all partitions 
> that the job is going to write data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the Spark driver process, all other worker nodes sit idle waiting for 
> tasks.
> Those partitions don't affect each other, so the get-small-files operations 
> can be parallelized. The change I made is to pass the JavaSparkContext to 
> the UpsertPartitioner, create an RDD for the partitions, and eventually send 
> the get-small-files operations to multiple tasks.
>  
> Screenshots attached for 
> the gap without the improvement 
> the spark stage with the improvement (no gap)
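
As a rough sketch of the parallelization described above (illustrative only; 
{{getSmallFilesForPartition}} below is a hypothetical stand-in for the 
existing per-partition lookup in UpsertPartitioner):
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SmallFileLookupSketch {

  // Hypothetical stand-in for the real driver-side small-file listing
  static List<String> getSmallFilesForPartition(String partitionPath) {
    return Collections.emptyList(); // replace with the real file-system lookup
  }

  // Ship the per-partition lookups to executors instead of looping serially
  // on the driver while the workers sit idle
  public static Map<String, List<String>> lookupSmallFiles(JavaSparkContext jsc,
      List<String> partitionPaths) {
    int parallelism = Math.max(partitionPaths.size(), 1);
    return jsc.parallelize(partitionPaths, parallelism)
        .mapToPair(p -> new Tuple2<>(p, getSmallFilesForPartition(p)))
        .collectAsMap();
  }
}
{code}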



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] satishkotha commented on issue #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-23 Thread GitBox
satishkotha commented on issue #1396: [HUDI-687] Stop incremental reader on RO 
table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#issuecomment-602769754
 
 
   @vinothchandar could you take another look at this one?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-23 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r396655762
 
 

 ##
 File path: .travis.yml
 ##
 @@ -31,7 +31,7 @@ after_success:
   - echo $TRAVIS_PULL_REQUEST
   - 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then echo "ignore push build 
result for per submit"; exit 0; fi'
   - 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then echo "pushing build result 
..."; fi'
-  - \cp -rf ${DOCS_ROOT}/_site/* test-content
+  - mkdir test-content && \cp -rf ${DOCS_ROOT}/_site/* test-content
 
 Review comment:
   if `test-content` doesn't exist, the `\cp` command doesn't work on macOS, so I 
think creating it first is better, as you mentioned.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-03-23 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r396570732
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param <T>
+ */
+public class HoodieSimpleIndex<T extends HoodieRecordPayload> extends HoodieBloomIndex<T> {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which contains it. Option.Empty if the key is not
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD<HoodieKey, Option<Pair<String, String>>> fetchRecordLocation(JavaRDD<HoodieKey> hoodieKeys,
+      JavaSparkContext jsc, HoodieTable<T> hoodieTable) {
+    JavaPairRDD<String, String> partitionRecordKeyPairRDD =
+        hoodieKeys.mapToPair(key -> new Tuple2<>(key.getPartitionPath(), key.getRecordKey()));
+
+    // Lookup indexes for all the partition/recordkey pair
+    JavaPairRDD<HoodieKey, HoodieRecordLocation> recordKeyLocationRDD =
+        lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+    JavaPairRDD<HoodieKey, String> keyHoodieKeyPairRDD = hoodieKeys.mapToPair(key -> new Tuple2<>(key, null));
+
+    return keyHoodieKeyPairRDD.leftOuterJoin(recordKeyLocationRDD).mapToPair(keyLoc -> {
+      Option<Pair<String, String>> partitionPathFileidPair;
+      if (keyLoc._2._2.isPresent()) {
+        partitionPathFileidPair = Option.of(Pair.of(keyLoc._1().getPartitionPath(), keyLoc._2._2.get().getFileId()));
+      } else {
+        partitionPathFileidPair = Option.empty();
+      }
+      return new Tuple2<>(keyLoc._1, partitionPathFileidPair);
+    });
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, JavaSparkContext jsc,
+      HoodieTable<T> hoodieTable) {
+
+    // Step 0: cache the input record RDD
+    if (config.getBloomIndexUseCaching()) {
+      recordRDD.persist(config.getBloomIndexInputStorageLevel());
+    }
+
+    // Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
 
 Review comment:
   I have made this leaner @vinothchandar. You can check it out. I am yet to 
modularize the bloom index and simple index. But feel free to take a look at the 
core logic in the meantime; since you are working on a different impl, we might 
get some ideas. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] melkimohamed opened a new issue #1439: [SUPPORT] Hudi class loading problem

2020-03-23 Thread GitBox
melkimohamed opened a new issue #1439: [SUPPORT] Hudi class loading problem
URL: https://github.com/apache/incubator-hudi/issues/1439
 
 
   **Describe the problem you faced**
   I tested Hudi and everything works fine except count queries.
   Whenever I run a count (select count (*) from table;), I always get the 
following error message, even though the Hudi library is loaded.
   ```
   Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable 
to find class: org.apache.hudi.hadoop.HoodieParquetInputFormat
   ```
   
   **Note:** I am able to create Hudi tables manually and the count query 
works; the problem is only with automatically created tables (Hive sync).
   Do you have any idea about the problem of loading the Hudi lib in Hive?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   1.  Build project (Everything works well)
   I am using HDP 2.6.4 (Hive 2.1.0) with Hudi 0.5; I built the project with 
the steps below
   ```
   git clone g...@github.com:apache/incubator-hudi.git
   
   rm hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
   
   mvn clean package -DskipTests -DskipITs -Dhive.version=2.1.0
   ```
In hive-site.xml I added the configuration below
   ```
   hive.reloadable.aux.jars.path=/usr/hudi/hudi-hive-bundle-0.5.0-incubating.jar
   ```
   
   2. Creation of a dataset and synchronized it with hive(Everything works well)
   ```
export SPARK_MAJOR_VERSION=2
spark-shell --conf 
"spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf 
"spark.sql.hive.convertMetastoreParquet=false" --jars 
hdfs://mycluster/libs/hudi-spark-bundle-0.5.0-incubating.jar
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions._ 
   import org.apache.hudi.DataSourceWriteOptions 
   import org.apache.hudi.config.HoodieWriteConfig 
   import org.apache.hudi.hive.MultiPartKeysValueExtractor
   val inputDataPath = 
"hdfs://mycluster/apps/warehouse/test_acid.db/users_parquet"
   val hudiTableName = "users_cor"
   val hudiTablePath = "hdfs://mycluster/apps/warehouse/" + hudiTableName
   
   val hudiOptions = Map[String,String](
DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "year", 
HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
DataSourceWriteOptions.OPERATION_OPT_KEY ->
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "year",
DataSourceWriteOptions.HIVE_URL_OPT_KEY -> 
"jdbc:hive2://:1/;principal=...",
DataSourceWriteOptions.HIVE_USER_OPT_KEY -> "hive",
DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "default",
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "year", 
DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false",
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName
)
   
   val inputDF = spark.read.format("parquet").load(inputDataPath)
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode(SaveMode.Overwrite).
     save(hudiTablePath);
   ```
   **==> all work fine**
   
   3. update data (Everything works well)
   ```
   designation="Account Coordinator";
   val requestToUpdate = "Account Executive"
   val sqlStatement = s"SELECT count (*) FROM tdefault.users_cor WHERE 
designation = '$requestToUpdate'"
   spark.sql(sqlStatement).show()
   val updateDF = inputDF.filter(col("designation") === 
requestToUpdate).withColumn("designation", lit("Account Executive"))
   
   updateDF.write.format("org.apache.hudi").
   options(hudiOptions).option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
   mode(SaveMode.Append).
   save(hudiTablePath);
   ```
   
   4. DESCRIBE TABLE (Everything works well)
   ```
   DESCRIBE FORMATTED  users_cor;
   ```
   ...
   | SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
   | InputFormat:   | org.apache.hudi.hadoop.HoodieParquetInputFormat |
   | OutputFormat:  | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
   
   5.  Count rows (Problem)
   
   ```
   select count (*) from users_mor;
   
   20-03-23_16-09-20_722_7229255051541187826-1886/3e2bc38c-1cf9-4d96-b90c-83fd9dd4d277/map.xml:
   org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hudi.hadoop.HoodieParquetInputFormat
   Serialization trace:
   inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
   alias

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-23 Thread GitBox
vinothchandar commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r396498871
 
 

 ##
 File path: .travis.yml
 ##
 @@ -31,7 +31,7 @@ after_success:
   - echo $TRAVIS_PULL_REQUEST
   - 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then echo "ignore push build 
result for per submit"; exit 0; fi'
   - 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then echo "pushing build result 
..."; fi'
-  - \cp -rf ${DOCS_ROOT}/_site/* test-content
+  - mkdir test-content && \cp -rf ${DOCS_ROOT}/_site/* test-content
 
 Review comment:
   @lamber-ken curious why this was needed now


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] s-sanjay commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-03-23 Thread GitBox
s-sanjay commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r396443518
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/NumericUtils.java
 ##
 @@ -31,4 +38,27 @@ public static String humanReadableByteCount(double bytes) {
 String pre = "KMGTPE".charAt(exp - 1) + "";
 return String.format("%.1f %sB", bytes / Math.pow(1024, exp), pre);
   }
+
+  public static long getMessageDigestHash(final String algorithmName, final 
String string) {
+MessageDigest md;
+try {
+  md = MessageDigest.getInstance(algorithmName);
+} catch (NoSuchAlgorithmException e) {
+  throw new HoodieException(e);
+}
+return 
asLong(Objects.requireNonNull(md).digest(string.getBytes(StandardCharsets.UTF_8)));
+  }
+
+  public static long asLong(byte[] bytes) {
+ValidationUtils.checkState(bytes.length >= 8, "HashCode#asLong() requires 
>= 8 bytes.");
+return padToLong(bytes);
+  }
+
+  public static long padToLong(byte[] bytes) {
+long retVal = (bytes[0] & 0xFF);
+for (int i = 1; i < Math.min(bytes.length, 8); i++) {
 
 Review comment:
   yes I can do it :D
   
   yeah modern JIT will unroll it, if all the time it is called with 
bytes.length > 8 I guess... this was more of a readability thing... 
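   
   For reference, a hypothetical call site for the new helper (the input 
string here is made up for illustration; the result has the same shape as 
Guava's `Hashing.md5().hashString(...).asLong()`):
   ```
   import org.apache.hudi.common.util.NumericUtils;
   
   public class NumericUtilsExample {
     public static void main(String[] args) {
       // 64-bit value built from the first 8 bytes of the MD5 digest
       long hash = NumericUtils.getMessageDigestHash("MD5", "2020/03/23");
       System.out.println(hash);
     }
   }
   ```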


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] gdineshbabu88 commented on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-23 Thread GitBox
gdineshbabu88 commented on issue #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-602577700
 
 
   @pratyakshsharma Can you update the wiki for HoodieMultiDeltaStreamer 
similar to 
https://hudi.incubator.apache.org/docs/writing_data.html#deltastreamer?
   
   Can you advise in which release this tool will be available?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] loagosad opened a new issue #1438: How to get the file name corresponding to HoodieKey through the GlobalBloomIndex

2020-03-23 Thread GitBox
loagosad opened a new issue #1438: How to get the file name corresponding to 
HoodieKey through the GlobalBloomIndex 
URL: https://github.com/apache/incubator-hudi/issues/1438
 
 
   I use the `fetchRecordLocation` method in BloomIndex, which returns null, 
but it should in fact return the corresponding file path. The test code is as 
follows.
   ```
   @Test
   public void testROViewWithGlobalBloom() {
     List<HoodieKey> records = new ArrayList<>();
     records.add(new HoodieKey("5bca0f72-8e20-41be-a334-572fafc23201", null));
     records.add(new HoodieKey("b8b16798-6ea8-4bde-9440-90ee1e6b2dc2", null));
     records.add(new HoodieKey("c77d1f18-7f6b-4aa3-a6fa-64ec6986cba6", null));
     records.add(new HoodieKey("33f5a5b1-aefe-4393-9a92-9637759d5072", null));
     records.add(new HoodieKey("f2c6138d-6fd5-42ae-983b-8938a5e51ee6", null));
     records.add(new HoodieKey("1796b537-0d57-43f5-8d00-3fb1575cd5a8", null));
     JavaRDD<HoodieKey> hoodieKeys = jsc.parallelize(records);
     String basePath = "/data/hoodie_base";
     HoodieWriteConfig clientConfig =
         HoodieWriteConfig.newBuilder()
             .withPath(basePath)
             // we use HoodieGlobalBloomIndex
             .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.GLOBAL_BLOOM).build())
             .build();
     HoodieTableMetaClient metaClient = new HoodieTableMetaClient(jsc.hadoopConfiguration(), basePath, true);
     HoodieTable hoodieTable = HoodieTable.getHoodieTable(metaClient, clientConfig, jsc);
     HoodieIndex index = hoodieTable.getIndex();
     long startTime = System.currentTimeMillis();
     JavaPairRDD<HoodieKey, Option<Pair<String, String>>> lookupResultRDD =
         index.fetchRecordLocation(hoodieKeys, jsc, hoodieTable);
     JavaPairRDD<HoodieKey, Option<String>> keyToFileRDD =
         lookupResultRDD.mapToPair(r -> new Tuple2<>(r._1, convertToDataFilePath(r._2, hoodieTable)));
     List<Path> paths = keyToFileRDD.filter(keyFileTuple -> keyFileTuple._2().isPresent())
         .map(keyFileTuple -> new Path(keyFileTuple._2().get())).distinct().collect();
     System.out.println(String.format("=PATH=time cost: %d", System.currentTimeMillis() - startTime));
     System.out.println(paths);
   }
   ```
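   
   For context, `convertToDataFilePath` in the snippet above is the reporter's 
own helper. A hypothetical shape for it, assuming the 0.5.x 
`HoodieTable#getROFileSystemView` API, could be:
   ```
   // Sketch only: resolve a (partitionPath, fileId) pair to the latest data file path
   private static Option<String> convertToDataFilePath(Option<Pair<String, String>> partitionPathFileIdPair,
       HoodieTable hoodieTable) {
     if (!partitionPathFileIdPair.isPresent()) {
       return Option.empty();
     }
     String partitionPath = partitionPathFileIdPair.get().getLeft();
     String fileId = partitionPathFileIdPair.get().getRight();
     return Option.ofNullable(hoodieTable.getROFileSystemView()
         .getLatestDataFiles(partitionPath)
         .filter(dataFile -> dataFile.getFileId().equals(fileId))
         .findFirst()
         .map(HoodieDataFile::getPath)
         .orElse(null));
   }
   ```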
   
   **How can I get the file name corresponding to a HoodieKey through the 
GlobalBloomIndex?**


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services