[GitHub] [incubator-hudi] yanghua commented on issue #1541: [MINOR] Add ability to specify time unit for TimestampBasedKeyGenerator

2020-04-22 Thread GitBox


yanghua commented on issue #1541:
URL: https://github.com/apache/incubator-hudi/pull/1541#issuecomment-618193932


   @afilipchik First, you may need to figure out why Travis is red.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-787) Implement HoodieGlobalBloomIndexV2

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-787:
---

Assignee: lamber-ken

> Implement HoodieGlobalBloomIndexV2
> --
>
> Key: HUDI-787
> URL: https://issues.apache.org/jira/browse/HUDI-787
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Index
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Implement HoodieGlobalBloomIndexV2 based on HoodieBloomIndexV2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-821:

Status: Open  (was: New)

> Fix the wrong annotation of JCommander IStringConverter
> ---
>
> Key: HUDI-821
> URL: https://issues.apache.org/jira/browse/HUDI-821
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: pull-request-available
>
> Please refer to https://github.com/cbeust/jcommander/issues/253.
> If you define a list as an argument to be parsed with an IStringConverter, 
> JCommander will create a List<List<String>> instead of a List<String>.
> We should change `converter = TransformersConverter.class` to `converter = 
> StringConverter.class, listConverter = TransformersConverter.class`.
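
A hedged illustration of the fix described above, as a runnable JCommander sketch; the field name and the stand-in TransformersConverter body are assumptions for illustration, not the exact DeltaStreamer code:

```java
import com.beust.jcommander.IStringConverter;
import com.beust.jcommander.Parameter;
import com.beust.jcommander.converters.StringConverter;

import java.util.Arrays;
import java.util.List;

public class Config {

  // Stand-in for Hudi's TransformersConverter (assumed shape): turns one
  // comma-separated argument into a list of transformer class names.
  public static class TransformersConverter implements IStringConverter<List<String>> {
    @Override
    public List<String> convert(String value) {
      return Arrays.asList(value.split(","));
    }
  }

  // Before (buggy): `converter = TransformersConverter.class` was applied per
  // value, so JCommander built a List<List<String>> for this List<String> field.
  // After (the fix quoted above): StringConverter handles each element, and
  // listConverter builds the list as a whole.
  @Parameter(names = "--transformer-class",
      converter = StringConverter.class,
      listConverter = TransformersConverter.class)
  public List<String> transformerClassNames;
}
```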



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-821.
---
Resolution: Fixed

> Fix the wrong annotation of JCommander IStringConverter
> ---
>
> Key: HUDI-821
> URL: https://issues.apache.org/jira/browse/HUDI-821
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: pull-request-available
>
> Please refer to https://github.com/cbeust/jcommander/issues/253.
> If you define a list as an argument to be parsed with an IStringConverter, 
> JCommander will create a List<List<String>> instead of a List<String>.
> We should change `converter = TransformersConverter.class` to `converter = 
> StringConverter.class, listConverter = TransformersConverter.class`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-827) Translation error

2020-04-22 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-827.
-
Resolution: Fixed

> Translation error
> -
>
> Key: HUDI-827
> URL: https://issues.apache.org/jira/browse/HUDI-827
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: docs-chinese
>Affects Versions: 0.5.2
>Reporter: Lisheng Wang
>Assignee: Lisheng Wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> found translation error in 
> [https://hudi.apache.org/cn/docs/writing_data.html], 
> "如优化文件大小之类后", should be "如优化文件大小之后" 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


vinothchandar commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618180277


   @harshi2506 I am suspecting this may be due to a recent bug we fixed on 
master (still not 100% sure). Are you open to building hudi off the master branch 
and giving that a shot? I am suspecting #1394 
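
   (For reference, a minimal sketch of building off master, assuming a standard 
Maven setup; exact flags may vary by branch:)

   ```
   git clone https://github.com/apache/incubator-hudi.git
   cd incubator-hudi
   mvn clean package -DskipTests
   ```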



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1549: Potential issue when using Deltastreamer with DMS

2020-04-22 Thread GitBox


vinothchandar commented on issue #1549:
URL: https://github.com/apache/incubator-hudi/issues/1549#issuecomment-618179797


   
   > When I look into the specific S3 folder, I see that the insert and delete 
into the partition actually create a new .parquet file with no log file.
   
   So inserts in MOR still go to a parquet file; only updates go to a log file 
(merging is much more expensive, since it reads, merges, and writes parquet, 
rather than just writing parquet). So what you saw is expected behavior. 
   
   
   > it does not reflect the deletes. Based on my understanding, the _rt table 
should reflect the deletes
   
   True.. it should reflect the deletes. MOR would have logged a delete block 
into the log file, and the keys should be listed there. Do you know the log 
files? If so, you can use the CLI to see what's inside the logs; there is a 
command there to inspect the log file.. 
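
   (For reference, a hedged sketch of that CLI flow; the command names follow 
the hudi-cli of this era, and the path/pattern values are placeholders:)

   ```
   ./hudi-cli.sh
   connect --path s3://<bucket>/<table-base-path>
   show logfile metadata --logFilePathPattern '<base-path>/<partition>/.*.log.*'
   show logfile records --logFilePathPattern '<base-path>/<partition>/.*.log.*'
   ```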
   
   Happy to get this ironed out.. 
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1540: [HUDI-819] Fix a bug with MergeOnReadLazyInsertIterable.

2020-04-22 Thread GitBox


vinothchandar commented on issue #1540:
URL: https://github.com/apache/incubator-hudi/pull/1540#issuecomment-618175675


   @satishkotha let's then break that up into a separate JIRA (tagged with the Code 
Cleanup component). We can limit scope to these insert-related handles and move 
on.. wdyt 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vingov commented on issue #1526: [HUDI-1526] Add pyspark example in quickstart

2020-04-22 Thread GitBox


vingov commented on issue #1526:
URL: https://github.com/apache/incubator-hudi/pull/1526#issuecomment-618173541


   @EdwinGuo - Thanks for trying multiple options. I thought it would be easy 
to create those language-specific tabs; if it's too much work, we can create a 
separate page for the pyspark code.
   
   If we were using docusaurus, it would be super easy to create this tab 
navigation - 
https://docusaurus.io/docs/en/doc-markdown#language-specific-code-tabs
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] allenzhg opened a new issue #1555: [SUPPORT]

2020-04-22 Thread GitBox


allenzhg opened a new issue #1555:
URL: https://github.com/apache/incubator-hudi/issues/1555


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   Caused by: java.lang.IllegalAccessError: class 
org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface 
org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
   
   Can hudi support kerberos auth?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. ./hudi-cli.sh
   2. connect --path hdfs://HDFS/user/hadoop/hudi_trips_cow
   3. commit rollback --commit 20200423113212
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : hudi-spark_2.11-0.6.0
   
   * Spark version :spark_2.4.3
   
   * Hive version : 3.1.1
   
   * Hadoop version : 3.1.2
   
   * Storage (HDFS/S3/GCS..) :  hdfs
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #256

2020-04-22 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Created] (HUDI-834) Concrete signature of HoodieRecordPayload#combineAndGetUpdateValue & HoodieRecordPayload#getInsertValue

2020-04-22 Thread Zili Chen (Jira)
Zili Chen created HUDI-834:
--

 Summary: Concrete signature of 
HoodieRecordPayload#combineAndGetUpdateValue & 
HoodieRecordPayload#getInsertValue
 Key: HUDI-834
 URL: https://issues.apache.org/jira/browse/HUDI-834
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Zili Chen


So far, the return type of {{HoodieRecordPayload#combineAndGetUpdateValue}} & 
{{HoodieRecordPayload#getInsertValue}} is effectively 
{{Option<GenericRecord>}}. Instead of doing an unchecked cast at

org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java:88

I propose we use {{Option<GenericRecord>}} as the return type of these two 
methods, replacing the current {{Option<IndexedRecord>}}.

FYI, I encountered this while trying to get rid of the self type parameter in 
{{HoodieRecordPayload}}, and found that the casting is a bit awkward if we don't 
take a self type. Fortunately, we can make the return type concrete directly.
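
A sketch of the proposed signature change (paraphrased from this ticket; the 
interface is simplified for illustration and is not the full Hudi API):

```java
import java.io.IOException;
import java.io.Serializable;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.util.Option;

// Simplified; the real interface also carries a self type parameter and preCombine().
public interface HoodieRecordPayload extends Serializable {

  // Today these effectively return GenericRecords typed as Option<IndexedRecord>,
  // forcing readers such as RealtimeCompactedRecordReader to cast unchecked.
  // The proposal is to return the concrete Option<GenericRecord> directly.
  Option<GenericRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException;

  Option<GenericRecord> getInsertValue(Schema schema) throws IOException;
}
```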

cc [~vinoth] [~leesf]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-821) Fix the wrong annotation of JCommander IStringConverter

2020-04-22 Thread dengziming (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090198#comment-17090198
 ] 

dengziming commented on HUDI-821:
-

Please close it since it has been resolved by another commit.

> Fix the wrong annotation of JCommander IStringConverter
> ---
>
> Key: HUDI-821
> URL: https://issues.apache.org/jira/browse/HUDI-821
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: pull-request-available
>
> Please refer to https://github.com/cbeust/jcommander/issues/253.
> If you define a list as an argument to be parsed with an IStringConverter, 
> JCommander will create a List<List<String>> instead of a List<String>.
> We should change `converter = TransformersConverter.class` to `converter = 
> StringConverter.class, listConverter = TransformersConverter.class`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu commented on issue #1224: [HUDI-397] Normalize log print statement

2020-04-22 Thread GitBox


wangxianghu commented on issue #1224:
URL: https://github.com/apache/incubator-hudi/pull/1224#issuecomment-618142920


   hi @n3nash, I have rebased this PR, PTAL, thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] hddong opened a new pull request #1554: [HUDI-704]Add test for RepairsCommand

2020-04-22 Thread GitBox


hddong opened a new pull request #1554:
URL: https://github.com/apache/incubator-hudi/pull/1554


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Add test for RepairsCommand in hudi-cli module*
   
   ## Brief change log
   
 - *Add test for RepairsCommand*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-704) Add unit test for RepairsCommand

2020-04-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-704:

Labels: pull-request-available  (was: )

> Add unit test for RepairsCommand
> 
>
> Key: HUDI-704
> URL: https://issues.apache.org/jira/browse/HUDI-704
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wanglisheng81 commented on issue #1551: [HUDI-827] fix translation error

2020-04-22 Thread GitBox


wanglisheng81 commented on issue #1551:
URL: https://github.com/apache/incubator-hudi/pull/1551#issuecomment-618132796


   > Thanks for your contribution @wanglisheng81 ! LGTM, merging.
   > PS: it would be a MINOR change without a jira issue. :) anyway, thanks for 
the contribution.
   
   Got it, @leesf, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] c-f-cooper commented on issue #143: Tracking ticket for folks to be added to slack group

2020-04-22 Thread GitBox


c-f-cooper commented on issue #143:
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-618121217


   > @c-f-cooper @superguhua @jenu9417 @dahirainbow welcome and done
   
   thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1044: [HUDI-361] Implement CSV metrics reporter

2020-04-22 Thread GitBox


lamber-ken commented on issue #1044:
URL: https://github.com/apache/incubator-hudi/pull/1044#issuecomment-618120915


   hi @XuQianJin-Stars, thanks for your contribution. The CSV metrics reporter 
seems not to be popular in production, so I suggest closing this, WDYT? @leesf 
   
   Also, flink doesn't provide a CSV reporter:
   
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] wangxianghu commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-04-22 Thread GitBox


wangxianghu commented on issue #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-618119493


   @n3nash well done !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash edited a comment on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-04-22 Thread GitBox


n3nash edited a comment on issue #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-618113749


   @yanghua @bvaradar @vinothchandar I've fixed this PR since it was failing 
builds due to multiple pom issues. I've rebased the code from the last time we 
did this (lots of code has naturally changed) and cleaned up some of the code. 
   
   This test suite now has test cases that one can use to run end-to-end tests 
in junit. At the moment, the test suite does not run in docker, because Spark 2.4 
brings in Hive 1.x dependencies while our code uses Hive 2.x to spin up local 
hive server nodes. There is a hacky approach that we are using at Uber which 
will be upstreamed in the next couple of weeks 
(https://issues.apache.org/jira/browse/HUDI-830), and as part of that we can 
discuss how to solve it (the right way to solve it without any hacks is to move 
to Spark 3.x, since they upgraded the Hive libs there, but that might take a 
while). Once that is done, we can even run this as an integration test.
   
   The following is my suggestion for a plan: 
   1) Land this PR, which provides an initial test suite to test basic 
end-to-end functionality. This allows folks to at least start using this 
framework to test large PRs while spending minimal time enhancing the test 
suite (especially since lots of refactoring is happening). Last I checked, 
@yanghua, I think you went over the PR and were okay to merge. @vinothchandar 
@bvaradar, unless we have major concerns on the PR, we can merge it and then 
take incremental PRs.
   2) There are many tickets under HUDI-289; all the enhancements to the test 
suite from Uber will follow in subsequent PRs in the next 2-3 weeks. I'm 
working with the necessary folks on some of the enhancements that I 
want to do in the coming weeks.
   3) @yanghua, I need you to lead the Azure pipelines for the test suite and 
the other tickets assigned to you under the umbrella ticket.
   
   Additionally, since the test suite actually tests all end-to-end 
functionality, the main class tests take a while to run. Also, without the test 
suite we are already nearing the max time allowed for unit tests on travis (50 
mins), so I've separated the test suite unit tests into a different job, 
just like the integration tests.  
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end

2020-04-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-289:

Labels: pull-request-available  (was: )

> Implement a test suite to support long running test for Hudi writing and 
> querying end-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> We would need an equivalent of an end-end test which runs some workload for 
> a few hours at least, triggers various actions like commit, deltacommit, 
> rollback, compaction, and ensures correctness of code before every release
> P.S: Learn from all the CSS issues managing compaction..
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-04-22 Thread GitBox


n3nash commented on issue #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-618113749


   @yanghua @bvaradar @vinothchandar I've fixed this PR and cleaned up a bunch 
of the code. This test suite now has test cases that one can use to run 
end-to-end tests in junit.
   
   At the moment, the test suite does not run in docker, because Spark 2.4 
brings in Hive 1.x dependencies while our code uses Hive 2.x to spin up local 
hive server nodes. There is a hacky approach that we are using at Uber which 
will be upstreamed in the next couple of weeks 
(https://issues.apache.org/jira/browse/HUDI-830), and as part of that we can 
discuss how to solve it (the right way to solve it without any hacks is to move 
to Spark 3.x, since they upgraded the Hive libs there, but that might take a 
while). Once that is done, we can even run this as an integration test.
   
   The following is my suggestion for a plan: 
   1) Land this PR, which provides an initial test suite to test basic 
end-to-end functionality. This allows folks to at least start using this 
framework to test large PRs while spending minimal time enhancing the test 
suite (especially since lots of refactoring is happening). Last I checked, 
@yanghua, I think you went over the PR and were okay to merge. @vinothchandar 
@bvaradar, unless we have major concerns on the PR, we can merge it and then 
take incremental PRs.
   2) There are many tickets under HUDI-289; all the enhancements to the test 
suite from Uber will follow in subsequent PRs in the next 2-3 weeks. I'm 
working with the necessary folks on some of the enhancements that I 
want to do in the coming weeks.
   3) @yanghua, I need you to lead the Azure pipelines for the test suite and 
the other tickets assigned to you under the umbrella ticket.
   
   Additionally, since the test suite actually tests all end-to-end 
functionality, the main class tests take a while to run. Also, without the test 
suite we are already nearing the max time allowed for unit tests on travis (50 
mins), so I've separated the test suite unit tests into a different job, 
just like the integration tests.  
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-833) Allow test suite to change hudi write configs for any dag node

2020-04-22 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-833:


 Summary: Allow test suite to change hudi write configs for any dag 
node
 Key: HUDI-833
 URL: https://issues.apache.org/jira/browse/HUDI-833
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Testing
Reporter: Nishith Agarwal
Assignee: satish






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-832) Add ability to induce failures to catch issues

2020-04-22 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-832:


 Summary: Add ability to induce failures to catch issues
 Key: HUDI-832
 URL: https://issues.apache.org/jira/browse/HUDI-832
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Testing
Reporter: Nishith Agarwal
Assignee: Abhishek Modi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-831) Augment the existing DAG with more complex use-cases seen at Uber

2020-04-22 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-831:


 Summary: Augment the existing DAG with more complex use-cases seen 
at Uber
 Key: HUDI-831
 URL: https://issues.apache.org/jira/browse/HUDI-831
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Testing
Reporter: Nishith Agarwal
Assignee: Abhishek Modi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-830) Fix issues related to running the test suite in docker due to Hive 2.x

2020-04-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-830:


Assignee: Abhishek Modi

> Fix issues related to running the test suite in docker due to Hive 2.x
> --
>
> Key: HUDI-830
> URL: https://issues.apache.org/jira/browse/HUDI-830
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: Abhishek Modi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-830) Fix issues related to running the test suite in docker due to Hive 2.x

2020-04-22 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-830:


 Summary: Fix issues related to running the test suite in docker 
due to Hive 2.x
 Key: HUDI-830
 URL: https://issues.apache.org/jira/browse/HUDI-830
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Testing
Reporter: Nishith Agarwal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090098#comment-17090098
 ] 

Udit Mehrotra edited comment on HUDI-829 at 4/23/20, 12:07 AM:
---

You may also want to look at my implementation of a custom relation in Spark to 
read bootstrapped tables: 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 Here I am building my own file index using Spark's InMemoryFileIndex, but the 
filtering part is just one operation now, because once I have all the files, 
the hudi filesystem view is created just once to get the latest files. It's 
still a work in progress, and I have yet to see how fast this is going to be. 
We can consider moving to a place where our reading in spark happens through 
our relations, and underneath we use the native readers.

In practice, the above data source implementation I am working on would work 
for both bootstrapped and non-bootstrapped tables. So once it is working and 
stable, it would be nice to do some benchmarking for regular (non-bootstrapped) 
tables as well and compare their read performance against the standard parquet 
data source with filters.


was (Author: uditme):
You may also want to look at my implementation of custom relation in Spark to 
read bootstrapped tables 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 . Here I am building my own file index using spark's InMemoryFileIndex, but 
the filtering part is just one operation now because once I have all the files, 
the hudi filesystem view is created just once to get latest files. Its still 
work in progress and I am yet to see how fast this is going to be. We can 
consider moving to a place where our reading in spark happens through our 
relations, and underneath we use the native readers.

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during spark query planning ? Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090098#comment-17090098
 ] 

Udit Mehrotra edited comment on HUDI-829 at 4/22/20, 11:59 PM:
---

You may also want to look at my implementation of a custom relation in Spark to 
read bootstrapped tables: 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 Here I am building my own file index using Spark's InMemoryFileIndex, but the 
filtering part is just one operation now, because once I have all the files, 
the hudi filesystem view is created just once to get the latest files. It's 
still a work in progress, and I have yet to see how fast this is going to be. 
We can consider moving to a place where our reading in spark happens through 
our relations, and underneath we use the native readers.


was (Author: uditme):
You may also want to look at my implementation of custom relation in Spark to 
read bootstrapped tables 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 . Here I am building my own file index using spark's InMemoryFileIndex, but 
the filtering part is just one operation now because once I have all the files, 
the hudi filesystem view just once to get latest files. Its still work in 
progress and I am yet to see how fast this is going to be. We can consider 
moving to a place where our reading in spark happens through our relations, and 
underneath we use the native readers.

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during spark query planning ? Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090098#comment-17090098
 ] 

Udit Mehrotra commented on HUDI-829:


You may also want to look at my implementation of a custom relation in Spark to 
read bootstrapped tables: 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 Here I am building my own file index using Spark's InMemoryFileIndex, but the 
filtering part is just one operation now, because once I have all the files, 
the hudi filesystem view is created just once to get the latest files. It's 
still a work in progress, and I have yet to see how fast this is going to be. 
We can consider moving to a place where our reading in spark happens through 
our relations, and underneath we use the native readers.

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during spark query planning ? Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090095#comment-17090095
 ] 

Udit Mehrotra commented on HUDI-829:


[~nishith29] Thanks for creating the ticket. So the issue I was talking about 
happens because of 
[HoodieROTablePathFilter|https://github.com/apache/incubator-hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java]
 which we pass to Spark to filter out the latest commit files for the 
copy-on-write use case. When we create the *parquet* datasource underneath for 
reading, Spark uses 
[InMemoryFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala]
 to maintain an index with the list of files to read from the path passed to 
the data source. While the listing of files is parallelized by Spark, the 
filter is applied sequentially on all the files, one by one 
[here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L376].
 There are also some caveats: if the number of paths to be listed is < 
*sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold*, Spark 
optimizes to do serial listing, and the filters are all evaluated sequentially 
across all partitions at the driver. If however it is > 
*sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold*, then 
listing across paths is parallelized and the filter is applied at the 
executors, but still sequentially for each file in a path. We have seen this be 
a bottleneck with S3, at least. While ideally this is something that could be 
improved inside Spark by parallelizing the filtering further, at this point I 
need to dive further into the filter code to understand if there is something 
that can be done to make it faster from the Hudi side.
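
For context, a minimal sketch of the read path being discussed, roughly as the 
Hudi docs of this era describe it (table path and layout are illustrative):

```java
import org.apache.hadoop.fs.PathFilter;
import org.apache.hudi.hadoop.HoodieROTablePathFilter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadHudiCowTable {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-ro-read").getOrCreate();

    // Register the path filter so the plain parquet datasource only sees files
    // belonging to the latest commit. InMemoryFileIndex then applies this filter
    // to every listed file, which is the sequential step described above.
    spark.sparkContext().hadoopConfiguration().setClass(
        "mapreduce.input.pathFilter.class",
        HoodieROTablePathFilter.class,
        PathFilter.class);

    Dataset<Row> df = spark.read().parquet("s3://bucket/hudi_table/*/*/*/*");
    df.count();
  }
}
```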

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during spark query planning ? Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch hudi_test_suite_refactor updated (388842c -> da3232e)

2020-04-22 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 388842c  Testing running 3 builds to limit total build time
 add da3232e  Testing running 3 builds to limit total build time

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (388842c)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (da3232e)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 ...iveSyncDagGenerator.java => HiveSyncDagGeneratorMOR.java} | 11 +--
 .../apache/hudi/testsuite/job/TestHoodieTestSuiteJob.java| 12 ++--
 2 files changed, 15 insertions(+), 8 deletions(-)
 copy 
hudi-test-suite/src/test/java/org/apache/hudi/testsuite/dag/{HiveSyncDagGenerator.java
 => HiveSyncDagGeneratorMOR.java} (94%)



[GitHub] [incubator-hudi] nsivabalan commented on issue #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-22 Thread GitBox


nsivabalan commented on issue #1402:
URL: https://github.com/apache/incubator-hudi/pull/1402#issuecomment-618094863


   @vinothchandar : I am done for the most part. Just that unit tests are 
failing in travis but succeed locally; I need to fix that. The patch in 
general is good to be reviewed. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618068101


   @vinothchandar yes, it is happening every time; I tried it 3 times and it's 
always the same. I see a hoodie commit being added in at most 5-6 minutes, and 
then it takes almost an hour to complete the EMR step. I am using AWS EMR, HUDI 
Version: 0.5.0-incubating.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] satishkotha commented on issue #1540: [HUDI-819] Fix a bug with MergeOnReadLazyInsertIterable.

2020-04-22 Thread GitBox


satishkotha commented on issue #1540:
URL: https://github.com/apache/incubator-hudi/pull/1540#issuecomment-618092361


   > +1 on the change itself.. For completeness let's also bring MergeHandle 
under the same factory implementation?
   
   I tried to fit in MergeHandle, but it doesn't seem to fit the pattern, 
because its constructor args require a recordItr at construction time. The 
other handles don't deal with a recordItr directly; LazyInsertIterable deals 
with the recordItr for these handles.  
   
   We would need to make major changes to move the recordItr from MergeHandle 
into UpdateHandler to fit it into the pattern. If you agree this is the right 
approach, I can implement it.
   
   What do you think? Let me know if you have other suggestions. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r413398227



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java
##
@@ -69,4 +76,13 @@ public long endTimer() {
 }
 return timeInfoDeque.pop().stop();
   }
+
+  public long deltaTime() {

Review comment:
   hi @vinothchandar @nsivabalan, you both think it's unnecessary, and I'm 
willing to take your suggestion. But from my side, it's better to use 
`deltaTime()`; will address it : )
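
   For readers outside the diff, a hypothetical sketch of what a 
deltaTime()-style helper could look like on a stack-based timer; HoodieTimer's 
actual fields and the merged API may differ:

   ```java
   import java.util.ArrayDeque;
   import java.util.Deque;

   public class StackTimer {
     private final Deque<Long> startTimes = new ArrayDeque<>();

     public StackTimer startTimer() {
       startTimes.push(System.currentTimeMillis());
       return this;
     }

     // endTimer() pops the most recent start time, ending that measurement.
     public long endTimer() {
       if (startTimes.isEmpty()) {
         throw new IllegalStateException("Timer was not started");
       }
       return System.currentTimeMillis() - startTimes.pop();
     }

     // deltaTime() peeks without popping, so the same timer can be sampled
     // repeatedly while it keeps running.
     public long deltaTime() {
       if (startTimes.isEmpty()) {
         throw new IllegalStateException("Timer was not started");
       }
       return System.currentTimeMillis() - startTimes.peek();
     }
   }
   ```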





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-829:
-
Summary: Efficiently reading hudi tables through spark-shell  (was: Reading 
Hudi tables through spark-shell is slow (even with 
spark.sql.hive.convertMetastoreParquet) )

> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables. Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-829) Efficiently reading hudi tables through spark-shell

2020-04-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-829:
-
Description: 
[~uditme] Created this ticket to track some discussion on read/query path of 
spark with Hudi tables. 

My understanding is that when you read Hudi tables through spark-shell, some of 
your queries are slower due to some sequential activity performed by spark when 
interacting with Hudi tables (even with spark.sql.hive.convertMetastoreParquet 
which can give you the same data reading speed and all the vectorization 
benefits). Is this slowness observed during spark query planning ? Can you 
please elaborate on this ? 

  was:
[~uditme] Created this ticket to track some discussion on read/query path of 
spark with Hudi tables. 

My understanding is that when you read Hudi tables through spark-shell, some of 
your queries are slower due to some sequential activity performed by spark when 
interacting with Hudi tables. Can you please elaborate on this ? 


> Efficiently reading hudi tables through spark-shell
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during spark query planning ? Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-829) Reading Hudi tables through spark-shell slow (even with spark.sql.hive.convertMetastoreParquet)

2020-04-22 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-829:


 Summary: Reading Hudi tables through spark-shell slow (even with 
spark.sql.hive.convertMetastoreParquet) 
 Key: HUDI-829
 URL: https://issues.apache.org/jira/browse/HUDI-829
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Spark Integration
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal


[~uditme] Created this ticket to track some discussion on read/query path of 
spark with Hudi tables. 

My understanding is that when you read Hudi tables through spark-shell, some of 
your queries are slower due to some sequential activity performed by spark when 
interacting with Hudi tables. Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-829) Reading Hudi tables through spark-shell is slow (even with spark.sql.hive.convertMetastoreParquet)

2020-04-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-829:
-
Summary: Reading Hudi tables through spark-shell is slow (even with 
spark.sql.hive.convertMetastoreParquet)   (was: Reading Hudi tables through 
spark-shell slow (even with spark.sql.hive.convertMetastoreParquet) )

> Reading Hudi tables through spark-shell is slow (even with 
> spark.sql.hive.convertMetastoreParquet) 
> ---
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> [~uditme] Created this ticket to track some discussion on read/query path of 
> spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to some sequential activity performed by spark 
> when interacting with Hudi tables. Can you please elaborate on this ? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io removed a comment on issue #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


codecov-io removed a comment on issue #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#issuecomment-618028443


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=h1) 
Report
   > Merging 
[#1469](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/a464a2972e83e648585277ebe567703c8285cf1e=desc)
 will **decrease** coverage by `0.40%`.
   > The diff coverage is `86.93%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1469/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=tree)
   
    ```diff
    @@            Coverage Diff             @@
    ##           master    #1469      +/-   ##
    ==========================================
    - Coverage   72.20%   71.79%   -0.41%
    - Complexity    289      294       +5
    ==========================================
      Files         338      381      +43
      Lines       15946    16697     +751
      Branches     1624     1682      +58
    ==========================================
    + Hits        11514    11988     +474
    - Misses       3704     3969     +265
    - Partials      728      740      +12
    ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...java/org/apache/hudi/config/HoodieIndexConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUluZGV4Q29uZmlnLmphdmE=)
 | `61.53% <0.00%> (-1.96%)` | `0.00 <0.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/index/HoodieIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXguamF2YQ==)
 | `83.33% <50.00%> (-4.91%)` | `0.00 <0.00> (ø)` | |
   | 
[...org/apache/hudi/io/HoodieBloomRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllQmxvb21SYW5nZUluZm9IYW5kbGUuamF2YQ==)
 | `81.81% <81.81%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/index/bloom/HoodieBloomIndexV2.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleFYyLmphdmE=)
 | `86.20% <86.20%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[.../apache/hudi/io/HoodieKeyAndBloomLookupHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllS2V5QW5kQmxvb21Mb29rdXBIYW5kbGUuamF2YQ==)
 | `90.00% <90.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `84.97% <100.00%> (+0.60%)` | `0.00 <0.00> (ø)` | |
   | 
[...udi/index/bloom/HoodieBloomIndexCheckFunction.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleENoZWNrRnVuY3Rpb24uamF2YQ==)
 | `88.23% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/io/HoodieKeyLookupHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllS2V5TG9va3VwSGFuZGxlLmphdmE=)
 | `100.00% <100.00%> (+10.00%)` | `0.00 <0.00> (ø)` | |
   | 
[.../java/org/apache/hudi/common/util/HoodieTimer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvSG9vZGllVGltZXIuamF2YQ==)
 | `80.95% <100.00%> (+7.61%)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `61.62% <0.00%> (-27.66%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [91 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=footer).
 Last update 

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r413395407



##
File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java
##
@@ -0,0 +1,321 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.io.HoodieBloomRangeInfoHandle;
+import org.apache.hudi.io.HoodieKeyLookupHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+/**
+ * Simplified re-implementation of {@link HoodieBloomIndex} that does not rely on caching
+ * or incur the overhead of auto-tuning parallelism.
+ */
+public class HoodieBloomIndexV2<T extends HoodieRecordPayload> extends HoodieIndex<T> {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieBloomIndexV2.class);
+
+  public HoodieBloomIndexV2(HoodieWriteConfig config) {
+super(config);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+  JavaSparkContext jsc,
+  HoodieTable<T> hoodieTable) {
+return recordRDD
+.sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), 
record.getRecordKey()),
+true, config.getBloomIndexV2Parallelism())
+.mapPartitions((itr) -> new LazyRangeBloomChecker(itr, 
hoodieTable)).flatMap(List::iterator)
+.sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
+.mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
+.filter(Option::isPresent)
+.map(Option::get);
+  }
+
+  @Override
+  public JavaPairRDD<HoodieKey, Option<Pair<String, String>>> fetchRecordLocation(JavaRDD<HoodieKey> hoodieKeys,
+  JavaSparkContext jsc, HoodieTable<T> hoodieTable) {
+throw new UnsupportedOperationException("Not yet implemented");
+  }
+
+  @Override
+  public boolean rollbackCommit(String commitTime) {
+// Nope, don't need to do anything.
+return true;
+  }
+
+  /**
+   * This is not global, since we depend on the partitionPath to do the lookup.
+   */
+  @Override
+  public boolean isGlobal() {
+return false;
+  }
+
+  /**
+   * No indexes into log files yet.
+   */
+  @Override
+  public boolean canIndexLogFiles() {
+return false;
+  }
+
+  /**
+   * Bloom filters are stored, into the same data files.
+   */
+  @Override
+  public boolean isImplicitWithStorage() {
+return true;
+  }
+
+  @Override
+  public JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD, JavaSparkContext jsc,
+   HoodieTable<T> hoodieTable) {
+return writeStatusRDD;
+  }
+
+  /**
+   * Given an iterator of hoodie records, returns a list of candidate (HoodieRecord, fileId) pairs.
+   */
+  class LazyRangeBloomChecker extends
+  LazyIterableIterator<HoodieRecord<T>, List<Pair<HoodieRecord<T>, String>>> {
+
+private HoodieTable<T> table;
+private String 
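
The tagging flow above is the heart of the change: one sorted pass that loads each file's bloom filter and key range once to emit candidate (record, fileId) pairs, then a second sorted pass that opens each candidate file once to confirm real key membership. A minimal, self-contained sketch of that two-pass shape, using toy in-memory stand-ins rather than Hudi's handles and RDDs (the bloom check is approximated by a key-set lookup, and one file per partition is assumed):

```scala
case class Rec(partition: String, key: String)

// partition -> (fileId, record keys actually present in that file)
val fileKeys: Map[String, (String, Set[String])] = Map(
  "2020/04/21" -> ("f1", Set("a", "b")),
  "2020/04/22" -> ("f2", Set("c")))

val incoming = List(Rec("2020/04/22", "c"), Rec("2020/04/21", "z"), Rec("2020/04/21", "a"))

// Pass 1: sort by (partition, key) so each file's range/bloom metadata is
// consulted once; emits candidate (record, fileId) pairs, false positives allowed.
val candidates = incoming
  .sortBy(r => (r.partition, r.key))
  .flatMap(r => fileKeys.get(r.partition).map { case (fileId, _) => (r, fileId) })

// Pass 2: sort by fileId so each file is opened once, then confirm membership
// against the keys really stored in it.
val tagged = candidates.sortBy(_._2).filter { case (r, fileId) =>
  val (fid, keys) = fileKeys(r.partition)
  fid == fileId && keys.contains(r.key)
}

tagged.foreach { case (r, fileId) => println(s"${r.key} tagged to $fileId") }
```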

[GitHub] [incubator-hudi] lamber-ken commented on issue #1526: [HUDI-1526] Add pyspark example in quickstart

2020-04-22 Thread GitBox


lamber-ken commented on issue #1526:
URL: https://github.com/apache/incubator-hudi/pull/1526#issuecomment-618086091


   > @vingov I had copied 
https://github.com/apache/spark/blob/15462e1a8fa8da54ac51f4d21f567f3c288e6701/docs/js/main.js
 and referenced the js lib like 
https://raw.githubusercontent.com/apache/spark/46be1e01e977788f00f1f5aa8d64bc5f191bc578/docs/sql-getting-started.md
 does, but had no luck. Maybe re-organize the page with the spark-shell code snippet at 
the top of the quick start guide and pyspark at the bottom of the page? 
@lamber-ken what do you think?
   
   First of all, this is a good contribution; it's not a good idea to just 
copy the `main.js`.
   OK, if it's difficult to do this, pulling it into a separate page is 
better. :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1546: Issue - Table Read fails in Spark Submit , Where as succeeds in spark-shell

2020-04-22 Thread GitBox


lamber-ken commented on issue #1546:
URL: https://github.com/apache/incubator-hudi/issues/1546#issuecomment-618082468


   > over to you @lamber-ken :)
   
   Glad to accept it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting parallelism to 200
   `hudiOptions = (HoodieWriteConfig.TABLE_NAME -> "table_name",
 DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> 
"partition_key"(date in format(/mm/dd)),
 "hoodie.upsert.shuffle.parallelism" ->"200",
 DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(load_path)`
   I will try setting the spark executor instances. Thanks
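
   For reference, the snippet above is shorthand; a compilable spark-shell version of the same options would look roughly like this (a sketch assuming `inputDF` and `load_path` are defined as in the comment, and the 0.5.x option constants):

   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.sql.SaveMode

   val hudiOptions = Map(
     HoodieWriteConfig.TABLE_NAME -> "table_name",
     DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
     DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
     DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
     DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition_key", // date-formatted values
     "hoodie.upsert.shuffle.parallelism" -> "200",
     DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)

   inputDF.write
     .format("org.apache.hudi")
     .options(hudiOptions)
     .mode(SaveMode.Append)
     .save(load_path)
   ```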



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting parallelism to 200
   hudiOptions = (HoodieWriteConfig.TABLE_NAME -> "table_name",
 DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> 
"partition_key"(date in format(/mm/dd)),
 "hoodie.upsert.shuffle.parallelism" ->"200",
 DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(load_path)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting parallelism to 200
   hudiOptions += (HoodieWriteConfig.TABLE_NAME -> "table_name",
 DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> 
"partition_key"(date in format(/mm/dd)),
 "hoodie.upsert.shuffle.parallelism" ->"200",
 DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(load_path)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting parallelism to 200
   hudiOptions += (HoodieWriteConfig.TABLE_NAME -> "table_name",
 DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
 DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
 DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
 DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> 
"partition_key"(date in format(/mm/dd)),
 "hoodie.upsert.shuffle.parallelism" ->"200",
 DataSourceWriteOptions.OPERATION_OPT_KEY -> 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(s"$s3Location")



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618068101


   @vinothchandar yes, it is happening every time; I tried it 3 times and it's 
always the same. I see a hoodie commit being added in at most 5-6 minutes and 
then it takes almost an hour to complete all the steps. I am using AWS EMR, 
HUDI Version: 0.5.0-incubating.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1546: Issue - Table Read fails in Spark Submit , Where as succeeds in spark-shell

2020-04-22 Thread GitBox


vinothchandar commented on issue #1546:
URL: https://github.com/apache/incubator-hudi/issues/1546#issuecomment-618035770


   over to you @lamber-ken :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1546: Issue - Table Read fails in Spark Submit , Where as succeeds in spark-shell

2020-04-22 Thread GitBox


vinothchandar commented on issue #1546:
URL: https://github.com/apache/incubator-hudi/issues/1546#issuecomment-618035688


   Hmmm fairly certain this is a packaging issue and some kind of class 
mismatch between hudi's hive and spark's hive? 
   
   Our docker setup uses spark 2.3 with Hive 2.x and does all that you are 
doing.. So if I have to guess, I think it may be due to Hive 1.x? are you able 
to quickly retest with Hive 2.x? 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1433: [HUDI-728]: Implement custom key generator

2020-04-22 Thread GitBox


vinothchandar commented on issue #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#issuecomment-618031058


   @pratyakshsharma sorry .. fell off my radar since I was not an assignee.. I 
do have some concerns.. Review coming by your day time :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-828) Open Questions before merging Bootstrap

2020-04-22 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090026#comment-17090026
 ] 

Balaji Varadarajan edited comment on HUDI-828 at 4/22/20, 8:44 PM:
---

HBase Shading : We are shading hbase in all hudi bundles. A side-effect of this 
is that the key classes HFile tracks in its files are also shaded. 

Example : org.apache.hudi.org.apache.hadoop.hbase.KeyValue$KeyComparator

 The implication here is that bootstrap index HFiles store these class names, 
which makes them readable only by Hudi bundles

 


was (Author: vbalaji):
HBase Shading : We are shading hbase in all hudi bundles. A side-effect of this 
is that the key classes HFile tracks in its files are also shaded. 

Example : org.apache.hudi.org.apache.hadoop.hbase.KeyValue$KeyComparator

 

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Priority: Major
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-828) Open Questions before merging Bootstrap

2020-04-22 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090026#comment-17090026
 ] 

Balaji Varadarajan commented on HUDI-828:
-

HBase Shading : We are shading hbase in all hudi bundles. A side-effect of this 
is that the key classes HFile tracks in its files are also shaded. 

Example : org.apache.hudi.org.apache.hadoop.hbase.KeyValue$KeyComparator

 

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Priority: Major
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1550: Hudi 0.5.2 inability save complex type with nullable = true [SUPPORT]

2020-04-22 Thread GitBox


vinothchandar commented on issue #1550:
URL: https://github.com/apache/incubator-hudi/issues/1550#issuecomment-618028676


   @badion This does seem directly related to the complex types issue fixed 
recently.. In 0.5.1/0.5.2 we moved from databricks-avro to spark-avro, and this 
seems like a miss. 
   
   Are you interested in a custom patch for this on top of 0.5.2? Not sure I 
follow the last sentence.. Please clarify, happy to get this moving along for 
you.. 
   
   cc @umehrot2 @zhedoubushishi as well to chime in 
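
   For anyone trying to reproduce, here is a small spark-shell sketch of the kind of schema involved, a nullable complex column (this shape is an assumption on my side, not the reporter's exact schema):

   ```scala
   import org.apache.spark.sql.functions._
   import spark.implicits._

   // An array-of-struct column that is null for some rows: the kind of
   // complex type affected by the databricks-avro -> spark-avro move.
   val df = spark.range(2).toDF("id")
     .withColumn("items",
       when($"id" === 0, array(struct(lit(1).as("a")))).otherwise(lit(null)))

   df.printSchema() // items: array<struct<a:int>> (nullable = true)
   ```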
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io commented on issue #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


codecov-io commented on issue #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#issuecomment-618028443


   # [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=h1) Report
   > Merging [#1469](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=desc) into [master](https://codecov.io/gh/apache/incubator-hudi/commit/a464a2972e83e648585277ebe567703c8285cf1e=desc) will **decrease** coverage by `0.40%`.
   > The diff coverage is `86.93%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-hudi/pull/1469/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1469      +/-   ##
   ============================================
   - Coverage     72.20%   71.79%   -0.41%
   - Complexity      289      294       +5
   ============================================
     Files           338      381      +43
     Lines         15946    16697     +751
     Branches       1624     1682      +58
   ============================================
   + Hits          11514    11988     +474
   - Misses         3704     3969     +265
   - Partials        728      740      +12
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...java/org/apache/hudi/config/HoodieIndexConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUluZGV4Q29uZmlnLmphdmE=) | `61.53% <0.00%> (-1.96%)` | `0.00 <0.00> (ø)` | |
   | [...c/main/java/org/apache/hudi/index/HoodieIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXguamF2YQ==) | `83.33% <50.00%> (-4.91%)` | `0.00 <0.00> (ø)` | |
   | [...org/apache/hudi/io/HoodieBloomRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllQmxvb21SYW5nZUluZm9IYW5kbGUuamF2YQ==) | `81.81% <81.81%> (ø)` | `0.00 <0.00> (?)` | |
   | [...rg/apache/hudi/index/bloom/HoodieBloomIndexV2.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleFYyLmphdmE=) | `86.20% <86.20%> (ø)` | `0.00 <0.00> (?)` | |
   | [.../apache/hudi/io/HoodieKeyAndBloomLookupHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllS2V5QW5kQmxvb21Mb29rdXBIYW5kbGUuamF2YQ==) | `90.00% <90.00%> (ø)` | `0.00 <0.00> (?)` | |
   | [...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=) | `84.97% <100.00%> (+0.60%)` | `0.00 <0.00> (ø)` | |
   | [...udi/index/bloom/HoodieBloomIndexCheckFunction.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleENoZWNrRnVuY3Rpb24uamF2YQ==) | `88.23% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | [...java/org/apache/hudi/io/HoodieKeyLookupHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllS2V5TG9va3VwSGFuZGxlLmphdmE=) | `100.00% <100.00%> (+10.00%)` | `0.00 <0.00> (ø)` | |
   | [.../java/org/apache/hudi/common/util/HoodieTimer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvSG9vZGllVGltZXIuamF2YQ==) | `80.95% <100.00%> (+7.61%)` | `0.00 <0.00> (ø)` | |
   | [.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh) | `61.62% <0.00%> (-27.66%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [91 more](https://codecov.io/gh/apache/incubator-hudi/pull/1469/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1469?src=pr=footer). Last update 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


vinothchandar commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618024238


   @harshi2506 the pause is weird.. As you can see, the wall clock time itself 
is small, i.e. if you subtract the time lost pausing.. 
   
   Is this reproducible, i.e. does it happen every time? What version of Hudi are you 
running?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-828) Open Questions before merging Bootstrap

2020-04-22 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-828:
---

 Summary: Open Questions before merging Bootstrap 
 Key: HUDI-828
 URL: https://issues.apache.org/jira/browse/HUDI-828
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: Balaji Varadarajan


This ticket tracks open questions that need to be resolved before we check in 
bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-828) Open Questions before merging Bootstrap

2020-04-22 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-828:

Status: Open  (was: New)

> Open Questions before merging Bootstrap 
> 
>
> Key: HUDI-828
> URL: https://issues.apache.org/jira/browse/HUDI-828
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Balaji Varadarajan
>Priority: Major
>
> This ticket tracks open questions that need to be resolved before we check in 
> bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


vinothchandar commented on issue #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#issuecomment-618017951


   @lamber-ken we are hoping to get this into the next release as well. Any 
ETAs on the final review? :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


vinothchandar commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r413300954



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java
##
@@ -69,4 +76,13 @@ public long endTimer() {
 }
 return timeInfoDeque.pop().stop();
   }
+
+  public long deltaTime() {

Review comment:
   let's just reuse the timer class as is and move on? :) I'd actually 
encourage explicit start and end calls so it's clear what is being measured? In 
any case, the current way of overloading `HoodieTimer` does not seem 
maintainable IMO 
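
   For illustration, the explicit pattern suggested above would read something like this in a shell (the unit of work is a made-up placeholder):

   ```scala
   import org.apache.hudi.common.util.HoodieTimer

   def lookupKeysInFile(): Int = { Thread.sleep(5); 42 } // stand-in for the work being measured

   val timer = new HoodieTimer().startTimer() // explicit start
   val result = lookupKeysInFile()
   val elapsedMs = timer.endTimer()           // explicit end: ms since the matching start
   ```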





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-22 Thread GitBox


vinothchandar commented on issue #1402:
URL: https://github.com/apache/incubator-hudi/pull/1402#issuecomment-618015872


   @nsivabalan any ETAs on when this will be open for a full final review? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

2020-04-22 Thread GitBox


vinothchandar commented on issue #1289:
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-618015542


   @prashantwason this will be a good candidate to fast track into the next 
release. are you still working on this? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1540: [HUDI-819] Fix a bug with MergeOnReadLazyInsertIterable.

2020-04-22 Thread GitBox


vinothchandar commented on a change in pull request #1540:
URL: https://github.com/apache/incubator-hudi/pull/1540#discussion_r413294812



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieHandleCreator.java
##
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+public interface HoodieHandleCreator<T extends HoodieRecordPayload> {

Review comment:
   Can we rename things per a standard factory pattern here?  
`WriteHandleCreatorFactory` with a `create()` method? 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/DefaultHoodieWriteHandleCreator.java
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+public class DefaultHoodieWriteHandleCreator<T extends HoodieRecordPayload> implements HoodieHandleCreator<T> {

Review comment:
   this will then become `CreateHandleFactory` 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieAppendHandleCreator.java
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+public class HoodieAppendHandleCreator<T extends HoodieRecordPayload> extends DefaultHoodieWriteHandleCreator<T> {

Review comment:
   and this `AppendHandleFactory` 
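
   Put together, the renaming suggested in these comments amounts to a standard factory shape, sketched here with stand-in types rather than the PR's actual classes:

   ```scala
   // Stand-ins for the real handle types in org.apache.hudi.io.
   trait WriteHandle
   class CreateHandle extends WriteHandle
   class AppendHandle extends WriteHandle

   // A factory with a create() method, per the review comments.
   trait WriteHandleFactory {
     def create(): WriteHandle
   }

   object CreateHandleFactory extends WriteHandleFactory {
     override def create(): WriteHandle = new CreateHandle
   }

   object AppendHandleFactory extends WriteHandleFactory {
     override def create(): WriteHandle = new AppendHandle
   }
   ```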





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617976874


   hi @harshi2506, based on the above analysis, please try to increase the 
upsert parallelism (`hoodie.upsert.shuffle.parallelism`) and the number of Spark 
executor instances, for example
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
 --master yarn \
 --driver-memory 6G \
 --num-executors 10 \
 --executor-cores 5 \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
 "hoodie.upsert.shuffle.parallelism" -> "200",
 "hoodie.datasource.write.recordkey.field" -> "id",
 "hoodie.datasource.write.partitionpath.field" -> "key", 
 "hoodie.table.name" -> tableName,
 "hoodie.datasource.write.precombine.field" -> "timestamp",
 "hoodie.memory.merge.max.size" -> "200485760"
   )
   
   val inputDF = spark.range(1, 300).
  withColumn("key", $"id").
  withColumn("data", lit("data")).
  withColumn("timestamp", current_timestamp()).
  withColumn("dt", date_format($"timestamp", "-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
 options(hudiOptions).
 mode("Append").
 save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*").show();
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1512:
URL: https://github.com/apache/incubator-hudi/pull/1512#discussion_r413255248



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##
@@ -22,7 +22,7 @@
  * Hoodie file format.
  */
 public enum HoodieFileFormat {
-  PARQUET(".parquet"), HOODIE_LOG(".log");
+  PARQUET(".parquet"), HOODIE_LOG(".log"), ORC(".orc");

Review comment:
   Reasonable, done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1512:
URL: https://github.com/apache/incubator-hudi/pull/1512#discussion_r413254988



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
##
@@ -190,6 +190,9 @@ public Operation convert(String value) throws 
ParameterException {
 @Parameter(names = {"--table-type"}, description = "Type of table. 
COPY_ON_WRITE (or) MERGE_ON_READ", required = true)
 public String tableType;
 
+@Parameter(names = {"--table-file-format"}, description = "BaseFileFormat 
of table. PARQUET (or) ORC")

Review comment:
   Reasonable, done

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java
##
@@ -251,6 +252,8 @@ public void validate(String name, String value) {
 public String tableName = null;
 @Parameter(names = {"--table-type", "-tt"}, description = "Table type", 
required = true)
 public String tableType = null;
+@Parameter(names = {"--table-file-format", "-tff"}, description = "The 
base file storage format")

Review comment:
   Reasonable, done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617976874


   hi @harshi2506, based on the above analysis, please try to increase the 
upsert parallelism and the number of Spark executor instances, for example
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
 --master yarn \
 --driver-memory 6G \
 --num-executors 10 \
 --executor-cores 5 \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
 "hoodie.upsert.shuffle.parallelism" -> "200",
 "hoodie.datasource.write.recordkey.field" -> "id",
 "hoodie.datasource.write.partitionpath.field" -> "key", 
 "hoodie.table.name" -> tableName,
 "hoodie.datasource.write.precombine.field" -> "timestamp",
 "hoodie.memory.merge.max.size" -> "200485760"
   )
   
   val inputDF = spark.range(1, 300).
  withColumn("key", $"id").
  withColumn("data", lit("data")).
  withColumn("timestamp", current_timestamp()).
  withColumn("dt", date_format($"timestamp", "-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
 options(hudiOptions).
 mode("Append").
 save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*").show();
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617965381


   From the detailed commit metadata and the Spark UI above, we know
   
   1. the first commit wrote about 700 million records.
   2. the upsert wrote 2633 records and touched 255 partitions.
   
   **Key point**: a small input upserted into large old data file slices, touching many partitions.
   
   This is a different case, so the `hoodie.memory.merge.max.size` option will not help.
   So I think most of the time goes into reading the old files, WDYT? @vinothchandar 
   
   
![image](https://user-images.githubusercontent.com/20113411/80020370-7f97a300-850b-11ea-9489-8e7002e98cda.png)
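
   A quick way to see this effect from the input side (assuming `inputDF` and the partition column from the issue): count the distinct partitions an incoming batch touches, since under copy-on-write every touched file slice is read and rewritten in full.

   ```scala
   val touched = inputDF.select("partition_key").distinct().count()
   println(s"this batch touches $touched partitions") // 255 here, per the commit metadata
   ```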
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch master updated: [MINOR]: Fix cli docs for DeltaStreamer (#1547)

2020-04-22 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 19cc15c  [MINOR]: Fix cli docs for DeltaStreamer (#1547)
19cc15c is described below

commit 19cc15c0987043175685aaeb45facb15af23e34f
Author: dengziming 
AuthorDate: Thu Apr 23 02:37:17 2020 +0800

[MINOR]: Fix cli docs for DeltaStreamer (#1547)
---
 .../org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java| 2 +-
 .../hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
index 3b714b6..99bf696 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
@@ -185,7 +185,7 @@ public class HoodieDeltaStreamer implements Serializable {
 "file://" + System.getProperty("user.dir") + 
"/src/test/resources/delta-streamer-config/dfs-source.properties";
 
 @Parameter(names = {"--hoodie-conf"}, description = "Any configuration 
that can be set in the properties file "
-+ "(using the CLI parameter \"--propsFilePath\") can also be passed 
command line using this parameter")
++ "(using the CLI parameter \"--props\") can also be passed command 
line using this parameter")
 public List<String> configs = new ArrayList<>();
 
 @Parameter(names = {"--source-class"},
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
index 58501f6..e43bca3 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
@@ -226,7 +226,7 @@ public class HoodieMultiTableDeltaStreamer {
 "file://" + System.getProperty("user.dir") + 
"/src/test/resources/delta-streamer-config/dfs-source.properties";
 
 @Parameter(names = {"--hoodie-conf"}, description = "Any configuration 
that can be set in the properties file "
-+ "(using the CLI parameter \"--propsFilePath\") can also be passed 
command line using this parameter")
++ "(using the CLI parameter \"--props\") can also be passed command 
line using this parameter")
 public List<String> configs = new ArrayList<>();
 
 @Parameter(names = {"--source-class"},



[incubator-hudi] branch hudi_test_suite_refactor updated (8a0380c -> 388842c)

2020-04-22 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


omit 8a0380c  Testing running 3 builds to limit total build time
 add 388842c  Testing running 3 builds to limit total build time

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (8a0380c)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (388842c)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 scripts/run_travis_tests.sh | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)



[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617906795


   Hi @lamber-ken, thanks for the response, attaching the screenshots:
   https://user-images.githubusercontent.com/64137937/80011182-24ed4f80-84e9-11ea-8a44-938a8d352b6f.png
   https://user-images.githubusercontent.com/64137937/80011201-29b20380-84e9-11ea-86af-f526ff2d44d7.png
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-773) Hudi On Azure Data Lake Storage V2

2020-04-22 Thread Sasikumar Venkatesh (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089852#comment-17089852
 ] 

Sasikumar Venkatesh commented on HUDI-773:
--

[~vinoth] and [~garyli1019]

My setup is as follows.

I have a file system in ADLS2 (similar to an S3 bucket) and set up OAuth-based 
connections
{code:java}
spark.conf.set("fs.azure.account.auth.type.<>.dfs.core.windows.net",
 "OAuth") 
spark.conf.set("fs.azure.account.oauth.provider.type.<>.dfs.core.windows.net",
 "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") 
spark.conf.set("fs.azure.account.oauth2.client.id.<>.dfs.core.windows.net",
 "**redacted") 
spark.conf.set("fs.azure.account.oauth2.client.secret.<>.dfs.core.windows.net",
 "**redacted") 
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<>.dfs.core.windows.net",
 "https://login.microsoftonline.com/<>/oauth2/token")
{code}
The Hudi write is as follows:
{code:java}
val tableName = "customer" 
customerDF.write.format("org.apache.hudi") .options(hudiOptions) 
.mode(SaveMode.Overwrite).save("abfss://<>.dfs.core.windows.net/hudi-tables/customer")
{code}
I could see that when I try to write in Hudi format, it tries to 
load the ADLS credentials from org.apache.hadoop.fs.azurebfs.AbfsConfiguration, 
since my credential properties are set under `fs.azure.account.*`.

*I am wondering whether there is any special config I need to add to my Spark conf to 
write to ADLS in Hudi format.* 

The stack trace of the error is given below:
{code:java}
Configuration property <>.dfs.core.windows.net not found. at 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:392)
 at 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1008)
 at 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.(AzureBlobFileSystemStore.java:151)
 at 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:106)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669) at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at 
org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:81) at 
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91) at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:147)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:188)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:184) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:135) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:118)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:116) 
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
 at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:113)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:242)
 at 
org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:99)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:172)
 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710) 
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:306) 
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:292) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:235) at 
line8353df56311d44ef989b6a6d378b55bd92.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-4384834320483321:4)
 at 
line8353df56311d44ef989b6a6d378b55bd92.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-4384834320483321:73)
 at 
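
One thing worth trying (an assumption on my side, not a verified fix): the trace shows Hudi obtaining the FileSystem from the Hadoop configuration via Path.getFileSystem, so the fs.azure.* credentials may need to be set on the Hadoop configuration directly rather than only through spark.conf, e.g.:
{code:scala}
// Placeholders like <storage-account> stand in for the redacted values above.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
hc.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hc.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<client-id>")
hc.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", "<client-secret>")
hc.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
{code}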

[incubator-hudi] branch master updated (26684f5 -> aea7c16)

2020-04-22 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 26684f5  [HUDI-816] Fixed MAX_MEMORY_FOR_MERGE_PROP and 
MAX_MEMORY_FOR_COMPACTION_PROP do not work due to HUDI-678 (#1536)
 add aea7c16  [HUDI-795] Handle auto-deleted empty aux folder (#1515)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/table/HoodieCommitArchiveLog.java  | 24 +++---
 1 file changed, 21 insertions(+), 3 deletions(-)



[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

2020-04-22 Thread GitBox


lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617895350


   Hi @harshi2506, please share the Spark stage UI, thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #1548: [WIP] [HUDI-785] Refactor compaction/savepoint execution based on ActionExecutor abstraction

2020-04-22 Thread GitBox


vinothchandar commented on issue #1548:
URL: https://github.com/apache/incubator-hudi/pull/1548#issuecomment-617887802


   CI does seem to have passed.. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-810) Migrate HoodieClientTestHarness to JUnit 5

2020-04-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-810:

Labels: pull-request-available  (was: )

> Migrate HoodieClientTestHarness to JUnit 5
> --
>
> Key: HUDI-810
> URL: https://issues.apache.org/jira/browse/HUDI-810
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Once HoodieClientTestHarness is migrated, the whole hudi-cli package can be 
> migrated altogether.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] xushiyan opened a new pull request #1553: [HUDI-810] Migrate ClientTestHarness to JUnit 5

2020-04-22 Thread GitBox


xushiyan opened a new pull request #1553:
URL: https://github.com/apache/incubator-hudi/pull/1553


   WIP
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch asf-site updated: Travis CI build asf-site

2020-04-22 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 4254e60  Travis CI build asf-site
4254e60 is described below

commit 4254e606504a6b0351176f8bb59f9a830d4b66e6
Author: CI 
AuthorDate: Wed Apr 22 15:53:20 2020 +

Travis CI build asf-site
---
 content/assets/js/lunr/lunr-store.js | 2 +-
 content/cn/docs/writing_data.html| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/assets/js/lunr/lunr-store.js 
b/content/assets/js/lunr/lunr-store.js
index 70f1edb..1d0335b 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -600,7 +600,7 @@ var store = [{
 "url": "https://hudi.apache.org/docs/concepts.html;,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
 "title": "写入 Hudi 数据集",
-"excerpt":"这一节我们将介绍使用DeltaStreamer工具从外部源甚至其他Hudi数据集摄取新更改的方法, 
以及通过使用Hudi数据源的upserts加快大型Spark作业的方法。 对于此类数据集,我们可以使用各种查询引擎查询它们。 写操作 
在此之前,了解Hudi数据源及delta streamer工具提供的三种不同的写操作以及如何最佳利用它们可能会有所帮助。 
这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。 UPSERT(插入更新) 
:这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。 
在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之类后,这些记录最终会被写入。 
对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。 INSERT(插入) :就使用启发式方法确定文件大小而言,此操作与插入更新(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。 
因此,对于日志重复数据删除等用例(结合下面提到的过滤重复项的选项),它可以比插入更新快得多。 插入也适用于这种用 [...]
+"excerpt":"这一节我们将介绍使用DeltaStreamer工具从外部源甚至其他Hudi数据集摄取新更改的方法, 
以及通过使用Hudi数据源的upserts加快大型Spark作业的方法。 对于此类数据集,我们可以使用各种查询引擎查询它们。 写操作 
在此之前,了解Hudi数据源及delta streamer工具提供的三种不同的写操作以及如何最佳利用它们可能会有所帮助。 
这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。 UPSERT(插入更新) 
:这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。 
在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之后,这些记录最终会被写入。 
对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。 INSERT(插入) :就使用启发式方法确定文件大小而言,此操作与插入更新(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。 
因此,对于日志重复数据删除等用例(结合下面提到的过滤重复项的选项),它可以比插入更新快得多。 插入也适用于这种用例 [...]
 "tags": [],
 "url": "https://hudi.apache.org/cn/docs/writing_data.html;,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
diff --git a/content/cn/docs/writing_data.html 
b/content/cn/docs/writing_data.html
index 24ed3d0..c7f7903 100644
--- a/content/cn/docs/writing_data.html
+++ b/content/cn/docs/writing_data.html
@@ -356,7 +356,7 @@
 
 
   UPSERT(插入更新) :这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。
- 在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之类后,这些记录最终会被写入。
+ 在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之后,这些记录最终会被写入。
  对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。
   INSERT(插入) 
:就使用启发式方法确定文件大小而言,此操作与插入更新(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。
  因此,对于日志重复数据删除等用例(结合下面提到的过滤重复项的选项),它可以比插入更新快得多。



[incubator-hudi] branch asf-site updated: [HUDI-827] fix translation error (#1551)

2020-04-22 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 55dde43  [HUDI-827] fix translation error (#1551)
55dde43 is described below

commit 55dde438e8adf24b0b03bde589d4cee9b3cb663c
Author: wanglisheng81 <37138788+wanglishen...@users.noreply.github.com>
AuthorDate: Wed Apr 22 23:33:34 2020 +0800

[HUDI-827] fix translation error (#1551)
---
 docs/_docs/2_2_writing_data.cn.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/2_2_writing_data.cn.md 
b/docs/_docs/2_2_writing_data.cn.md
index aa3ccef..33fb2dc 100644
--- a/docs/_docs/2_2_writing_data.cn.md
+++ b/docs/_docs/2_2_writing_data.cn.md
@@ -18,7 +18,7 @@ language: cn
 这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。
 
  - **UPSERT(插入更新)** :这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。
- 在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之类后,这些记录最终会被写入。
+ 在运行启发式方法以确定如何最好地将这些记录放到存储上,如优化文件大小之后,这些记录最终会被写入。
  对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。
  - **INSERT(插入)** :就使用启发式方法确定文件大小而言,此操作与插入更新(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。
  因此,对于日志重复数据删除等用例(结合下面提到的过滤重复项的选项),它可以比插入更新快得多。



[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r413014968



##
File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java
##
@@ -0,0 +1,321 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.io.HoodieBloomRangeInfoHandle;
+import org.apache.hudi.io.HoodieKeyLookupHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+/**
+ * Simplified re-implementation of {@link HoodieBloomIndex} that does not rely on caching or
+ * incur the overhead of auto-tuning parallelism.
+ */
+public class HoodieBloomIndexV2<T extends HoodieRecordPayload> extends HoodieIndex<T> {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieBloomIndexV2.class);
+
+  public HoodieBloomIndexV2(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+      JavaSparkContext jsc,
+      HoodieTable<T> hoodieTable) {
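+    // Phase 1: sort by (partitionPath, recordKey) and prune candidates per record
+    // via file range info and bloom filters, emitting (record, fileId) pairs.
+    // Phase 2: sort candidates by fileId and lazily verify each key against the
+    // actual file, so no index state needs to be cached in memory.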
+    return recordRDD
+        .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
+            true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable)).flatMap(List::iterator)
+        .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
+        .filter(Option::isPresent)
+        .map(Option::get);
+  }
+
+  @Override
+  public JavaPairRDD<HoodieKey, Option<Pair<String, String>>> fetchRecordLocation(JavaRDD<HoodieKey> hoodieKeys,
+      JavaSparkContext jsc, HoodieTable<T> hoodieTable) {
+    throw new UnsupportedOperationException("Not yet implemented");
+  }
+
+  @Override
+  public boolean rollbackCommit(String commitTime) {
+    // Nope, don't need to do anything.
+    return true;
+  }
+
+  /**
+   * This is not global, since we depend on the partitionPath to do the lookup.
+   */
+  @Override
+  public boolean isGlobal() {
+    return false;
+  }
+
+  /**
+   * No indexes into log files yet.
+   */
+  @Override
+  public boolean canIndexLogFiles() {
+    return false;
+  }
+
+  /**
+   * Bloom filters are stored in the same data files.
+   */
+  @Override
+  public boolean isImplicitWithStorage() {
+    return true;
+  }
+
+  @Override
+  public JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD, JavaSparkContext jsc,
+      HoodieTable<T> hoodieTable) {
+    return writeStatusRDD;
+  }
+
+  /**
+   * Given an iterator of hoodie records, returns a list of candidate (HoodieRecord, fileId) pairs.
+   */
+  class LazyRangeBloomChecker extends

Review comment:
   Done.


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412995625



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -430,6 +430,14 @@ public boolean getBloomIndexUpdatePartitionPath() {
     return Boolean.parseBoolean(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH));
   }
 
+  public int getBloomIndexV2Parallelism() {
+    return Integer.parseInt(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_V2_PARALLELISM_PROP));
+  }
+
+  public long getBloomIndexV2BufferSize() {

Review comment:
   Good catch
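   
   If the default is not pre-seeded into `props`, `Long.parseLong` would see a null property value. A sketch of an explicit fallback (illustrative only; the two-argument `getProperty` overload supplies a default, and `DEFAULT_BLOOM_INDEX_V2_BUFFER_SIZE` is an assumed constant name):
   ```java
   // Sketch only: explicit fallback so Long.parseLong never receives null.
   public long getBloomIndexV2BufferSize() {
     return Long.parseLong(props.getProperty(
         HoodieIndexConfig.BLOOM_INDEX_V2_BUFFER_SIZE_PROP,
         HoodieIndexConfig.DEFAULT_BLOOM_INDEX_V2_BUFFER_SIZE)); // assumed default constant
   }
   ```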





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412992192




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412988167



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java
##
@@ -69,4 +76,13 @@ public long endTimer() {
     }
     return timeInfoDeque.pop().stop();
   }
+
+  public long deltaTime() {

Review comment:
   hi @nsivabalan, as shown below, we would need to call `startTimer()` and `endTimer()` repeatedly.
   ```
   // HoodieTimer 
   T0: HoodieTimer timer = new HoodieTimer(); timer.startTimer();
   T1: timer.endTimer(); timer.startTimer();
   T2: timer.endTimer(); timer.startTimer();
   T3: timer.endTimer(); timer.startTimer();  
   T4: timer.endTimer(); timer.startTimer();  
   
   // Delta
   T0: HoodieTimer timer = new HoodieTimer();
   T1: timer.deltaTime();
   T2: timer.deltaTime();
   T3: timer.deltaTime();
   T4: timer.deltaTime();
   ```
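   
   For illustration, a minimal delta-style timer could look like the sketch below (a sketch only; `DeltaTimer` and `deltaTimeMs` are hypothetical names, not the actual `HoodieTimer` API):
   ```java
   import java.util.concurrent.TimeUnit;
   
   // Sketch of delta semantics: each call returns the time elapsed since the
   // previous call (or since construction for the first call), so one instance
   // can measure T0 -> T1 -> T2 -> ... without pairing start/end calls.
   public class DeltaTimer {
   
     private long lastInstantNanos = System.nanoTime();
   
     public long deltaTimeMs() {
       long now = System.nanoTime();
       long deltaMs = TimeUnit.NANOSECONDS.toMillis(now - lastInstantNanos);
       lastInstantNanos = now; // re-baseline for the next call
       return deltaMs;
     }
   }
   ```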
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


nsivabalan commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412955401



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -430,6 +430,14 @@ public boolean getBloomIndexUpdatePartitionPath() {
     return Boolean.parseBoolean(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH));
   }
 
+  public int getBloomIndexV2Parallelism() {
+    return Integer.parseInt(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_V2_PARALLELISM_PROP));
+  }
+
+  public long getBloomIndexV2BufferSize() {

Review comment:
   nit: MaxBufferSize() when you change the name as per previous comment
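   
   A rough sketch of the suggested rename (illustrative only; assumes the property constant in HoodieIndexConfig is renamed in lock-step):
   ```java
   // Sketch: renamed getter; BLOOM_INDEX_V2_BUFFER_MAX_SIZE_PROP is an assumed constant name.
   public long getBloomIndexV2BufferMaxSize() {
     return Long.parseLong(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_V2_BUFFER_MAX_SIZE_PROP));
   }
   ```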

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412965435




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412965622



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
##
@@ -68,6 +68,13 @@
   public static final String BLOOM_INDEX_KEYS_PER_BUCKET_PROP = "hoodie.bloom.index.keys.per.bucket";
   public static final String DEFAULT_BLOOM_INDEX_KEYS_PER_BUCKET = "1000";
 
+  public static final String BLOOM_INDEX_V2_PARALLELISM_PROP = "hoodie.bloom.index.v2.parallelism";
+  public static final String DEFAULT_BLOOM_V2_INDEX_PARALLELISM = "20";
+
+  public static final String BLOOM_INDEX_V2_BUFFER_SIZE_PROP = "hoodie.bloom.index.v2.buffer.max.size";

Review comment:
   Done
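   
   As a usage sketch, the two knobs from the diff above could be seeded into a job's properties like this (values are examples only, not recommendations):
   ```java
   import java.util.Properties;
   
   public class BloomV2ConfigExample {
     public static void main(String[] args) {
       Properties props = new Properties();
       // Key names come from the diff above; the values here are illustrative.
       props.setProperty("hoodie.bloom.index.v2.parallelism", "40");           // default is "20"
       props.setProperty("hoodie.bloom.index.v2.buffer.max.size", "33554432"); // e.g. a 32 MB cap
       System.out.println(props);
     }
   }
   ```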





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412965164




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-04-22 Thread GitBox


lamber-ken commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r412946904



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieTimer.java
##
@@ -69,4 +76,13 @@ public long endTimer() {
     }
     return timeInfoDeque.pop().stop();
   }
+
+  public long deltaTime() {

Review comment:
   `endTimer` only supports one-to-one pairing with `startTimer`; what we need here are `delta` semantics.
   If we used `endTimer`, we would have to create many instances, whereas a single instance suffices with `delta`. 
   
   
![image](https://user-images.githubusercontent.com/20113411/79982927-9ae9ba80-84d9-11ea-84af-1a6e7565a482.png)
   
   
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-827) Translation error

2020-04-22 Thread Lisheng Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Wang updated HUDI-827:
--
Status: In Progress  (was: Open)

> Translation error
> -
>
> Key: HUDI-827
> URL: https://issues.apache.org/jira/browse/HUDI-827
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: docs-chinese
>Affects Versions: 0.5.2
>Reporter: Lisheng Wang
>Assignee: Lisheng Wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> found a translation error in 
> [https://hudi.apache.org/cn/docs/writing_data.html]: 
> "如优化文件大小之类后" should be "如优化文件大小之后" 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-827) Translation error

2020-04-22 Thread Lisheng Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Wang updated HUDI-827:
--
Status: Open  (was: New)

> Translation error
> -
>
> Key: HUDI-827
> URL: https://issues.apache.org/jira/browse/HUDI-827
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: docs-chinese
>Affects Versions: 0.5.2
>Reporter: Lisheng Wang
>Assignee: Lisheng Wang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> found a translation error in 
> [https://hudi.apache.org/cn/docs/writing_data.html]: 
> "如优化文件大小之类后" should be "如优化文件大小之后" 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1547: [MINOR]: Fix annotations in HoodieDeltaStreamer

2020-04-22 Thread GitBox


pratyakshsharma commented on issue #1547:
URL: https://github.com/apache/incubator-hudi/pull/1547#issuecomment-617745980


   LGTM. :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] dengziming commented on a change in pull request #1547: [MINOR]: Fix annotations in HoodieDeltaStreamer

2020-04-22 Thread GitBox


dengziming commented on a change in pull request #1547:
URL: https://github.com/apache/incubator-hudi/pull/1547#discussion_r412932116



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
##
@@ -185,7 +185,7 @@ public Operation convert(String value) throws ParameterException {
       "file://" + System.getProperty("user.dir") + "/src/test/resources/delta-streamer-config/dfs-source.properties";
 
   @Parameter(names = {"--hoodie-conf"}, description = "Any configuration that can be set in the properties file "
-      + "(using the CLI parameter \"--propsFilePath\") can also be passed command line using this parameter")

Review comment:
   thank you for the reminder, done!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on issue #1551: [HUDI-827] fix translation error

2020-04-22 Thread GitBox


lamber-ken commented on issue #1551:
URL: https://github.com/apache/incubator-hudi/pull/1551#issuecomment-617744298


   LGTM, let's wait for @leesf to make a final pass.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1538: [HUDI-803]: added more test cases in TestHoodieAvroUtils.class

2020-04-22 Thread GitBox


pratyakshsharma commented on issue #1538:
URL: https://github.com/apache/incubator-hudi/pull/1538#issuecomment-617743201


   > I need to wrap my head around some of this myself.. So please give me some time to review..
   
   Sure. :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1538: [HUDI-803]: added more test cases in TestHoodieAvroUtils.class

2020-04-22 Thread GitBox


pratyakshsharma commented on a change in pull request #1538:
URL: https://github.com/apache/incubator-hudi/pull/1538#discussion_r412927375



##
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##
@@ -60,7 +61,7 @@
   private static ThreadLocal<BinaryDecoder> reuseDecoder = ThreadLocal.withInitial(() -> null);
 
   // All metadata fields are optional strings.
-  private static final Schema METADATA_FIELD_SCHEMA =
+  public static final Schema METADATA_FIELD_SCHEMA =

Review comment:
   Done. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1515: [HUDI-795] Ignoring missing aux folder

2020-04-22 Thread GitBox


pratyakshsharma commented on issue #1515:
URL: https://github.com/apache/incubator-hudi/pull/1515#issuecomment-617739210


   LGTM.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1433: [HUDI-728]: Implement custom key generator

2020-04-22 Thread GitBox


pratyakshsharma commented on issue #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#issuecomment-617737506


   @vinothchandar Let us close this? :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on issue #765: [WIP] Fix KafkaAvroSource to use the latest schema

2020-04-22 Thread GitBox


pratyakshsharma commented on issue #765:
URL: https://github.com/apache/incubator-hudi/pull/765#issuecomment-617735742


   > @pratyakshsharma Nope.. all yours if you want to take a run at it
   
   Sure, will be working on it next then :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



