date:20200414

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408598302
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   > parquet-avro reader breaks as it expects every field in parquet schema to 
be present in avro. I have a fix to parquet-avro to skip those fields, will 
send a PR to show so we can discuss whether it is a good idea or not
   
   Yes, please. Would love to see how you are approaching that fix. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[incubator-hudi] branch hudi_test_suite_refactor updated (39105e4 -> a7a3655)

2020-04-14 Thread nagarwal

This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


omit 39105e4  more fixes
 add a7a3655  more fixes

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (39105e4)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (a7a3655)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../reader/DFSHoodieDatasetInputReader.java  |  2 +-
 .../hudi/testsuite/TestDFSDeltaWriterAdapter.java|  2 +-
 .../hudi/testsuite/TestFileDeltaInputWriter.java |  2 +-
 .../reader/TestDFSHoodieDatasetInputReader.java  | 20 +---
 .../org/apache/hudi/utilities/UtilitiesTestBase.java |  6 +++---
 .../hudi/utilities/sources/TestCsvDFSSource.java |  2 +-
 6 files changed, 16 insertions(+), 18 deletions(-)

[jira] [Updated] (HUDI-796) Rewrite DedupeSparkJob.scala without considering the _hoodie_commit_time



 [ 
https://issues.apache.org/jira/browse/HUDI-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-796:
--
Status: Open  (was: New)

> Rewrite DedupeSparkJob.scala without considering the _hoodie_commit_time
> 
>
> Key: HUDI-796
> URL: https://issues.apache.org/jira/browse/HUDI-796
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> _`_hoodie_commit_time` can only be used for deduping a partition path if 
> duplicates happened due to INSERT operation. In case of updates, bloom filter 
> tags both the files where a record is present for update, and all such files 
> will have the same `___hoodie_commit_time__` for a duplicate record 
> henceforth._ 
> _Hence it makes sense to rewrite this class without considering the metadata 
> field._ 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (HUDI-796) Rewrite DedupeSparkJob.scala without considering the _hoodie_commit_time

Pratyaksh Sharma created HUDI-796:
-

 Summary: Rewrite DedupeSparkJob.scala without considering the 
_hoodie_commit_time
 Key: HUDI-796
 URL: https://issues.apache.org/jira/browse/HUDI-796
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Pratyaksh Sharma
Assignee: Pratyaksh Sharma


_`_hoodie_commit_time` can only be used for deduping a partition path if 
duplicates happened due to INSERT operation. In case of updates, bloom filter 
tags both the files where a record is present for update, and all such files 
will have the same `___hoodie_commit_time__` for a duplicate record 
henceforth._ 

_Hence it makes sense to rewrite this class without considering the metadata 
field._ 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HUDI-432) Benchmark HFile for scan vs seek



[ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083780#comment-17083780
 ] 

Vinoth Chandar commented on HUDI-432:
-

[~uditme] [~rahulbhartia] [~vbalaji] : this is the benchmarking done so far... 

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Performance, Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, 
> Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 
> AM.png
>
>
> We want to benchmark HFile scan vs seek as we intend to use HFile to record 
> indexing. HFile will be used inline in hudi log for index purposes. 
> So, as part of benchmarking, we want to see when does scan out performs seek. 
> This is our experiment set up.
> keysToRead = no of keys to be looked up. // differs for different exp runs 
> like 100k, 200k, 500k, 1M. 
> N = no of iterations
>  
> {code:java}
> 1M entries were written to a single HFile as key value pairs. 
> Also, stored the keys in a separate file(key_file).
> keyList = read all keys from key_file
> for N no of iterations
> {
> shuffle keyList 
> trim the list to keysToRead 
> start timer HFile 
> read benchmark(scan/seek) 
> end timer
> }
> found avg for all timers captured
> {code}
>  
>  
> Result:
> Scan outperforms seek somewhere around 350k to 400k look ups out of 1M 
> entries with optimized configs.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Build failed in Jenkins: hudi-snapshot-deployment-0.5 #248

2020-04-14 Thread Apache Jenkins Server

See 


Changes:


--
[...truncated 2.37 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @

[jira] [Commented] (HUDI-760) Remove Rolling Stat management from Hudi Writer

2020-04-14 Thread renyi.bao (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083742#comment-17083742
 ] 

renyi.bao commented on HUDI-760:


hi,bv

     would you describe the problem in detail? I want to get involved

> Remove Rolling Stat management from Hudi Writer
> ---
>
> Key: HUDI-760
> URL: https://issues.apache.org/jira/browse/HUDI-760
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-wanted, newbie,
> Fix For: 0.6.0
>
>
> Current implementation of rolling stat is not scalable. As Consolidated 
> Metadata will be implemented eventually, we can have one design to manage 
> file-level stats too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] dengziming commented on issue #1151: [WIP][HUDI-476] Add hudi-examples module

dengziming commented on issue #1151: [WIP][HUDI-476] Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-613764273
 
 
   Thanks for your reply @vinothchandar . Do you mean we can just delete the 3 
XxxDeltaStreamerExample and replace them with a  run_examples.sh and some 
config files?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

vinothchandar edited a comment on issue #1457: [HUDI-741] Added checks to 
validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-613761497
 
 
   @prashantwason can you please 
   
   - reduce the size of the SchemaCompatibility file we copied from avro, 
reusing as much as we can.. this will also be a forcing function to revisit and 
pick up fixes here time and again if we update avro. 
   - please look at NOTICE/LICENSE file for how we have attributed other such 
code reuse and attribute them as well  
   
   
   I am not sure if the stricter checking we add here will affect the issue 
listed here  
https://github.com/apache/incubator-hudi/pull/1406#issuecomment-606240326 .. 
basically there was a difference in namespace (which IIUC matters when 
comparing schemas)... I am wondering if we can read 0.5.0 data using 0.5.2 for 
eg


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

vinothchandar commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-613761497
 
 
   @prashantwason can you please 
   -reduce the size of the SchemaCompatibility file we copied from avro, 
reusing as much as we can.. this will also be a forcing function to revisit and 
pick up fixes here time and again if we update avro. 
   - please look at NOTICE/LICENSE file for how we have attributed other such 
code reuse and attribute them as well  
   
   
   I am not sure if the stricter checking we add here will affect the issue 
listed here  
https://github.com/apache/incubator-hudi/pull/1406#issuecomment-606240326 .. 
basically there was a difference in namespace (which IIUC matters when 
comparing schemas)... I am wondering if we can read 0.5.0 data using 0.5.2 for 
eg


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Assigned] (HUDI-427) Implement CLI support for performing bootstrap

2020-04-14 Thread leesf (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-427:
--

Assignee: Wenning Ding

> Implement CLI support for performing bootstrap
> --
>
> Key: HUDI-427
> URL: https://issues.apache.org/jira/browse/HUDI-427
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.6.0
>
>
> Need CLI to perform bootstrap as described in 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

afilipchik opened a new pull request #1518: [HUDI-723] Register avro schema if 
infered from SQL transformation
URL: https://github.com/apache/incubator-hudi/pull/1518
 
 
   ## What is the purpose of the pull request
   If schema is inferred from spark dataframe after Sql Transformation. Not 
sure if it is going to work for continuous mode, need advice. 
   
   ## Brief change log
   
 - Modified DeltaSync to support force schema registration and made sure 
new schema is registered if Transformer is used. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik closed pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

afilipchik closed pull request #1518: [HUDI-723] Register avro schema if 
infered from SQL transformation
URL: https://github.com/apache/incubator-hudi/pull/1518
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Assigned] (HUDI-427) Implement CLI support for performing bootstrap

2020-04-14 Thread leesf (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-427:
--

Assignee: (was: leesf)

> Implement CLI support for performing bootstrap
> --
>
> Key: HUDI-427
> URL: https://issues.apache.org/jira/browse/HUDI-427
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Need CLI to perform bootstrap as described in 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] yanghua commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5

yanghua commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5
URL: https://github.com/apache/incubator-hudi/pull/1504#issuecomment-613755671
 
 
   @xushiyan thanks for the update. I will take a look today.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (HUDI-781) Re-design test utilities

2020-04-14 Thread vinoyang (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083724#comment-17083724
 ] 

vinoyang commented on HUDI-781:
---

+1 to distinguish the pure unit tests and engine-binding tests, this action 
would be helpful to unify more in the future.

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to re-designed with considerations like
>  * Use more mockings
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] vinothchandar commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

vinothchandar commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-613754532
 
 
   Just to level set, we have relied always to an external system validating 
the schema backwards compatibility outside of Hudi. bringing this inside, seems 
like a good idea based on what we have learnt in production over the past year 
seems like.
   
   I am supportive of that..  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks



[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083712#comment-17083712
 ] 

Vinoth Chandar commented on HUDI-741:
-

[~pwason] once we merge this, could you also write a small blog post around the 
content posted in this JIRA.. this is worth upleveling and sharing broadly as 
schema evolution does come up often..

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 120h
>  Time Spent: 10m
>  Remaining Estimate: 119h 50m
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] vinothchandar commented on issue #1151: [WIP][HUDI-476] Add hudi-examples module

vinothchandar commented on issue #1151: [WIP][HUDI-476] Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-613743806
 
 
   @dengziming thanks for getting back..  
   
   I think we can solve both problems for deltastreamer by simply reusing the 
deltastreamer from the utilities-bundle and just provide the command line 
arguments and config files via the run_examples.sh script...
   
   DataSource examples can hopefully be easily dealt with.. it may be worth it 
to even get that working first, if it helps us move forward.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] kwondw commented on issue #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

kwondw commented on issue #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#issuecomment-613743632
 
 
   This conflict isn't due to my change, it seems every time there is update on 
upstream master branch, the conflict could happen, I will merge with master 
once code review is all done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] kwondw commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

kwondw commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r408507089
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -55,6 +55,7 @@
   private static final String DEFAULT_PARALLELISM = "1500";
   private static final String INSERT_PARALLELISM = 
"hoodie.insert.shuffle.parallelism";
   private static final String BULKINSERT_PARALLELISM = 
"hoodie.bulkinsert.shuffle.parallelism";
+  private static final String BULKINSERT_USER_DEFINED_PARTITIONER_CLASS = 
"hoodie.bulkinsert.user_defined.partitioner.class";
 
 Review comment:
   it has been updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   On schema generation, we had to do some stuff... Our service emits protobufs 
to kafka, which we transform to avro using ProtoToAvro converter. Because those 
are service to service messages, they are not perfect and we had to do some 
bending to make sure schema can evolve. Some things: 
   1) We made everything optional
   2) Avro and Protobuf allows records to have no fields. But parquet doesn't, 
so, we inject marker field called exists pretty much everywhere. We do it 
inside HUDI to avoid clutter in kafka. Originally added it only to records with 
no fields, but found out that when fields are added and exists is autoremoved, 
parquet-avro reader breaks as it expects every field in parquet schema to be 
present in avro. I have a fix to parquet-avro to skip those fields, will send a 
PR to show so we can discuss whether it is a good idea or not. 
   3) Then we sometimes run sql transformation (same kafka stream produces 
multiple outputs). In this case, we infer schema from Spark (using 
NullTargetSchemaProvider). This schema must be correct as well which is not the 
case as spark omits defaults, so compaction breaks. Have a PR to address.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   On schema generation, we had to do some stuff... Our service emits protobufs 
to kafka, which we transform to avro using ProtoToAvro converter. Because those 
are service to service messages, they are not perfect and we had to do some 
bending to make sure schema can evolve. Some things: 
   1) We made everything optional
   2) Avro and Protobuf allows records to have no fields. But parquet doesn't, 
so, we inject marker field called exists pretty much everywhere. We do it 
inside HUDI to avoid clutter in kafka. Originally added it only to records with 
no fields, but found out that when fields are added and exists is autoremoved, 
parquet-avro reader breaks as it expects every field in parquet schema to be 
present in avro. I have a fix to parquet-avro to skip those fields, will send a 
PR to show so we can discuss whether it is a good idea or not. 
   3) Then we sometimes run sql transformation (same kafka stream produces 
multiple outputs). In this case, we infer schema from Spark (using 
NullTargetSchemaProvider). This schema must be correct as well which is not the 
case as spark omits defaults, so compaction breaks. Have a PR to adress.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408480553
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   On schema generation, we had to do some stuff... Our service emits protobufs 
to kafka, which we transform to avro using ProtoToAvro converter. Because those 
are service to service messages, they are not perfect and we had to do some 
bending to make sure schema can evolve. Some things: 
   1) We made everything optional
   2) Avro and Protobuf allows records to have no fields. But parquet doesn't, 
so, we inject marker field called exists pretty much everywhere. Originally 
added it only to records with no fields, but found out that when fields are 
added and exists is autoremoved, parquet-avro reader breaks as it expects every 
field in parquet schema to be present in avro. I have a fix to parquet-avro to 
skip those fields, will send a PR to show so we can discuss whether it is a 
good idea or not. 
   3) Then we sometimes run sql transformation (same kafka stream produces 
multiple outputs). In this case, we infer schema from Spark (using 
NullTargetSchemaProvider). This schema must be correct as well which is not the 
case as spark omits defaults, so compaction breaks. Have a PR to adress.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408477566
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Sure, there might be another way. I think there are changes with nulls in 
1.9.0, but upgrading is not easy. Just to clarify, it is not for the write 
support, but for the read one. Consider situation:
   We have avro with 4 metada fields written, and defaults are not set. Not an 
issue as it is written.
   We add another metadata field (sourceOffset, or something), which also 
doesn't have default field and redeploy. When we try to read old record with 
new schema it will fail because old record doesn't have new field and there is 
no default for it. I hit this issue with spark-> avro transformation, have 
another PR to address


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408472216
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Ok.. IMO hoodie metadata fields will never be written as null. If you 
strongly feel about doing this change, can we think of some alternate solution 
rather than going back to deprecated constructors. Also wanted to check with 
you how are you generating schema like you mentioned before? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (HUDI-723) SqlTransformer's schema sometimes is not registered.



 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-723:

Labels: pull-request-available  (was: )

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> If schema is inferred from RowBasedSchemaProvider when SQL transformer is 
> used it also needs to be registered. 
>  
> Current way only works if SchemaProvider has a valid target schema. Is one 
> wants to use schema from SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> Existent method will not work as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

afilipchik opened a new pull request #1518: [HUDI-723] Register avro schema if 
infered from SQL transformation
URL: https://github.com/apache/incubator-hudi/pull/1518
 
 
   ## What is the purpose of the pull request
   If schema is inferred from spark dataframe after Sql Transformation. Not 
sure if it is going to work for continuous mode, need advice. 
   
   ## Brief change log
   
 - Modified DeltaSync to support force schema registration and made sure 
new schema is registered if Transformer is used. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408469635
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   Ok so when using f.defaultVal(), NullNode corresponds to JsonProperties.Null 
class and hence the test case fails. Good catch :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Resolved] (HUDI-759) Integrate checkpoint provider

2020-04-14 Thread Yanjia Gary Li (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-759.
-
Resolution: Fixed

> Integrate checkpoint provider
> -
>
> Key: HUDI-759
> URL: https://issues.apache.org/jira/browse/HUDI-759
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-14 Thread Yanjia Gary Li (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082773#comment-17082773
 ] 

Yanjia Gary Li edited comment on HUDI-69 at 4/14/20, 10:11 PM:
---

After a closer look, I think Spark datasource support for realtime table needs:
 * Support hadoop.mapreduce.xxx apis. We use hadoop.mapred.RecordReader, but 
Spark sql use hadoop.mapreduce.RecordReader. We need to figure how to support 
both apis, or upgrade to mapreduce.   
 * Implement the extension of ParquetInputFormat from Spark or a custom data 
source reader to handle merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Use Datasource V2 to be the default data source. 

Please let me know what you guys think. 


was (Author: garyli1019):
After a closer look, I think Spark datasource support for realtime table needs:
 * Refactoring HoodieRealtimeFormat and (file split, record reader). Decouple 
Hudi logic from the MapredParquetInputFormat. I think we can maintain the Hudi 
file split and path filtering in a central place, and able to be adopted by 
different query engines. With bootstrap support, the file format maintenance 
could be more complicated. I think this is very essential. 
 * Implement the extension of ParquetInputFormat from Spark or a custom data 
source reader to handle merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Use Datasource V2 to be the default data source. 

Please let me know what you guys think. 

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HUDI-105) DeltaStreamer Kafka Ingestion does not handle invalid offsets



[ 
https://issues.apache.org/jira/browse/HUDI-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083628#comment-17083628
 ] 

Vinoth Chandar commented on HUDI-105:
-

Sorry.. missed this.. yes this fix should be there in every apache release we 
have made.. 

does the pipeline stall with this? can you try again with 0.5.2 and file a new 
JIRA with details on reproducing..

> DeltaStreamer Kafka Ingestion does not handle invalid offsets
> -
>
> Key: HUDI-105
> URL: https://issues.apache.org/jira/browse/HUDI-105
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Usability, Utilities
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Reported here [https://github.com/apache/incubator-hudi/issues/643] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HUDI-784) CorruptedLogFileException sometimes happens on GCS



 [ 
https://issues.apache.org/jira/browse/HUDI-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-784:

Labels: pull-request-available  (was: )

> CorruptedLogFileException sometimes happens on GCS
> --
>
> Key: HUDI-784
> URL: https://issues.apache.org/jira/browse/HUDI-784
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>
> 768726 [Executor task launch worker-2] ERROR 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner  - Got 
> exception when reading log file
> org.apache.hudi.exception.CorruptedLogFileException: HoodieLogFile{pathStr='
> [gs://.log.|gs://1_20200219014757.log.2]
> ', fileLen=0}could not be read. Did not find the magic bytes at the start of 
> the block
> at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.readMagic(HoodieLogFileReader.java:313)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.hasNext(HoodieLogFileReader.java:295)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFormatReader.hasNext(HoodieLogFormatReader.java:103)
>  
> I did extensive debugging and still unclear on why it is happening. It might 
> be issue with GCS libraries themselves. The fix that is working:
>  
> In: HoodieLogFileReader made 
> {code:java}
> // private final byte[] magicBuffer = new byte[6];
> {code}
> non static. I'm not sure why it is actually static in the first place as it 
> is inviting a race.
> Also in HoodieLogFileReader:
> added
> {code:java}
> // fsDataInputStream.seek(0);
> {code}
> added right after stream creation in the constructor.
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1516: [HUDI-784] Adressing issue with log reader on GCS

afilipchik opened a new pull request #1516: [HUDI-784] Adressing issue with log 
reader on GCS
URL: https://github.com/apache/incubator-hudi/pull/1516
 
 
   ## What is the purpose of the pull request
   
   This one was hard to reproduce and debug, but proposed fix gets rid of the 
error. Still not sure whether it is issue with GCS library or static array. 
   
   ## Brief change log
   
 - Modifed HoodieLogFileReader.
   
   ## Verify this pull request
   Not sure what the test should be...
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[incubator-hudi] branch master updated (644c1cc -> 14d4fea)

2020-04-14 Thread vinoth

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 644c1cc  [HUDI-698]Add unit test for CleansCommand (#1449)
 add 14d4fea  [HUDI-759] Integrate checkpoint provider with delta streamer 
(#1486)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/utilities/UtilHelpers.java | 30 
 .../checkpointing/InitialCheckPointProvider.java   | 33 -
 .../checkpointing/KafkaConnectHdfsProvider.java| 27 ++
 .../hudi/utilities/deltastreamer/DeltaSync.java|  9 ++--
 .../deltastreamer/HoodieDeltaStreamer.java | 57 +-
 .../hudi/utilities/TestHoodieDeltaStreamer.java| 34 ++---
 .../TestKafkaConnectHdfsProvider.java  | 19 +---
 7 files changed, 146 insertions(+), 63 deletions(-)

[GitHub] [incubator-hudi] vinothchandar merged pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

vinothchandar merged pull request #1486: [HUDI-759] Integrate checkpoint 
provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (HUDI-781) Re-design test utilities



[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083624#comment-17083624
 ] 

Vinoth Chandar commented on HUDI-781:
-

+1 ... I would also add having the unit tests mirror the package structure..

What you ll find is that we have mostly functional style tests.. (that where 
you invest your time, if its scarce.. probably will make the same choice if I 
am in those shoes again :))

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to re-designed with considerations like
>  * Use more mockings
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1515: [HUDI-795] Ignoring missing aux folde

afilipchik opened a new pull request #1515: [HUDI-795] Ignoring missing aux 
folde
URL: https://github.com/apache/incubator-hudi/pull/1515
 
 
   ## What is the purpose of the pull request
   
   Seeing breakages on GCS. When something removes the folder 
deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks.
   
   ## Brief change log
   
 - Modifed 
HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder to ignore 
missing folder
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (HUDI-795) HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks when aux is not present



 [ 
https://issues.apache.org/jira/browse/HUDI-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-795:

Labels: pull-request-available  (was: )

> HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks 
> when aux is not present 
> 
>
> Key: HUDI-795
> URL: https://issues.apache.org/jira/browse/HUDI-795
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Seeing this on GCS. Something removes aux folder and then delta streamer 
> fails on this call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (HUDI-795) HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks when aux is not present

Alexander Filipchik created HUDI-795:


 Summary: 
HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks 
when aux is not present 
 Key: HUDI-795
 URL: https://issues.apache.org/jira/browse/HUDI-795
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Alexander Filipchik
 Fix For: 0.6.0


Seeing this on GCS. Something removes aux folder and then delta streamer fails 
on this call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408447606
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Incorrect avro schema is a disaster waiting to happen. Current one says: it 
takes "null" or String, but no defaults. Given that it is serialized in both 
log and parquet files it has to be correct, as if we add new field it will not 
be readable. Here is example exception when message was written without a field 
and then attempted to be read with field that doesn't have default:
   `Caused by: org.apache.avro.AvroTypeException: Found Event, expecting Event, 
missing required field exist
at 
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at 
org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
at 
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408447606
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Incorrect avro schema is a disaster waiting to happen. Current one says: it 
takes "null" or String, but if you try to read the message with reader schema 
that doesn't have those fields it will break as default will not be provided. 
Given that it is serialized in both log and parquet files it has to be correct, 
as if we add new field it will not be readable. Here is example exception when 
message was written without a field and  attempted to be read with field that 
doesn't have default:
   `Caused by: org.apache.avro.AvroTypeException: Found Event, expecting Event, 
missing required field exist
at 
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at 
org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
at 
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408445302
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   But these tests are passing on the master branch without the patch that you 
did. :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408445302
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   But these tests are passing on the master branch without the patch that you 
did. :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408443108
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   the one that was modified in this PR: TestHoodieAvroUtils.java. I updated 
test avro schema. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (HUDI-774) Spark to Avro converter incorrectly generates optional fields



 [ 
https://issues.apache.org/jira/browse/HUDI-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-774:

Labels: pull-request-available  (was: )

> Spark to Avro converter incorrectly generates optional fields
> -
>
> Key: HUDI-774
> URL: https://issues.apache.org/jira/browse/HUDI-774
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>
> I think https://issues.apache.org/jira/browse/SPARK-28008 is a good 
> descriptions of what is happening.
>  
> It can cause a situation when schema in the MOR log files is incompatible 
> with the schema produced by RowBasedSchemaProvider, so compactions will stall.
>  
> I have a fix which is a bit hacky -> postprocess schema produced by the 
> converter and
> 1) Make sure unions with null types have those null types at position 0
> 2) They have default values set to null
> I couldn't find a way to do a clean fix as some classes that are problematic 
> are from Hive and called from Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

afilipchik opened a new pull request #1514: [HUDI-774] Addressing incorrect 
Spark to Avro schema generation
URL: https://github.com/apache/incubator-hudi/pull/1514
 
 
   ## What is the purpose of the pull request
   
   Avro schema that is generated from Spark Dataframe (after SQL 
transformation) has incorrect defaults. Order of types is wrong ("null" must be 
first) and default is not set. If makes every schema change incompatible, which 
is only affecting compactions. 
   
   This one is a reference PR. The solution is a bit hacky as I modify schema 
in place using reflection, there is probably a safer way to rebuild the schema. 
   
   ## Brief change log
   
   *(for example:)*
 - Added HoodieAvroUtils.rewriteIncorrectDefaults
 - Updated AvroConversionHelper and AvroConversionUtils to use it after 
SchemaConverters.toAvroType
   
   ## Verify this pull request
   
   Added test to HoodieAvroUtilsTest to verify rewriter works properly.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408436224
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   Ok.. I am still trying to understand what difference does it make if the 
schema is generated without defaultValue or if it is generated with 
defaultValue as null. Can you please provide more points to support this? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408435191
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   Which test you are talking about? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408431569
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   it doesn't generate correct avro schema. This might not be the correct fix, 
but passing (Object) null results is the schema without default value.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408430839
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   just run the test without it. You will get an exception. 
   if schema has "default": null it breaks, which should not be the case. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Commented] (HUDI-791) Replace null by Option in Delta Streamer



[ 
https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083601#comment-17083601
 ] 

Pratyaksh Sharma commented on HUDI-791:
---

Nice initiative. :) 

> Replace null by Option in Delta Streamer
> 
>
> Key: HUDI-791
> URL: https://issues.apache.org/jira/browse/HUDI-791
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> There is a lot of null in Delta Streamer. That will be great if we can 
> replace those null by Option. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

pratyakshsharma commented on issue #1512: [HUDI-763] Add 
hoodie.table.base.file.format option to hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1512#issuecomment-613673881
 
 
   > @lamber-ken trying to understand why this should be needed.. can't we 
distinguish based on the input format for ORC, where we can create 
HoodieOrcInputFormat ..
   
   +1 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408425373
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -206,7 +207,11 @@ private static GenericRecord rewrite(GenericRecord 
record, LinkedHashSet
 GenericRecord newRecord = new GenericData.Record(newSchema);
 for (Schema.Field f : fieldsToWrite) {
   if (record.get(f.name()) == null) {
-newRecord.put(f.name(), f.defaultVal());
+if (f.defaultVal() instanceof Null) {
+  newRecord.put(f.name(), null);
 
 Review comment:
   Why do you need this? If f.defaultVal() is null, it will be automatically 
populated like that. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

pratyakshsharma commented on a change in pull request #1513: [HUDI-793] Adding 
proper default to hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513#discussion_r408423175
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -104,15 +105,15 @@ public static Schema addMetadataFields(Schema schema) {
 List parentFields = new ArrayList<>();
 
 Schema.Field commitTimeField =
-new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field commitSeqnoField =
-new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field recordKeyField =
-new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.RECORD_KEY_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field partitionPathField =
-new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.PARTITION_PATH_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 Schema.Field fileNameField =
-new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null);
+new Schema.Field(HoodieRecord.FILENAME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", NullNode.getInstance());
 
 Review comment:
   By doing this change, we are effectively going back to deprecated 
constructors again. Can you please explain the purpose of doing this? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (HUDI-794) Add support for maintaining separate table wise configs in HoodieDeltaStreamer similar to HoodieMultiTableDeltaStreamer



 [ 
https://issues.apache.org/jira/browse/HUDI-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-794:
--
Status: Open  (was: New)

> Add support for maintaining separate table wise configs in 
> HoodieDeltaStreamer similar to HoodieMultiTableDeltaStreamer
> ---
>
> Key: HUDI-794
> URL: https://issues.apache.org/jira/browse/HUDI-794
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.6.0
>
>
> Basically this requires introduction of --config-folder in 
> HoodieDeltaStreamer.Config class similar to how we have in 
> HoodieMultiTableDeltaStreamer.Config class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (HUDI-794) Add support for maintaining separate table wise configs in HoodieDeltaStreamer similar to HoodieMultiTableDeltaStreamer

Pratyaksh Sharma created HUDI-794:
-

 Summary: Add support for maintaining separate table wise configs 
in HoodieDeltaStreamer similar to HoodieMultiTableDeltaStreamer
 Key: HUDI-794
 URL: https://issues.apache.org/jira/browse/HUDI-794
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Pratyaksh Sharma
Assignee: Pratyaksh Sharma
 Fix For: 0.6.0


Basically this requires introduction of --config-folder in 
HoodieDeltaStreamer.Config class similar to how we have in 
HoodieMultiTableDeltaStreamer.Config class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HUDI-427) Implement CLI support for performing bootstrap

2020-04-14 Thread Balaji Varadarajan (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083578#comment-17083578
 ] 

Balaji Varadarajan commented on HUDI-427:
-

Also need CLI support to Inspect bootrap index at table and partition level

> Implement CLI support for performing bootstrap
> --
>
> Key: HUDI-427
> URL: https://issues.apache.org/jira/browse/HUDI-427
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Major
> Fix For: 0.6.0
>
>
> Need CLI to perform bootstrap as described in 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1433: [HUDI-728]: Implement custom key generator

pratyakshsharma commented on issue #1433: [HUDI-728]: Implement custom key 
generator
URL: https://github.com/apache/incubator-hudi/pull/1433#issuecomment-613655751
 
 
   @vinothchandar @nsivabalan please take pass. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407905954
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -322,66 +347,94 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   /**
* Helper method to facilitate performing mutations (including puts and 
deletes) in Hbase.
*/
-  private void doMutations(BufferedMutator mutator, List mutations) 
throws IOException {
+  private void doMutations(BufferedMutator mutator, List mutations, 
RateLimiter limiter) throws IOException {
 if (mutations.isEmpty()) {
   return;
 }
+// report number of operations to account per second with rate limiter.
+// If #limiter.getRate() operations are acquired within 1 second, 
ratelimiter will limit the rest of calls
+// for within that second
+limiter.acquire(mutations.size());
 mutator.mutate(mutations);
 mutator.flush();
 mutations.clear();
-sleepForTime(SLEEP_TIME_MILLISECONDS);
-  }
-
-  private static void sleepForTime(int sleepTimeMs) {
-try {
-  Thread.sleep(sleepTimeMs);
-} catch (InterruptedException e) {
-  LOG.error("Sleep interrupted during throttling", e);
-  throw new RuntimeException(e);
-}
   }
 
   @Override
   public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
   HoodieTable hoodieTable) {
-final HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator = 
createQPSResourceAllocator(this.config);
-setPutBatchSize(writeStatusRDD, hBaseIndexQPSResourceAllocator, jsc);
-LOG.info("multiPutBatchSize: before hbase puts" + multiPutBatchSize);
-JavaRDD writeStatusJavaRDD = 
writeStatusRDD.mapPartitionsWithIndex(updateLocationFunction(), true);
+final Option desiredQPSFraction =  
calculateQPSFraction(writeStatusRDD, hBaseIndexQPSResourceAllocator);
+// Map each fileId that has inserts to a unique partition Id. This will be 
used while
+// repartitioning RDD
+int partitionIndex = 0;
+final List fileIds = writeStatusRDD.filter(w -> 
w.getStat().getNumInserts() > 0)
+   .map(w -> w.getFileId()).collect();
+for (final String fileId : fileIds) {
+  this.fileIdPartitionMap.put(fileId, partitionIndex++);
+}
+JavaRDD partitionedRDD = this.numWriteStatusWithInserts == 0 
? writeStatusRDD :
+  writeStatusRDD.mapToPair(w -> new 
Tuple2<>(w.getFileId(), w))
+.partitionBy(new 
WriteStatusPartitioner(this.fileIdPartitionMap,
+  this.numWriteStatusWithInserts))
+.map(w -> w._2());
 
 Review comment:
   Consider redoing this logic, because if this.numWriteStatusWithInserts == 0 
, we still go through the process of generating fileIdPartitionMap which is not 
ideal.
   
   Also, curious, if you did any performance measurements before and after this 
change. It is worth highlighting in release notes if this improvement is 
significant


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407869565
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -322,66 +347,94 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   /**
* Helper method to facilitate performing mutations (including puts and 
deletes) in Hbase.
*/
-  private void doMutations(BufferedMutator mutator, List mutations) 
throws IOException {
+  private void doMutations(BufferedMutator mutator, List mutations, 
RateLimiter limiter) throws IOException {
 if (mutations.isEmpty()) {
   return;
 }
+// report number of operations to account per second with rate limiter.
+// If #limiter.getRate() operations are acquired within 1 second, 
ratelimiter will limit the rest of calls
+// for within that second
+limiter.acquire(mutations.size());
 mutator.mutate(mutations);
 mutator.flush();
 mutations.clear();
-sleepForTime(SLEEP_TIME_MILLISECONDS);
-  }
-
-  private static void sleepForTime(int sleepTimeMs) {
-try {
-  Thread.sleep(sleepTimeMs);
-} catch (InterruptedException e) {
-  LOG.error("Sleep interrupted during throttling", e);
-  throw new RuntimeException(e);
-}
   }
 
   @Override
   public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
   HoodieTable hoodieTable) {
-final HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator = 
createQPSResourceAllocator(this.config);
-setPutBatchSize(writeStatusRDD, hBaseIndexQPSResourceAllocator, jsc);
-LOG.info("multiPutBatchSize: before hbase puts" + multiPutBatchSize);
-JavaRDD writeStatusJavaRDD = 
writeStatusRDD.mapPartitionsWithIndex(updateLocationFunction(), true);
+final Option desiredQPSFraction =  
calculateQPSFraction(writeStatusRDD, hBaseIndexQPSResourceAllocator);
+// Map each fileId that has inserts to a unique partition Id. This will be 
used while
+// repartitioning RDD
+int partitionIndex = 0;
+final List fileIds = writeStatusRDD.filter(w -> 
w.getStat().getNumInserts() > 0)
+   .map(w -> w.getFileId()).collect();
+for (final String fileId : fileIds) {
+  this.fileIdPartitionMap.put(fileId, partitionIndex++);
+}
+JavaRDD partitionedRDD = this.numWriteStatusWithInserts == 0 
? writeStatusRDD :
+  writeStatusRDD.mapToPair(w -> new 
Tuple2<>(w.getFileId(), w))
+.partitionBy(new 
WriteStatusPartitioner(this.fileIdPartitionMap,
+  this.numWriteStatusWithInserts))
+.map(w -> w._2());
+acquireQPSResourcesAndSetBatchSize(desiredQPSFraction, jsc);
+LOG.info("multiPutBatchSize before hbase puts: " + this.multiPutBatchSize);
 
 Review comment:
   nit: looks like this is logged in the above method call too. so i think this 
can be removed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407805506
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -322,66 +347,94 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   /**
* Helper method to facilitate performing mutations (including puts and 
deletes) in Hbase.
*/
-  private void doMutations(BufferedMutator mutator, List mutations) 
throws IOException {
+  private void doMutations(BufferedMutator mutator, List mutations, 
RateLimiter limiter) throws IOException {
 if (mutations.isEmpty()) {
   return;
 }
+// report number of operations to account per second with rate limiter.
+// If #limiter.getRate() operations are acquired within 1 second, 
ratelimiter will limit the rest of calls
+// for within that second
+limiter.acquire(mutations.size());
 mutator.mutate(mutations);
 mutator.flush();
 mutations.clear();
-sleepForTime(SLEEP_TIME_MILLISECONDS);
-  }
-
-  private static void sleepForTime(int sleepTimeMs) {
-try {
-  Thread.sleep(sleepTimeMs);
-} catch (InterruptedException e) {
-  LOG.error("Sleep interrupted during throttling", e);
-  throw new RuntimeException(e);
-}
   }
 
   @Override
   public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
   HoodieTable hoodieTable) {
-final HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator = 
createQPSResourceAllocator(this.config);
-setPutBatchSize(writeStatusRDD, hBaseIndexQPSResourceAllocator, jsc);
-LOG.info("multiPutBatchSize: before hbase puts" + multiPutBatchSize);
-JavaRDD writeStatusJavaRDD = 
writeStatusRDD.mapPartitionsWithIndex(updateLocationFunction(), true);
+final Option desiredQPSFraction =  
calculateQPSFraction(writeStatusRDD, hBaseIndexQPSResourceAllocator);
+// Map each fileId that has inserts to a unique partition Id. This will be 
used while
+// repartitioning RDD
+int partitionIndex = 0;
+final List fileIds = writeStatusRDD.filter(w -> 
w.getStat().getNumInserts() > 0)
+   .map(w -> w.getFileId()).collect();
+for (final String fileId : fileIds) {
+  this.fileIdPartitionMap.put(fileId, partitionIndex++);
+}
+JavaRDD partitionedRDD = this.numWriteStatusWithInserts == 0 
? writeStatusRDD :
+  writeStatusRDD.mapToPair(w -> new 
Tuple2<>(w.getFileId(), w))
+.partitionBy(new 
WriteStatusPartitioner(this.fileIdPartitionMap,
+  this.numWriteStatusWithInserts))
+.map(w -> w._2());
+acquireQPSResourcesAndSetBatchSize(desiredQPSFraction, jsc);
+LOG.info("multiPutBatchSize before hbase puts: " + this.multiPutBatchSize);
+JavaRDD writeStatusJavaRDD = 
partitionedRDD.mapPartitionsWithIndex(updateLocationFunction(),
+true);
 // caching the index updated status RDD
 writeStatusJavaRDD = 
writeStatusJavaRDD.persist(SparkConfigUtils.getWriteStatusStorageLevel(config.getProps()));
+// force trigger update location(hbase puts)
+writeStatusJavaRDD.count();
+this.hBaseIndexQPSResourceAllocator.releaseQPSResources();
 
 Review comment:
   Can you help me understand this code? why do we need to force trigger here? 
Is this just to releaseQPSResources? releaseQPSResources seems to be doing 
nothing (at least default implementation, are there other implementations 
outside hoodie?). Is it really important to release here as opposed to doing it 
in 'close()' (earlier behavior)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r408345007
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/index/TestHbaseIndex.java
 ##
 @@ -329,47 +332,140 @@ public void testPutBatchSizeCalculation() {
 // All asserts cases below are derived out of the first
 // example below, with change in one parameter at a time.
 
-int putBatchSize = batchSizeCalculator.getBatchSize(10, 16667, 1200, 200, 
100, 0.1f);
-// Expected batchSize is 8 because in that case, total request sent in one 
second is below
-// 8 (batchSize) * 200 (parallelism) * 10 (maxReqsInOneSecond) * 10 
(numRegionServers) * 0.1 (qpsFraction)) => 16000
-// We assume requests get distributed to Region Servers uniformly, so each 
RS gets 1600 request
-// 1600 happens to be 10% of 16667 (maxQPSPerRegionServer) as expected.
-Assert.assertEquals(putBatchSize, 8);
+int putBatchSize = batchSizeCalculator.getBatchSize(10, 16667, 1200, 200, 
0.1f);
+// Total puts that can be sent  in 1 second = (10 * 16667 * 0.1) = 16,667
+// Total puts per batch will be (16,667 / parallelism) = 83.335, where 200 
is the maxExecutors
+Assert.assertEquals(putBatchSize, 83);
 
 // Number of Region Servers are halved, total requests sent in a second 
are also halved, so batchSize is also halved
-int putBatchSize2 = batchSizeCalculator.getBatchSize(5, 16667, 1200, 200, 
100, 0.1f);
-Assert.assertEquals(putBatchSize2, 4);
+int putBatchSize2 = batchSizeCalculator.getBatchSize(5, 16667, 1200, 200, 
0.1f);
+Assert.assertEquals(putBatchSize2, 41);
 
 // If the parallelism is halved, batchSize has to double
-int putBatchSize3 = batchSizeCalculator.getBatchSize(10, 16667, 1200, 100, 
100, 0.1f);
-Assert.assertEquals(putBatchSize3, 16);
+int putBatchSize3 = batchSizeCalculator.getBatchSize(10, 16667, 1200, 100, 
0.1f);
+Assert.assertEquals(putBatchSize3, 166);
 
 // If the parallelism is halved, batchSize has to double.
 // This time parallelism is driven by numTasks rather than numExecutors
-int putBatchSize4 = batchSizeCalculator.getBatchSize(10, 16667, 100, 200, 
100, 0.1f);
-Assert.assertEquals(putBatchSize4, 16);
+int putBatchSize4 = batchSizeCalculator.getBatchSize(10, 16667, 100, 200, 
0.1f);
+Assert.assertEquals(putBatchSize4, 166);
 
 // If sleepTimeMs is halved, batchSize has to halve
-int putBatchSize5 = batchSizeCalculator.getBatchSize(10, 16667, 1200, 200, 
100, 0.05f);
-Assert.assertEquals(putBatchSize5, 4);
+int putBatchSize5 = batchSizeCalculator.getBatchSize(10, 16667, 1200, 200, 
0.05f);
+Assert.assertEquals(putBatchSize5, 41);
 
 // If maxQPSPerRegionServer is doubled, batchSize also doubles
-int putBatchSize6 = batchSizeCalculator.getBatchSize(10, 4, 1200, 200, 
100, 0.1f);
-Assert.assertEquals(putBatchSize6, 16);
+int putBatchSize6 = batchSizeCalculator.getBatchSize(10, 4, 1200, 200, 
0.1f);
+Assert.assertEquals(putBatchSize6, 166);
   }
 
   @Test
   public void testsHBasePutAccessParallelism() {
 HoodieWriteConfig config = getConfig();
 HBaseIndex index = new HBaseIndex(config);
 final JavaRDD writeStatusRDD = jsc.parallelize(
-Arrays.asList(getSampleWriteStatus(1, 2), getSampleWriteStatus(0, 3), 
getSampleWriteStatus(10, 0)), 10);
+Arrays.asList(
+  getSampleWriteStatus(0, 2),
+  getSampleWriteStatus(2, 3),
+  getSampleWriteStatus(4, 3),
+  getSampleWriteStatus(6, 3),
+  getSampleWriteStatus(8, 0)),
+10);
 final Tuple2 tuple = 
index.getHBasePutAccessParallelism(writeStatusRDD);
 final int hbasePutAccessParallelism = 
Integer.parseInt(tuple._2.toString());
 final int hbaseNumPuts = Integer.parseInt(tuple._1.toString());
 Assert.assertEquals(10, writeStatusRDD.getNumPartitions());
-Assert.assertEquals(2, hbasePutAccessParallelism);
-Assert.assertEquals(11, hbaseNumPuts);
+Assert.assertEquals(4, hbasePutAccessParallelism);
+Assert.assertEquals(20, hbaseNumPuts);
+  }
+
+  @Test
+  public void testsWriteStatusPartitioner() {
+HoodieWriteConfig config = getConfig();
+HBaseIndex index = new HBaseIndex(config);
+int parallelism = 4;
+final JavaRDD writeStatusRDD = jsc.parallelize(
+Arrays.asList(
+  getSampleWriteStatusWithFileId(0, 2),
+  getSampleWriteStatusWithFileId(2, 3),
+  getSampleWriteStatusWithFileId(4, 3),
+  getSampleWriteStatusWithFileId(0, 3),
+  getSampleWriteStatusWithFileId(11, 0)), parallelism);
+int partitionIndex = 0;
+final Map fileIdPartitionMap = new HashMap<>();
+
+final List fileIds = writeStatusRDD.filter(w -> 
w.getStat().getNumInserts() > 0)
 
 Review comment:
   lot of code in this test seems like repetition from source code. consider

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407807929
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -322,66 +347,94 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   /**
* Helper method to facilitate performing mutations (including puts and 
deletes) in Hbase.
*/
-  private void doMutations(BufferedMutator mutator, List mutations) 
throws IOException {
+  private void doMutations(BufferedMutator mutator, List mutations, 
RateLimiter limiter) throws IOException {
 if (mutations.isEmpty()) {
   return;
 }
+// report number of operations to account per second with rate limiter.
+// If #limiter.getRate() operations are acquired within 1 second, 
ratelimiter will limit the rest of calls
+// for within that second
+limiter.acquire(mutations.size());
 mutator.mutate(mutations);
 mutator.flush();
 mutations.clear();
-sleepForTime(SLEEP_TIME_MILLISECONDS);
-  }
-
-  private static void sleepForTime(int sleepTimeMs) {
-try {
-  Thread.sleep(sleepTimeMs);
-} catch (InterruptedException e) {
-  LOG.error("Sleep interrupted during throttling", e);
-  throw new RuntimeException(e);
-}
   }
 
   @Override
   public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
   HoodieTable hoodieTable) {
-final HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator = 
createQPSResourceAllocator(this.config);
-setPutBatchSize(writeStatusRDD, hBaseIndexQPSResourceAllocator, jsc);
-LOG.info("multiPutBatchSize: before hbase puts" + multiPutBatchSize);
-JavaRDD writeStatusJavaRDD = 
writeStatusRDD.mapPartitionsWithIndex(updateLocationFunction(), true);
+final Option desiredQPSFraction =  
calculateQPSFraction(writeStatusRDD, hBaseIndexQPSResourceAllocator);
+// Map each fileId that has inserts to a unique partition Id. This will be 
used while
+// repartitioning RDD
+int partitionIndex = 0;
+final List fileIds = writeStatusRDD.filter(w -> 
w.getStat().getNumInserts() > 0)
+   .map(w -> w.getFileId()).collect();
+for (final String fileId : fileIds) {
+  this.fileIdPartitionMap.put(fileId, partitionIndex++);
+}
+JavaRDD partitionedRDD = this.numWriteStatusWithInserts == 0 
? writeStatusRDD :
+  writeStatusRDD.mapToPair(w -> new 
Tuple2<>(w.getFileId(), w))
+.partitionBy(new 
WriteStatusPartitioner(this.fileIdPartitionMap,
+  this.numWriteStatusWithInserts))
+.map(w -> w._2());
+acquireQPSResourcesAndSetBatchSize(desiredQPSFraction, jsc);
+LOG.info("multiPutBatchSize before hbase puts: " + this.multiPutBatchSize);
+JavaRDD writeStatusJavaRDD = 
partitionedRDD.mapPartitionsWithIndex(updateLocationFunction(),
+true);
 // caching the index updated status RDD
 writeStatusJavaRDD = 
writeStatusJavaRDD.persist(SparkConfigUtils.getWriteStatusStorageLevel(config.getProps()));
+// force trigger update location(hbase puts)
+writeStatusJavaRDD.count();
+this.hBaseIndexQPSResourceAllocator.releaseQPSResources();
 return writeStatusJavaRDD;
   }
 
-  private void setPutBatchSize(JavaRDD writeStatusRDD,
-  HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator, final 
JavaSparkContext jsc) {
+  private Option calculateQPSFraction(JavaRDD 
writeStatusRDD,
+   HBaseIndexQPSResourceAllocator 
hBaseIndexQPSResourceAllocator) {
 
 Review comment:
   Not directly related to your change, so feel free to ignore this comment. 
But hBaseIndexQPSResourceAllocator  is instance variable. why is this again 
passed as argument. This seems like a consistent pattern in this class. Because 
we are also using exact same name for local variable, it masks instance 
variable and can become easily error prone if the two variables evolve to mean 
different things.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407868223
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -498,4 +554,37 @@ public boolean isImplicitWithStorage() {
   public void setHbaseConnection(Connection hbaseConnection) {
 HBaseIndex.hbaseConnection = hbaseConnection;
   }
+
+  /**
+   * Partitions each WriteStatus with inserts into a unique single partition. 
WriteStatus without inserts will be
+   * assigned to random partitions. This partitioner will be useful to utilize 
max parallelism with spark operations
+   * that are based on inserts in each WriteStatus.
+   */
+  public class WriteStatusPartitioner extends Partitioner {
 
 Review comment:
   Could you please make this a static class if its not using any instance 
variables of outer class


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407766436
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -83,13 +88,17 @@
   private static final byte[] COMMIT_TS_COLUMN = Bytes.toBytes("commit_ts");
   private static final byte[] FILE_NAME_COLUMN = Bytes.toBytes("file_name");
   private static final byte[] PARTITION_PATH_COLUMN = 
Bytes.toBytes("partition_path");
-  private static final int SLEEP_TIME_MILLISECONDS = 100;
 
   private static final Logger LOG = LogManager.getLogger(HBaseIndex.class);
   private static Connection hbaseConnection = null;
   private HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator = null;
   private float qpsFraction;
   private int maxQpsPerRegionServer;
+  private int maxPutsPerSec;
 
 Review comment:
   I dont see how this is used. could you please add a comment for all these 
instance variables? It seems like they can be local variables specific to the 
operation being performed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407812837
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -83,13 +88,17 @@
   private static final byte[] COMMIT_TS_COLUMN = Bytes.toBytes("commit_ts");
   private static final byte[] FILE_NAME_COLUMN = Bytes.toBytes("file_name");
   private static final byte[] PARTITION_PATH_COLUMN = 
Bytes.toBytes("partition_path");
-  private static final int SLEEP_TIME_MILLISECONDS = 100;
 
   private static final Logger LOG = LogManager.getLogger(HBaseIndex.class);
   private static Connection hbaseConnection = null;
   private HBaseIndexQPSResourceAllocator hBaseIndexQPSResourceAllocator = null;
 
 Review comment:
   Can you share the context on why we created HBaseIndexQPSResourceAllocator?  
Do you think calls to RateLimiter#acquire can be made inside 
HBaseIndexQPSResourceAllocator#acquireQPSResources to simplify? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r408328994
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -252,8 +263,10 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
 };
   }
 
-  private Result[] doGet(HTable hTable, List keys) throws IOException {
-sleepForTime(SLEEP_TIME_MILLISECONDS);
+  private Result[] doGet(HTable hTable, List keys, RateLimiter limiter) 
throws IOException {
+if (keys.size() > 0) {
 
 Review comment:
   nit: can we also move hTable.get(keys) inside this if?  do we need to invoke 
hTable.get if keys is empty?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

satishkotha commented on a change in pull request #1484: [HUDI-316] : Hbase qps 
repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r407815204
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/RateLimiter.java
 ##
 @@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import java.util.concurrent.TimeUnit;
+import javax.annotation.Nullable;
+import javax.annotation.concurrent.ThreadSafe;
+
+/*
+ * Note: Based on RateLimiter implementation in Google/Guava.
+ * - adopted from com.google.common.util.concurrent
+ *   Copyright (C) 2012 The Guava Authors
+ *   Home page: https://github.com/google/guava
+ *   License: http://www.apache.org/licenses/LICENSE-2.0
+ */
+
+@ThreadSafe
+public abstract class RateLimiter {
+  private final RateLimiter.SleepingTicker ticker;
+  private final long offsetNanos;
+  double storedPermits;
+  double maxPermits;
+  volatile double stableIntervalMicros;
+  private final Object mutex;
+  private long nextFreeTicketMicros;
+
+  public static RateLimiter create(double permitsPerSecond) {
+return create(RateLimiter.SleepingTicker.SYSTEM_TICKER, permitsPerSecond);
+  }
+
+  static RateLimiter create(RateLimiter.SleepingTicker ticker, double 
permitsPerSecond) {
+RateLimiter rateLimiter = new RateLimiter.Bursty(ticker, 1.0D);
+rateLimiter.setRate(permitsPerSecond);
+return rateLimiter;
+  }
+
+  public static RateLimiter create(double permitsPerSecond, long warmupPeriod, 
TimeUnit unit) {
+return create(RateLimiter.SleepingTicker.SYSTEM_TICKER, permitsPerSecond, 
warmupPeriod, unit);
+  }
+
+  static RateLimiter create(RateLimiter.SleepingTicker ticker, double 
permitsPerSecond, long warmupPeriod, TimeUnit unit) {
+RateLimiter rateLimiter = new RateLimiter.WarmingUp(ticker, warmupPeriod, 
unit);
+rateLimiter.setRate(permitsPerSecond);
+return rateLimiter;
+  }
+
+  private RateLimiter(RateLimiter.SleepingTicker ticker) {
+this.mutex = new Object();
+this.nextFreeTicketMicros = 0L;
+this.ticker = ticker;
+this.offsetNanos = ticker.read();
+  }
+
+  public final void setRate(double permitsPerSecond) {
+checkArgument(permitsPerSecond > 0.0D && !Double.isNaN(permitsPerSecond), 
"rate must be positive");
+Object var3 = this.mutex;
+synchronized (this.mutex) {
+  this.resync(this.readSafeMicros());
+  double stableIntervalMicros = (double)TimeUnit.SECONDS.toMicros(1L) / 
permitsPerSecond;
+  this.stableIntervalMicros = stableIntervalMicros;
+  this.doSetRate(permitsPerSecond, stableIntervalMicros);
+}
+  }
+
+  abstract void doSetRate(double var1, double var3);
+
+  public final double getRate() {
+return (double)TimeUnit.SECONDS.toMicros(1L) / this.stableIntervalMicros;
+  }
+
+  public void acquire() {
+this.acquire(1);
+  }
+
+  public void acquire(int permits) {
 
 Review comment:
   could you return the time spent waiting here? I think adding metrics on time 
taken is very important for debugging any potential performance issues. Also, 
would be useful to log if time taken is greater than some threshold (say, 
300ms?)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Updated] (HUDI-793) Avro schema for metadata fields is incorrect (no defaults). Defaults handling is broken



 [ 
https://issues.apache.org/jira/browse/HUDI-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-793:
-
Summary: Avro schema for metadata fields is incorrect (no defaults). 
Defaults handling is broken  (was: Avro schema for metadata fields is incorrect 
(no defaults). Default's handling is broken)

> Avro schema for metadata fields is incorrect (no defaults). Defaults handling 
> is broken
> ---
>
> Key: HUDI-793
> URL: https://issues.apache.org/jira/browse/HUDI-793
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> HoodieAvroUtils adds hudi related fields to Avros:
> {code:java}
> // Schema.Field commitTimeField =
> new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
> METADATA_FIELD_SCHEMA, "", (Object) null)
> {code}
> It generates:
> {code:java}
> // "fields": [
>   {
> "name": "_hoodie_commit_time",
> "type": [
>   "null",
>   "string"
> ],
> "doc": ""
>   }
> {code}
> and it should generate:
> {code:java}
> // "fields": [
>   {
> "name": "_hoodie_commit_time",
> "type": [
>   "null",
>   "string"
> ],
> "doc": "",
> "default": null  
> }
> {code}
> "default": null is missing
>  
> Also, when incoming schema has "default": null, HoodieAvroUtils.rewrite 
> breaks as it can't properly copy null value



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HUDI-793) Avro schema for metadata fields is incorrect (no defaults). Default's handling is broken



 [ 
https://issues.apache.org/jira/browse/HUDI-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-793:
-
Description: 
HoodieAvroUtils adds hudi related fields to Avros:
{code:java}
// Schema.Field commitTimeField =
new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null)
{code}
It generates:
{code:java}
// "fields": [
  {
"name": "_hoodie_commit_time",
"type": [
  "null",
  "string"
],
"doc": ""
  }
{code}
and it should generate:
{code:java}
// "fields": [
  {
"name": "_hoodie_commit_time",
"type": [
  "null",
  "string"
],
"doc": "",
"default": null  
}
{code}
"default": null is missing

 

Also, when incoming schema has "default": null, HoodieAvroUtils.rewrite breaks 
as it can't properly copy null value

  was:
HoodieAvroUtils adds hudi related fields to Avros: 
{code:java}
// Schema.Field commitTimeField =
new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null)
{code}
It generates: 
{code:java}
// "fields": [
  {
"name": "_hoodie_commit_time",
"type": [
  "null",
  "string"
],
"doc": ""
  }
{code}
and it should generate:
{code:java}
// "fields": [
  {
"name": "_hoodie_commit_time",
"type": [
  "null",
  "string"
],
"doc": "",
"default": null  
}
{code}
"default": null is missing

Summary: Avro schema for metadata fields is incorrect (no defaults). 
Default's handling is broken  (was: Avro schema for metadata fields is 
incorrect (no defaults))

> Avro schema for metadata fields is incorrect (no defaults). Default's 
> handling is broken
> 
>
> Key: HUDI-793
> URL: https://issues.apache.org/jira/browse/HUDI-793
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> HoodieAvroUtils adds hudi related fields to Avros:
> {code:java}
> // Schema.Field commitTimeField =
> new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
> METADATA_FIELD_SCHEMA, "", (Object) null)
> {code}
> It generates:
> {code:java}
> // "fields": [
>   {
> "name": "_hoodie_commit_time",
> "type": [
>   "null",
>   "string"
> ],
> "doc": ""
>   }
> {code}
> and it should generate:
> {code:java}
> // "fields": [
>   {
> "name": "_hoodie_commit_time",
> "type": [
>   "null",
>   "string"
> ],
> "doc": "",
> "default": null  
> }
> {code}
> "default": null is missing
>  
> Also, when incoming schema has "default": null, HoodieAvroUtils.rewrite 
> breaks as it can't properly copy null value



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HUDI-793) Avro schema for metadata fields is incorrect (no defaults)



 [ 
https://issues.apache.org/jira/browse/HUDI-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-793:

Labels: pull-request-available  (was: )

> Avro schema for metadata fields is incorrect (no defaults)
> --
>
> Key: HUDI-793
> URL: https://issues.apache.org/jira/browse/HUDI-793
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
>
> HoodieAvroUtils adds hudi related fields to Avros: 
> {code:java}
> // Schema.Field commitTimeField =
> new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
> METADATA_FIELD_SCHEMA, "", (Object) null)
> {code}
> It generates: 
> {code:java}
> // "fields": [
>   {
> "name": "_hoodie_commit_time",
> "type": [
>   "null",
>   "string"
> ],
> "doc": ""
>   }
> {code}
> and it should generate:
> {code:java}
> // "fields": [
>   {
> "name": "_hoodie_commit_time",
> "type": [
>   "null",
>   "string"
> ],
> "doc": "",
> "default": null  
> }
> {code}
> "default": null is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik opened a new pull request #1513: [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine

afilipchik opened a new pull request #1513: [HUDI-793] Adding proper default to 
hudi metadata fields and proper handling to rewrite routine
URL: https://github.com/apache/incubator-hudi/pull/1513
 
 
   
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Fixing issue with avro and default. Current implementation doesn't set 
correct default fields when adding hudi metadata and also fails if record's 
schema has them set to:
   "dafault": null
   
   ## Brief change log
   
   *(for example:)*
 - *Modified HoodieAvroUtils
   
   ## Verify this pull request
   
   I added default to the already existing test.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[jira] [Created] (HUDI-793) Avro schema for metadata fields is incorrect (no defaults)

Alexander Filipchik created HUDI-793:


 Summary: Avro schema for metadata fields is incorrect (no defaults)
 Key: HUDI-793
 URL: https://issues.apache.org/jira/browse/HUDI-793
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Alexander Filipchik


HoodieAvroUtils adds hudi related fields to Avros: 
{code:java}
// Schema.Field commitTimeField =
new Schema.Field(HoodieRecord.COMMIT_TIME_METADATA_FIELD, 
METADATA_FIELD_SCHEMA, "", (Object) null)
{code}
It generates: 
{code:java}
// "fields": [
  {
"name": "_hoodie_commit_time",
"type": [
  "null",
  "string"
],
"doc": ""
  }
{code}
and it should generate:
{code:java}
// "fields": [
  {
"name": "_hoodie_commit_time",
"type": [
  "null",
  "string"
],
"doc": "",
"default": null  
}
{code}
"default": null is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] afilipchik commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

afilipchik commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-613613150
 
 
   I found a way to get out compaction situation. Had to modify parquet-avro to 
allow avro style compatibility. Currently if avro doesn't have all the fields 
that are in parquet -> it is exception. I made it so if fields are missing in 
avro, parquet-avro reader will just skip them. Was able to resolve incompatible 
change on our dev env fine.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1509: [HUDI-525] lack of insert info in delta_commit inflight

codecov-io edited a comment on issue #1509: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1509#issuecomment-612629725
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=h1) 
Report
   > Merging 
[#1509](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/644c1cc8bd1965271c4433edccb17aa8fda5f403=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1509/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1509  +/-   ##
   
   + Coverage 72.17%   72.21%   +0.03% 
 Complexity  289  289  
   
 Files   372  372  
 Lines 1626716272   +5 
 Branches   1637 1637  
   
   + Hits  1174111751  +10 
   + Misses 3791 3787   -4 
   + Partials735  734   -1 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../table/action/commit/BaseCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1509/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CYXNlQ29tbWl0QWN0aW9uRXhlY3V0b3IuamF2YQ==)
 | `85.47% <100.00%> (+0.64%)` | `0.00 <0.00> (ø)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1509/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...on/rollback/MergeOnReadRollbackActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1509/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL01lcmdlT25SZWFkUm9sbGJhY2tBY3Rpb25FeGVjdXRvci5qYXZh)
 | `86.11% <0.00%> (+1.38%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1509/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `72.22% <0.00%> (+13.88%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1509/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `80.00% <0.00%> (+40.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=footer).
 Last update 
[644c1cc...5f9c174](https://codecov.io/gh/apache/incubator-hudi/pull/1509?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] ahmed-elfar edited a comment on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

ahmed-elfar edited a comment on issue #1498: Migrating parquet table to hudi
issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-613563197

@vinothchandar I apologies for the delayed response, and thanks again for
your help and detailed answers.

> Is it possible to share the data generation tool with us or point us to
reproducing this ourselves locally? We can go much faster if we are able to
repro this ourselves..

Sure, this is the public Repo for generating the data
[https://github.com/gregrahn/tpch-kit](url)
And it provides the information you need for data generation, size, etc

> Schema for lineitem

Adding more details and update the schema screenshot mentioned on previous
comment:

**RECORDKEY_FIELD_OPT_KEY**: is composite (l_linenumber, l_orderkey)
**PARTITIONPATH_FIELD_OPT_KEY**: optional default (non-portioned), or
l_shipmode
**PRECOMBINE_FIELD_OPT_KEY**: l_commitdate, or generating new timestamp
column **last_updated**.

![Screenshot from 2020-04-14
17-54-43](https://user-images.githubusercontent.com/20902425/79246806-f680cc00-7e79-11ea-8edc-bd711d8492ff.png)

This is the **official documentation** for the datasets definition, schema,
queries and business logic behind the queries
[http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf](url)

> @bvaradar & @umehrot2 will have the ability to seamlessly bootstrap the
data into hudi without rewriting in the next release.

Are we talking about the proposal mentioned at
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi](url)

because we need more clarification regarding this approach.

> you ll also have the ability to do a one-time bulk_insert for last N
partitions to get the upsert performance benefits as we discussed above

There one of the attempts mentioned on the first comment which might be
similar, I will explain in details to check with you if it should work for now
or not, **if it provides a valid Hudi table**:

So consider we have table of size 1TB parquet format as an input table
either **partitioned** or **non- partitioned**. Spark resource 256GB ram and 32
cores:

**Case non-partitioned**

- We use the suggested/recommended partition column(s), then project this
partition column(s) and apply distinct which will provide you with filter
values you need to pass to next process of the pipe line.
- The next step submit a sequential spark applications which filter the
input data based on the passed filter value(s) resulting in data frame of
single partition.
- Write (**bulk-insert**) the filtered dataframe as Hudi table with the
provided partition column using save-mode **append**
- Hudi table being written partition by partition.
- Query the Hudi table to check if it is valid table, and it looks valid.

**Case partitioned** same as above, with faster filter operations.

**Pros**:
- Avoided a lot of disk spilling, GC hits.
- Using less resources for initial loading.

**Cons**:
- No time improvements in case you have enough resources to load the table
at once.
- We ended up with partitioned table, which might not be needed in some of
our use cases.

**Questions**:
- If this approach is valid or going to impact the upsert operations in the
future?

> I would be happy to jump on a call with you folks and get this moving
along.. I am also very excited to work with a user like yourself and move the
perf aspects of the project along more..

We are excited as well to have a call together, please inform me how we can
proceed on this meeting.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services

[jira] [Commented] (HUDI-781) Re-design test utilities

2020-04-14 Thread Raymond Xu (Jira)



[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083451#comment-17083451
 ] 

Raymond Xu commented on HUDI-781:
-

[~vinothchandar] [~yanghua] Before we consider re-design some test utilities, I 
think it will be easier to just move classes and categorize them. I'm thinking 
adding these 2 packages under each module

* functional/ contains test cases that require spark context and local servers
* testutils/ contains all test utilities for that module

all other test packages should only contain pure-logic unit tests

As for RFC, guess it's not very necessary at the moment for the restructuring. 
WDYT?

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to re-designed with considerations like
>  * Use more mockings
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] ahmed-elfar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]