[jira] [Updated] (HUDI-482) Fix missing @Override annotation on method

2019-12-30 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-482:
---
Fix Version/s: 0.5.1

> Fix missing @Override annotation on method
> --
>
> Key: HUDI-482
> URL: https://issues.apache.org/jira/browse/HUDI-482
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> An overridden method from an interface or abstract class should be marked 
> with the @Override annotation. Once the method signature in the abstract class 
> changes, the implementing class will immediately report a compile error.
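
As a minimal illustration of the rationale (a hypothetical sketch, not code from 
the Hudi patch):

{code:java}
abstract class Payload {
  abstract String combine(String other);
}

class OverwritePayload extends Payload {
  // With @Override present, renaming Payload.combine(String) to, say,
  // combine(String, boolean) makes this method fail to compile immediately,
  // instead of silently turning into an unrelated overload.
  @Override
  String combine(String other) {
    return other;
  }
}
{code}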



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-482) Fix missing @Override annotation on method

2019-12-30 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed HUDI-482.
--

> Fix missing @Override annotation on method
> --
>
> Key: HUDI-482
> URL: https://issues.apache.org/jira/browse/HUDI-482
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> An overridden method from an interface or abstract class should be marked 
> with the @Override annotation. Once the method signature in the abstract class 
> changes, the implementing class will immediately report a compile error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-482) Fix missing @Override annotation on method

2019-12-30 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi resolved HUDI-482.

Resolution: Fixed

> Fix missing @Override annotation on method
> --
>
> Key: HUDI-482
> URL: https://issues.apache.org/jira/browse/HUDI-482
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> An overridden method from an interface or abstract class should be marked 
> with the @Override annotation. Once the method signature in the abstract class 
> changes, the implementing class will immediately report a compile error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-482) Fix missing @Override annotation on method

2019-12-30 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-482:
---
Status: Open  (was: New)

> Fix missing @Override annotation on method
> --
>
> Key: HUDI-482
> URL: https://issues.apache.org/jira/browse/HUDI-482
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> An overridden method from an interface or abstract class should be marked 
> with the @Override annotation. Once the method signature in the abstract class 
> changes, the implementing class will immediately report a compile error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2019-12-30 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005978#comment-17005978
 ] 

Ethan Guo commented on HUDI-472:


[~hzp2000itb]  I just noticed that you assigned the ticket to yourself.  Are you 
working on aspects other than what I described in the PR and above?  We can 
create another ticket for you to avoid conflicts.

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: He ZongPing
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2019-12-30 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005976#comment-17005976
 ] 

Ethan Guo commented on HUDI-472:


I've put up a WIP PR regarding this.

The original design choice of applying sortBy() to all deduped records inside 
bulkInsertInternal() for bulk_insert is that the records end up ordered by 
data partition path + record key across all RDD partitions, and each RDD 
partition holds a similar number of records.  This gracefully ensures that 
while iterating through the records, the data partition path of the next record 
either remains the same as the current one or monotonically increases.  This 
guarantees that we only need one parquet file writer (CopyOnWriteInsertHandler) 
at a time and the memory pressure is very low (no need to cache many records 
in memory) to meet the file sizing requirements.  The resulting parquet 
files within each data partition have mutually exclusive record key 
ranges/indexes, making it efficient to find the file holding a specific record 
key.
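
As a rough illustration of this global sort (a sketch with stand-in types, not 
Hudi's actual code):

{code:java}
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;

public class BulkInsertSortSketch {

  // Stand-in for a Hudi record with a partition path and a record key.
  public static class Record implements Serializable {
    public String partitionPath;
    public String recordKey;
  }

  // Globally sort deduped records by data partition path + record key, as
  // described above: afterwards each RDD partition holds a contiguous key
  // range, so a single parquet writer can be kept open while iterating.
  public static JavaRDD<Record> globalSort(JavaRDD<Record> deduped, int parallelism) {
    return deduped.sortBy(r -> r.partitionPath + "+" + r.recordKey, true, parallelism);
  }
}
{code}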

Three nuances here regarding this design choice:

(1) Repartitioning the deduped records based only on the data partition path 
may not work well if the input records are highly skewed in terms of the data 
partition path, e.g., one partition has much more data than another, so one 
RDD partition after repartitioning may have too much data to process 
compared to other RDD partitions, underutilizing the parallelism.

(2) If no sorting is done for the records within an RDD partition, there is 
higher memory pressure on the executor, as the executor must keep multiple 
parquet file writers (CopyOnWriteInsertHandler) open and keep the records in 
memory for bundling and efficient writes (the max memory usage can be min(data 
size of RDD partition, number of possible data partition paths * buffer size)). 
 This does not scale well for TB-scale bulk_insert, as the amount of data can 
easily surpass the amount of memory available.

(3) If partition path + record key is not sorted globally, i.e., many RDD 
partitions may have records from overlapping data partition paths, each data 
partition would end up with many small parquet files written from many RDD 
partitions, with possibly overlapping record key ranges (min and max).  Seeking 
a record key would then take a performance hit.

These design choices target very large-scale bulk_insert.  For relatively 
small bulk_insert, they may incur unnecessary overhead without benefiting the 
user.  So the WIP PR addresses this by providing one knob to disable sorting 
and another knob to choose the sort mode (global sort, or local sort within 
each RDD partition), so that the behavior is configurable for the specific 
workload.  Again, the performance of bulk_insert depends on both the profile of 
the input data and the configs used.

I have another thought that has not been implemented in the PR.  The sortBy() 
in Spark has two general stages: (1) sampling all the records and determining 
the record key range for each RDD partition, then shuffling the data based on 
those ranges, and (2) sorting within each RDD partition.  One tradeoff we can 
make here is to keep stage 1 and skip stage 2, instead keeping multiple parquet 
file writers open while writing the records.  In this case, the number of 
writers should be bounded, since range partitioning has already been done in 
stage 1.  This may work for intermediate-scale bulk_insert.

 

 

 

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: He ZongPing
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong commented on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2019-12-30 Thread GitBox
hddong commented on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-569882681
 
 
   @bvaradar thanks for your review and suggestions; I will make the changes later.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-457) Redo hudi-common log statements using SLF4J

2019-12-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-457:

Labels: pull-request-available  (was: )

> Redo hudi-common log statements using SLF4J
> ---
>
> Key: HUDI-457
> URL: https://issues.apache.org/jira/browse/HUDI-457
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: leesf
>Assignee: Jiaqi Li
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] sev7e0 opened a new pull request #1161: [HUDI-457]Redo hudi-common log statements using SLF4J

2019-12-30 Thread GitBox
sev7e0 opened a new pull request #1161: [HUDI-457]Redo hudi-common log 
statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1161
 
 
   
   ## What is the purpose of the pull request
   
   Redo hudi-common log statements using SLF4J
   
   ## Brief change log
   
   
   
   ## Verify this pull request
   
   
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2019-12-30 Thread GitBox
hddong commented on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-569881431
 
 
   > @hddong can you explain why add operation type to HoodieCommitMetadata? 
thanks
   
   @hmatu It has no special function other than information tagging and tracing 
in this patch for now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-404) Compile Master's Source Code Error

2019-12-30 Thread Suneel Marthi (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005965#comment-17005965
 ] 

Suneel Marthi commented on HUDI-404:


This issue can be 'Closed' since the PR for this has been closed.

> Compile Master's Source Code  Error
> ---
>
> Key: HUDI-404
> URL: https://issues.apache.org/jira/browse/HUDI-404
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Usability
>Reporter: Xurenhe
>Priority: Major
>  Labels: pull-request-available
> Attachments: hudi-compile-error.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hi, I downloaded the source code of Hudi, and when I use the command 'mvn clean 
> package -DskipTests -DskipITs' to compile the project, some errors happen.
> Checking the Maven dependencies, I found that the dependency 
> 'com.google.code.findbugs:jsr305' is missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2019-12-30 Thread GitBox
pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added 
default partition path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#discussion_r362162038
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -64,6 +64,14 @@ public static String 
getNullableNestedFieldValAsString(GenericRecord record, Str
 }
   }
 
+  public static Object getNullableNestedFieldVal(GenericRecord record, String 
fieldName) {
+try {
+  return getNestedFieldVal(record, fieldName);
 
 Review comment:
   @bvaradar Actually I wrote the above function following the way the 
getNullableNestedFieldValAsString function is written in the same class. Should 
I also change the usages of that function by passing a flag, as you are 
suggesting? That function is compute-intensive as well. 
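
For context, the truncated snippet above presumably completes along these lines 
(a guess at the shape, assuming the goal is to swallow the field-lookup 
exception; the actual patch may differ):

{code:java}
public static Object getNullableNestedFieldVal(GenericRecord record, String fieldName) {
  try {
    return getNestedFieldVal(record, fieldName);
  } catch (Exception e) {
    // Assumed behavior: treat a missing/unresolvable nested field as null
    // rather than propagating the exception to the caller.
    return null;
  }
}
{code}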


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-343) Create a DOAP File for Hudi

2019-12-30 Thread Suneel Marthi (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005958#comment-17005958
 ] 

Suneel Marthi commented on HUDI-343:


Pull Request available for review - we need to get this out before the next 
release.

> Create a DOAP File for Hudi
> ---
>
> Key: HUDI-343
> URL: https://issues.apache.org/jira/browse/HUDI-343
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Release  Administrative
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> But please create a DOAP file for Hudi, where you can also list the
> release: https://projects.apache.org/create.html
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2019-12-30 Thread GitBox
pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added 
default partition path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#discussion_r362160009
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -64,6 +64,14 @@ public static String 
getNullableNestedFieldValAsString(GenericRecord record, Str
 }
   }
 
+  public static Object getNullableNestedFieldVal(GenericRecord record, String 
fieldName) {
+try {
+  return getNestedFieldVal(record, fieldName);
 
 Review comment:
   Point well taken. Will do the changes. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (e637d9e -> add4b1e)

2019-12-30 Thread smarthi
This is an automated email from the ASF dual-hosted git repository.

smarthi pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from e637d9e  [HUDI-455] Redo hudi-client log statements using SLF4J (#1145)
 new bb90ded  [MINOR] Fix out of limits for results
 new 36c0e6b  [MINOR] Fix out of limits for results
 new 74b00d1  trigger rebuild
 new 619f501  Clean up code
 new add4b1e  Merge pull request #1143 from BigDataArtisans/outoflimit

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/hudi/cli/commands/HoodieLogFileCommand.java   | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added default partition path in TimestampBasedKeyGenerator

2019-12-30 Thread GitBox
pratyakshsharma commented on a change in pull request #1154: [HUDI-406] Added 
default partition path in TimestampBasedKeyGenerator
URL: https://github.com/apache/incubator-hudi/pull/1154#discussion_r362159857
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/keygen/TimestampBasedKeyGenerator.java
 ##
 @@ -96,8 +95,7 @@ public HoodieKey getKey(GenericRecord record) {
   } else if (partitionVal instanceof String) {
 unixTime = inputDateFormat.parse(partitionVal.toString()).getTime() / 
1000;
   } else {
-throw new HoodieNotSupportedException(
-"Unexpected type for partition field: " + 
partitionVal.getClass().getName());
+unixTime = 1L;
 
 Review comment:
   @bvaradar I made this change with the idea that when partitionVal is returned 
as null from DataSourceUtils, it will not match any of the instanceof conditions 
and will thus throw the exception, per the previous code. 
   But it is a valid point that the exception can be due to some user 
configuration error as well. Will modify the code accordingly. 
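
One way the agreed-upon change could look (a self-contained sketch with a 
hypothetical DEFAULT_PARTITION_PATH constant, not the final patch):

{code:java}
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampPartitionSketch {

  // Hypothetical default used when the partition field is null/absent.
  static final String DEFAULT_PARTITION_PATH = "default";

  // Sketch: fall back to a default partition path for a null value, while
  // still rejecting genuinely unexpected types, which usually indicate a
  // user configuration error.
  static String toPartitionPath(Object partitionVal, SimpleDateFormat inputFormat,
      SimpleDateFormat outputFormat) throws ParseException {
    if (partitionVal == null) {
      return DEFAULT_PARTITION_PATH;
    }
    long unixTime;
    if (partitionVal instanceof Number) {
      unixTime = ((Number) partitionVal).longValue();
    } else if (partitionVal instanceof String) {
      unixTime = inputFormat.parse((String) partitionVal).getTime() / 1000;
    } else {
      throw new IllegalArgumentException(
          "Unexpected type for partition field: " + partitionVal.getClass().getName());
    }
    return outputFormat.format(new Date(unixTime * 1000));
  }
}
{code}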


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi merged pull request #1143: [MINOR] Fix out of limits for results

2019-12-30 Thread GitBox
smarthi merged pull request #1143: [MINOR] Fix out of limits for results
URL: https://github.com/apache/incubator-hudi/pull/1143
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-343) Create a DOAP File for Hudi

2019-12-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-343:

Labels: pull-request-available  (was: )

> Create a DOAP File for Hudi
> ---
>
> Key: HUDI-343
> URL: https://issues.apache.org/jira/browse/HUDI-343
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Release  Administrative
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> But please create a DOAP file for Hudi, where you can also list the
> release: https://projects.apache.org/create.html
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi opened a new pull request #1160: [HUDI-343]: Create a DOAP file for Hudi

2019-12-30 Thread GitBox
smarthi opened a new pull request #1160: [HUDI-343]: Create a DOAP file for Hudi
URL: https://github.com/apache/incubator-hudi/pull/1160
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   DOAP file for HUDI
   
   ## Brief change log
   
   ASF administrative requirement for all projects in the Foundation
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1143: [MINOR] Fix out of limits for results

2019-12-30 Thread GitBox
lamber-ken commented on a change in pull request #1143: [MINOR] Fix out of 
limits for results
URL: https://github.com/apache/incubator-hudi/pull/1143#discussion_r362153348
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##
 @@ -215,9 +215,11 @@ public String showLogFileRecords(
   if (n instanceof HoodieAvroDataBlock) {
 HoodieAvroDataBlock blk = (HoodieAvroDataBlock) n;
 List records = blk.getRecords();
-allRecords.addAll(records);
-if (allRecords.size() >= limit) {
-  break;
+for (IndexedRecord record : records) {
+  if (allRecords.size() >= limit) {
 
 Review comment:
   Reasonable.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (HUDI-455) Redo hudi-client log statements using SLF4J

2019-12-30 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang resolved HUDI-455.
---
Resolution: Done

Done via master branch: e637d9ed26fea1a336f2fd6139cde0dd192c429d

> Redo hudi-client log statements using SLF4J
> ---
>
> Key: HUDI-455
> URL: https://issues.apache.org/jira/browse/HUDI-455
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: leesf
>Assignee: hejinbiao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-455) Redo hudi-client log statements using SLF4J

2019-12-30 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-455:
--
Fix Version/s: 0.5.1

> Redo hudi-client log statements using SLF4J
> ---
>
> Key: HUDI-455
> URL: https://issues.apache.org/jira/browse/HUDI-455
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: leesf
>Assignee: hejinbiao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua merged pull request #1145: [HUDI-455] Redo hudi-client log statements using SLF4J

2019-12-30 Thread GitBox
yanghua merged pull request #1145: [HUDI-455] Redo hudi-client log statements 
using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1145
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (ab6ae5c -> e637d9e)

2019-12-30 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from ab6ae5c  [HUDI-482] Fix missing @Override annotation on methods (#1156)
 add e637d9e  [HUDI-455] Redo hudi-client log statements using SLF4J (#1145)

No new revisions were added by this update.

Summary of changes:
 hudi-client/pom.xml|  5 ++
 .../java/org/apache/hudi/AbstractHoodieClient.java |  6 +--
 .../org/apache/hudi/CompactionAdminClient.java | 17 +++
 .../java/org/apache/hudi/HoodieCleanClient.java| 16 +++
 .../java/org/apache/hudi/HoodieReadClient.java |  6 +--
 .../java/org/apache/hudi/HoodieWriteClient.java| 56 +++---
 .../client/embedded/EmbeddedTimelineService.java   | 10 ++--
 .../hbase/DefaultHBaseQPSResourceAllocator.java| 10 ++--
 .../org/apache/hudi/index/hbase/HBaseIndex.java| 42 
 .../org/apache/hudi/io/HoodieAppendHandle.java | 20 
 .../java/org/apache/hudi/io/HoodieCleanHelper.java | 16 +++
 .../org/apache/hudi/io/HoodieCommitArchiveLog.java | 22 -
 .../org/apache/hudi/io/HoodieCreateHandle.java | 17 ---
 .../org/apache/hudi/io/HoodieKeyLookupHandle.java  | 19 
 .../java/org/apache/hudi/io/HoodieMergeHandle.java | 39 ---
 .../java/org/apache/hudi/io/HoodieWriteHandle.java | 10 ++--
 .../io/compact/HoodieRealtimeTableCompactor.java   | 28 +--
 .../org/apache/hudi/metrics/HoodieMetrics.java | 17 +++
 .../apache/hudi/metrics/JmxMetricsReporter.java|  6 +--
 .../main/java/org/apache/hudi/metrics/Metrics.java |  6 +--
 .../hudi/metrics/MetricsGraphiteReporter.java  |  6 +--
 .../hudi/metrics/MetricsReporterFactory.java   |  8 ++--
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  | 52 ++--
 .../apache/hudi/table/HoodieMergeOnReadTable.java  | 27 +--
 .../java/org/apache/hudi/table/HoodieTable.java| 12 ++---
 .../org/apache/hudi/table/RollbackExecutor.java| 14 +++---
 26 files changed, 242 insertions(+), 245 deletions(-)



[jira] [Commented] (HUDI-474) Delta Streamer is not able to read the commit files

2019-12-30 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005938#comment-17005938
 ] 

Balaji Varadarajan commented on HUDI-474:
-

[~srkhan]: [~lamber-ken]'s fix would result in empty clean-requested files no 
longer getting created. So, if you rerun the ingestion after cleaning up 
the empty clean-requested files, it should hopefully work.

> Delta Streamer is not able to read the commit files
> ---
>
> Key: HUDI-474
> URL: https://issues.apache.org/jira/browse/HUDI-474
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Shahida Khan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Gmail - Commit time issue in DeltaStreamer 
> (Real-Time).pdf
>
>
> DeltaStreamer is not able to read the correct commit files when the job 
> is deployed in real time.
> Below is the stack trace: 
> {code:java}
> java.util.concurrent.ExecutionException:
>  org.apache.hudi.exception.HoodieException: Could not read commit
>  details from 
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>       at
>  java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
>    at
>  java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at
>  
> org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
>       at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
>   at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
>   at
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at
>  
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
>   at
>  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    at
>  java.lang.reflect.Method.invoke(Method.java:498)        at
>  
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)Caused
>  by: org.apache.hudi.exception.HoodieException: Could not read commit
>  details from 
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>       at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
>         at
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>      at
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at
>  java.lang.Thread.run(Thread.java:748)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1143: [MINOR] Fix out of limits for results

2019-12-30 Thread GitBox
smarthi commented on a change in pull request #1143: [MINOR] Fix out of limits 
for results
URL: https://github.com/apache/incubator-hudi/pull/1143#discussion_r362147928
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/HoodieLogFileCommand.java
 ##
 @@ -215,9 +215,11 @@ public String showLogFileRecords(
   if (n instanceof HoodieAvroDataBlock) {
 HoodieAvroDataBlock blk = (HoodieAvroDataBlock) n;
 List records = blk.getRecords();
-allRecords.addAll(records);
-if (allRecords.size() >= limit) {
-  break;
+for (IndexedRecord record : records) {
+  if (allRecords.size() >= limit) {
 
 Review comment:
   nitpick: reverse the if condition and avoid 'break'
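
One reading of the suggestion (a sketch, not the committed code):

{code:java}
import java.util.List;
import org.apache.avro.generic.IndexedRecord;

class LimitCopySketch {
  // Guard each add with the limit check instead of breaking out of the loop.
  static void copyUpToLimit(List<IndexedRecord> records,
      List<IndexedRecord> allRecords, int limit) {
    for (IndexedRecord record : records) {
      if (allRecords.size() < limit) {
        allRecords.add(record);
      }
    }
  }
}
{code}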


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua merged pull request #1156: [HUDI-482] Fix missing @Override annotation on methods

2019-12-30 Thread GitBox
yanghua merged pull request #1156: [HUDI-482] Fix missing @Override annotation 
on methods
URL: https://github.com/apache/incubator-hudi/pull/1156
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (2a823f3 -> ab6ae5c)

2019-12-30 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 2a823f3  [MINOR]: alter some wrong params which bring fatal exception
 add ab6ae5c  [HUDI-482] Fix missing @Override annotation on methods (#1156)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/cli/HoodieHistoryFileNameProvider.java | 1 +
 hudi-cli/src/main/java/org/apache/hudi/cli/HoodieSplashScreen.java  | 3 +++
 hudi-client/src/main/java/org/apache/hudi/AbstractHoodieClient.java | 1 +
 hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java| 1 +
 .../java/org/apache/hudi/func/SparkBoundedInMemoryExecutor.java | 1 +
 .../main/java/org/apache/hudi/index/bloom/BloomIndexFileInfo.java   | 1 +
 .../src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java   | 2 ++
 .../src/main/java/org/apache/hudi/io/HoodieCreateHandle.java| 1 +
 hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java | 6 ++
 .../main/java/org/apache/hudi/io/storage/HoodieParquetWriter.java   | 1 +
 .../src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java | 1 +
 .../java/org/apache/hudi/common/table/log/HoodieLogFileReader.java  | 1 +
 .../org/apache/hudi/common/table/log/HoodieLogFormatWriter.java | 2 ++
 .../org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java | 1 +
 .../apache/hudi/common/table/view/AbstractTableFileSystemView.java  | 3 +++
 .../apache/hudi/common/table/view/HoodieTableFileSystemView.java| 2 ++
 .../hudi/common/table/view/SpillableMapBasedFileSystemView.java | 2 ++
 .../main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java | 1 +
 .../java/org/apache/hudi/common/util/collection/DiskBasedMap.java   | 1 +
 .../org/apache/hudi/common/util/collection/LazyFileIterable.java| 1 +
 .../main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java  | 2 ++
 .../org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java   | 6 ++
 .../apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java| 1 +
 23 files changed, 42 insertions(+)



Build failed in Jenkins: hudi-snapshot-deployment-0.5 #145

2019-12-30 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.18 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[GitHub] [incubator-hudi] yanghua commented on issue #626: Adding documentation for hudi test suite

2019-12-30 Thread GitBox
yanghua commented on issue #626: Adding documentation for hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#issuecomment-569853512
 
 
   OK, I will do it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1151: [WIP] [HUDI-476] Add hudi-examples module

2019-12-30 Thread GitBox
yanghua commented on issue #1151: [WIP] [HUDI-476] Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-569851801
 
 
   Agreed on initializing the module with at least some example code; it would 
at least show that the contributor has the capacity and energy to drive this work.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-29) Patch to Hive-sync to enable stats on Hive tables #393

2019-12-30 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-29?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen reassigned HUDI-29:
--

Assignee: Vinoth Chandar

> Patch to Hive-sync to enable stats on Hive tables #393
> --
>
> Key: HUDI-29
> URL: https://issues.apache.org/jira/browse/HUDI-29
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/393



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-29) Patch to Hive-sync to enable stats on Hive tables #393

2019-12-30 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-29?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen updated HUDI-29:
---
Status: Closed  (was: Patch Available)

> Patch to Hive-sync to enable stats on Hive tables #393
> --
>
> Key: HUDI-29
> URL: https://issues.apache.org/jira/browse/HUDI-29
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/393



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-29) Patch to Hive-sync to enable stats on Hive tables #393

2019-12-30 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-29?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen reassigned HUDI-29:
--

Assignee: (was: Vinoth Chandar)

> Patch to Hive-sync to enable stats on Hive tables #393
> --
>
> Key: HUDI-29
> URL: https://issues.apache.org/jira/browse/HUDI-29
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/393



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-29) Patch to Hive-sync to enable stats on Hive tables #393

2019-12-30 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-29?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen reopened HUDI-29:


> Patch to Hive-sync to enable stats on Hive tables #393
> --
>
> Key: HUDI-29
> URL: https://issues.apache.org/jira/browse/HUDI-29
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/393



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-469) HoodieCommitMetadata only show first commit insert rows.

2019-12-30 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen updated HUDI-469:

Status: Patch Available  (was: In Progress)

> HoodieCommitMetadata only show first commit insert rows. 
> -
>
> Key: HUDI-469
> URL: https://issues.apache.org/jira/browse/HUDI-469
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
>Reporter: cdmikechen
>Assignee: cdmikechen
>Priority: Major
> Fix For: 0.5.1
>
>
> When I run the Hudi CLI to get insert rows, I found that it cannot get 
> insert rows that are not in the first commit. The 
> {{HoodieCommitMetadata.fetchTotalInsertRecordsWritten()}} method uses 
> {{stat.getPrevCommit().equalsIgnoreCase("null")}} to filter for the first commit. 
> This check should be removed.
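
A hedged before/after sketch of the reported behavior, using a minimal stand-in 
for HoodieWriteStat (illustrative only; the actual class and fix may differ):

{code:java}
import java.util.List;

class InsertCountSketch {

  // Minimal stand-in for HoodieWriteStat, for illustration only.
  static class WriteStat {
    String prevCommit;
    long numInserts;
  }

  // Before (per the report): only stats whose previous commit is the literal
  // "null", i.e., first-commit inserts, are counted.
  static long insertsFirstCommitOnly(List<WriteStat> stats) {
    long total = 0;
    for (WriteStat stat : stats) {
      if (stat.prevCommit != null && stat.prevCommit.equalsIgnoreCase("null")) {
        total += stat.numInserts;
      }
    }
    return total;
  }

  // After (the suggested fix): drop the filter and count inserts from all stats.
  static long insertsAllCommits(List<WriteStat> stats) {
    long total = 0;
    for (WriteStat stat : stats) {
      total += stat.numInserts;
    }
    return total;
  }
}
{code}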



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2019-12-30 Thread GitBox
bvaradar commented on a change in pull request #1157: [HUDI-332]Add operation 
type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#discussion_r362130717
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieCommitMetadata.java
 ##
 @@ -46,4 +47,25 @@ public void testPerfStatPresenceInHoodieMetadata() throws 
Exception {
 Assert.assertTrue(metadata.getTotalScanTime() == 0);
 Assert.assertTrue(metadata.getTotalLogFilesCompacted() > 0);
   }
+
+  @Test
+  public void testCompatibilityWithoutOperateType() throws Exception {
 
 Review comment:
   Thanks for adding the compatibility tests.
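
For readers following along, such a backward-compatibility test typically checks 
that metadata serialized before the new operation-type field existed still 
deserializes cleanly. A hedged sketch (getOperationType is the field the PR adds, 
and the JSON payload here is illustrative):

{code:java}
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.junit.Assert;
import org.junit.Test;

public class CommitMetadataCompatSketch {

  @Test
  public void testCompatibilityWithoutOperationType() throws Exception {
    // Commit metadata JSON written by an older version, with no operation type.
    String legacyJson = "{\"partitionToWriteStats\":{},\"compacted\":false}";
    HoodieCommitMetadata metadata =
        HoodieCommitMetadata.fromJsonString(legacyJson, HoodieCommitMetadata.class);
    Assert.assertNotNull(metadata);
    // Assumed: the new field defaults to null (or an UNKNOWN value) for old data.
    Assert.assertNull(metadata.getOperationType());
  }
}
{code}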


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [MINOR]: alter some wrong params which bring fatal exception

2019-12-30 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2a823f3  [MINOR]: alter some wrong params which bring fatal exception
2a823f3 is described below

commit 2a823f32ee161d3e366979f2060cb93536e8059d
Author: dengziming 
AuthorDate: Mon Dec 30 20:39:03 2019 +0800

[MINOR]: alter some wrong params which bring fatal exception
---
 hudi-client/src/test/java/HoodieClientExample.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-client/src/test/java/HoodieClientExample.java 
b/hudi-client/src/test/java/HoodieClientExample.java
index 38bd8a7..362a4fb 100644
--- a/hudi-client/src/test/java/HoodieClientExample.java
+++ b/hudi-client/src/test/java/HoodieClientExample.java
@@ -95,7 +95,7 @@ public class HoodieClientExample {
 HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(tablePath)
 
.withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA).withParallelism(2, 
2).forTable(tableName)
 
.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(IndexType.BLOOM).build())
-
.withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(2, 
3).build()).build();
+
.withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(20,
 30).build()).build();
 HoodieWriteClient client = new HoodieWriteClient(jsc, cfg);
 
 List recordsSoFar = new ArrayList<>();



[GitHub] [incubator-hudi] bvaradar merged pull request #1158: [MINOR]: alter some wrong params which bring fatal exception

2019-12-30 Thread GitBox
bvaradar merged pull request #1158: [MINOR]: alter some wrong params which 
bring fatal exception
URL: https://github.com/apache/incubator-hudi/pull/1158
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1159: HUDI-479: Eliminate or Minimize use of Guava if possible

2019-12-30 Thread GitBox
cdmikechen commented on a change in pull request #1159: HUDI-479: Eliminate or 
Minimize use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#discussion_r362128256
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/BloomIndexFileInfo.java
 ##
 @@ -78,22 +77,24 @@ public boolean equals(Object o) {
 }
 
 BloomIndexFileInfo that = (BloomIndexFileInfo) o;
-return Objects.equal(that.fileId, fileId) && 
Objects.equal(that.minRecordKey, minRecordKey)
-&& Objects.equal(that.maxRecordKey, maxRecordKey);
+return Objects.equals(that.fileId, fileId) && 
Objects.equals(that.minRecordKey, minRecordKey)
+&& Objects.equals(that.maxRecordKey, maxRecordKey);
 
   }
 
   @Override
   public int hashCode() {
-return Objects.hashCode(fileId, minRecordKey, maxRecordKey);
+return Objects.hash(fileId, minRecordKey, maxRecordKey);
   }
 
   public String toString() {
-final StringBuilder sb = new StringBuilder("BloomIndexFileInfo {");
-sb.append(" fileId=").append(fileId);
-sb.append(" minRecordKey=").append(minRecordKey);
-sb.append(" maxRecordKey=").append(maxRecordKey);
-sb.append('}');
-return sb.toString();
+return "BloomIndexFileInfo {"
 
 Review comment:
   I think the original method may be OK. Do we have to modify it or not?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-91) Replace Databricks spark-avro with native spark-avro #628

2019-12-30 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005859#comment-17005859
 ] 

cdmikechen commented on HUDI-91:


[~vinoth] I've tested that Hive can read the decimal and date column types, but 
timestamp needs some code changes in *WritableTimestampObjectInspector.java* 
in Hive 2; I've opened an issue at 
https://issues.apache.org/jira/browse/HIVE-4.
Hive 3 supports Avro 1.8 and can read Avro logical types correctly.

> Replace Databricks spark-avro with native spark-avro #628
> -
>
> Key: HUDI-91
> URL: https://issues.apache.org/jira/browse/HUDI-91
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/628] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-91) Replace Databricks spark-avro with native spark-avro #628

2019-12-30 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005858#comment-17005858
 ] 

cdmikechen commented on HUDI-91:


I'd like to know the current progress of this PR. Not every user uses Spark 2.4; 
can we combine the two projects (databricks-avro and spark-avro) and simplify 
them into our own internal implementation, which could be compatible with most 
versions of Spark 2?

> Replace Databricks spark-avro with native spark-avro #628
> -
>
> Key: HUDI-91
> URL: https://issues.apache.org/jira/browse/HUDI-91
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/628] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-474) Delta Streamer is not able to read the commit files

2019-12-30 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005806#comment-17005806
 ] 

lamber-ken edited comment on HUDI-474 at 12/30/19 8:18 PM:
---

Hi [~srkhan], can you build the latest master branch and check if the problem 
persists? :)


was (Author: lamber-ken):
Hi [~srkhan], can you build the latest master branch and check if the problem 
persists. :)

> Delta Streamer is not able to read the commit files
> ---
>
> Key: HUDI-474
> URL: https://issues.apache.org/jira/browse/HUDI-474
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Shahida Khan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Gmail - Commit time issue in DeltaStreamer 
> (Real-Time).pdf
>
>
> DeltaStreamer is not able to read the correct commit files when the job 
> is deployed in real time.
> Below is the stack trace: 
> {code:java}
> java.util.concurrent.ExecutionException:
>  org.apache.hudi.exception.HoodieException: Could not read commit
>  details from 
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>       at
>  java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
>    at
>  java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at
>  
> org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
>       at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
>   at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
>   at
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at
>  
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
>   at
>  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    at
>  java.lang.reflect.Method.invoke(Method.java:498)        at
>  
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)Caused
>  by: org.apache.hudi.exception.HoodieException: Could not read commit
>  details from 
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>       at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
>         at
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>      at
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at
>  java.lang.Thread.run(Thread.java:748)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-474) Delta Streamer is not able to read the commit files

2019-12-30 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005806#comment-17005806
 ] 

lamber-ken commented on HUDI-474:
-

Hi [~srkhan], can you build the latest master branch and check if the problem 
persists. :)

> Delta Streamer is not able to read the commit files
> ---
>
> Key: HUDI-474
> URL: https://issues.apache.org/jira/browse/HUDI-474
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Shahida Khan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Gmail - Commit time issue in DeltaStreamer 
> (Real-Time).pdf
>
>
> DeltaStreamer is not able to read the correct commit files when the job 
> is deployed in real time.
> Below is the stack trace: 
> {code:java}
> java.util.concurrent.ExecutionException:
>  org.apache.hudi.exception.HoodieException: Could not read commit
>  details from 
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>       at
>  java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
>    at
>  java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) at
>  
> org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
>       at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
>   at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
>   at
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at
>  
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
>   at
>  
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    at
>  java.lang.reflect.Method.invoke(Method.java:498)        at
>  
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)Caused
>  by: org.apache.hudi.exception.HoodieException: Could not read commit
>  details from 
> hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>       at
>  
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
>         at
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>      at
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at
>  java.lang.Thread.run(Thread.java:748)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-453) Throw failed to archive commits error when writing data to MOR/COW table

2019-12-30 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-453.
-
Resolution: Fixed

Fixed on master at commit 58c5bed40a76189a28dd72bbd67fcefaac587184.

> Throw failed to archive commits error when writing data to MOR/COW table
> 
>
> Key: HUDI-453
> URL: https://issues.apache.org/jira/browse/HUDI-453
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A "failed to archive commits" error is thrown when writing data to a table; here are the steps to reproduce.
> *1, Build from latest source*
> {code:java}
> mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
> {code}
> *2, Write Data*
> {code:java}
> export SPARK_HOME=/work/BigData/install/spark/spark-2.3.3-bin-hadoop2.6
> ${SPARK_HOME}/bin/spark-shell --jars `ls 
> packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar` 
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> import org.apache.spark.sql.SaveMode._
> var datas = List("{ \"name\": \"kenken\", \"ts\": 1574297893836, \"age\": 12, 
> \"location\": \"latitude\"}")
> val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
> df.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "10").
> option("hoodie.upsert.shuffle.parallelism", "10").
> option("hoodie.delete.shuffle.parallelism", "10").
> option("hoodie.bulkinsert.shuffle.parallelism", "10").
> option("hoodie.datasource.write.recordkey.field", "name").
> option("hoodie.datasource.write.partitionpath.field", "location").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.table.name", "hudi_mor_table").
> mode(Overwrite).
> save("file:///tmp/hudi_mor_table")
> {code}
> *3, Append Data*
> {code:java}
> df.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "10").
> option("hoodie.upsert.shuffle.parallelism", "10").
> option("hoodie.delete.shuffle.parallelism", "10").
> option("hoodie.bulkinsert.shuffle.parallelism", "10").
> option("hoodie.datasource.write.recordkey.field", "name").
> option("hoodie.datasource.write.partitionpath.field", "location").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.keep.max.commits", "5").
> option("hoodie.keep.min.commits", "4").
> option("hoodie.cleaner.commits.retained", "3").
> option("hoodie.table.name", "hudi_mor_table").
> mode(Append).
> save("file:///tmp/hudi_mor_table")
> {code}
> *4, Repeat the Append Data operation (above) about six times, and you will get the following stack trace*
> {code:java}
> 19/12/23 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, .commit file: 20191224004558.clean.requested
> java.io.IOException: Not an Avro data file
> at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> at org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
> at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
> at org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
> at org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at

[jira] [Updated] (HUDI-479) Eliminate use of guava if possible

2019-12-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-479:

Labels: pull-request-available  (was: )

> Eliminate use of guava if possible
> --
>
> Key: HUDI-479
> URL: https://issues.apache.org/jira/browse/HUDI-479
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #626: Adding documentation for hudi test suite

2019-12-30 Thread GitBox
vinothchandar commented on issue #626: Adding documentation for hudi test suite
URL: https://github.com/apache/incubator-hudi/pull/626#issuecomment-569763262
 
 
   That makes a lot of sense.. +1 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-479) Eliminate use of guava if possible

2019-12-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005515#comment-17005515
 ] 

Vinoth Chandar commented on HUDI-479:
-

Spark 2.4.4 still supports Java8 [https://spark.apache.org/docs/2.4.4/] 

 

"Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 
2.4.4 uses Scala 2.12. You will need to use a compatible Scala version 
(2.12.x)."

 

So I don't think we can drop Java 8 right away.

> Eliminate use of guava if possible
> --
>
> Key: HUDI-479
> URL: https://issues.apache.org/jira/browse/HUDI-479
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-487) Unit tests for hudi-cli

2019-12-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-487:

Status: Open  (was: New)

> Unit tests for hudi-cli
> ---
>
> Key: HUDI-487
> URL: https://issues.apache.org/jira/browse/HUDI-487
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI, Testing
>Reporter: Vinoth Chandar
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1138: [HUDI-470] Fix NPE when print result via hudi-cli

2019-12-30 Thread GitBox
vinothchandar commented on issue #1138: [HUDI-470] Fix NPE when print result 
via hudi-cli
URL: https://github.com/apache/incubator-hudi/pull/1138#issuecomment-569755666
 
 
   https://issues.apache.org/jira/browse/HUDI-487 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-487) Unit tests for hudi-cli

2019-12-30 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-487:
---

 Summary: Unit tests for hudi-cli
 Key: HUDI-487
 URL: https://issues.apache.org/jira/browse/HUDI-487
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: CLI, Testing
Reporter: Vinoth Chandar






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-481) Support SQL-like method

2019-12-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005507#comment-17005507
 ] 

Vinoth Chandar commented on HUDI-481:
-

I am not sure if `CLI` is the right component for this. First few questions 
before I can triage this.. 

 
 * Is this intended to be a Spark API? We have thought about adding support in 
Spark SQL to specify the merge logic vs HoodieRecordPayload interface.. This 
sounds similar. 
 * I think we need to move towards Spark Datasource V2 api first.. and then 
rethink how this will fit in HUDI-30

> Support SQL-like method
> ---
>
> Key: HUDI-481
> URL: https://issues.apache.org/jira/browse/HUDI-481
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: cdmikechen
>Priority: Minor
>
> As we know, Hudi uses the Spark datasource API to upsert data. For example, if we want to update a row, we need to fetch the old row's data first and then use the upsert method to update it.
> But there is another situation where someone just wants to update one column of data. Described as SQL, it is {{update table set col1 = X where col2 = Y}}. This is something Hudi cannot handle directly at present; we can only read all the data involved as a dataset first and then merge it.
> So I think maybe we can create a new subproject to process batch data in an SQL-like way. For example:
>  {code}
> val hudiTable = new HudiTable(path)
> hudiTable.update.set("col1 = X").where("col2 = Y")
> hudiTable.delete.where("col3 = Z")
> hudiTable.commit
> {code}
> It could also be extended to support JDBC-like RFC schemes: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]
> I hope everyone can provide some suggestions on whether this plan is feasible.
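
For context, a minimal sketch (Java; the 0.5.x Spark datasource is assumed, and the table path and field names are illustrative) of the read-modify-upsert round trip that the description says is required today for {{update table set col1 = X where col2 = Y}}:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.lit;

public class ColumnUpdateSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("column-update-sketch").getOrCreate();
    String basePath = "file:///tmp/hudi_table"; // illustrative

    // 1. Read only the rows matching the predicate ("where col2 = Y").
    Dataset<Row> affected = spark.read().format("org.apache.hudi")
        .load(basePath + "/*")
        .where("col2 = 'Y'");

    // 2. Rewrite the target column ("set col1 = X").
    Dataset<Row> updated = affected.withColumn("col1", lit("X"));

    // 3. Upsert the modified rows back; the record key routes them to existing file groups.
    updated.write().format("org.apache.hudi")
        .option("hoodie.datasource.write.recordkey.field", "name")         // illustrative fields
        .option("hoodie.datasource.write.partitionpath.field", "location")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.table.name", "hudi_table")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}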



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-480) Support querying deleted data in incremental view

2019-12-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005506#comment-17005506
 ] 

Vinoth Chandar commented on HUDI-480:
-

Do you think this is specific to deletes? Could we generalize this to a new config, say 'include.before.image=true', in incremental pull, where you get two values per record (see the sketch after the table)? Currently, you only get one value per record upserted/deleted.

 

 
||Operation||include.before.image=false||include.before.image=true||
|insert|new_value_inserted|[null, new_value_inserted]|
|update/soft delete|new_value_updated|[old_value, new_value]|
|hard delete|May not get anything today.|[deleted_value, null]|
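
For reference, a minimal sketch of what incremental pull returns today, which the proposed include.before.image flag would extend; the option keys assume the 0.5.x datasource and the begin instant time and path are illustrative:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPullSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("incremental-pull-sketch").getOrCreate();

    // Incremental view: one row per record upserted after the given instant time.
    // Hard-deleted records may not show up at all today, per the table above.
    Dataset<Row> incremental = spark.read().format("org.apache.hudi")
        .option("hoodie.datasource.view.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20191226153400") // illustrative
        .load("file:///tmp/hudi_table");

    incremental.select("_hoodie_commit_time", "_hoodie_record_key").show();
  }
}
{code}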

 

> Support querying deleted data in incremental view
> --
>
> Key: HUDI-480
> URL: https://issues.apache.org/jira/browse/HUDI-480
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Incremental Pull
>Reporter: cdmikechen
>Priority: Minor
>
> As we know, Hudi supports many ways to query data in Spark, Hive, and Presto. It also provides a very good timeline concept to trace changes in data, which can be used to query incremental data in the incremental view.
> Previously we only had insert and update functions to upsert data; now we have added new functions to delete existing data:
> *[HUDI-328] Adding delete api to HoodieWriteClient* https://github.com/apache/incubator-hudi/pull/1004
> *[HUDI-377] Adding Delete() support to DeltaStreamer* https://github.com/apache/incubator-hudi/pull/1073
> So, now that we have a delete API, should we add another method to get deleted data in the incremental view?
> I have looked at the methods for generating new parquet files. The main idea is to combine old and new data and then filter out the data that needs to be deleted, so that the deleted data does not exist in the new dataset. However, this means the deleted data is not retained in the new dataset, so during data tracing in the incremental view only the inserted or modified data can be found via the existing timestamp field.
> If we do this, I feel there are two ideas to consider:
> 1. Trace the dataset in the same file at different time checkpoints according to the timeline, compare the two datasets by key, and filter out the deleted data (see the sketch after this description). This method adds no write-time overhead, but it needs to run the comparison for every query, which is expensive.
> 2. When writing data, record any deleted data in a file named like *.delete_filename_version_timestamp*, so queries can be answered immediately from the timeline. But this does additional processing at write time.
>  
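
A hedged sketch of idea 1 above: diff two snapshots of the same table by record key, so that keys present earlier but absent now surface as hard deletes. How the earlier snapshot is obtained is left open, and the paths are illustrative:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeletedKeysSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("deleted-keys-sketch").getOrCreate();

    // Two snapshots of the same table at different checkpoints (illustrative paths).
    Dataset<Row> before = spark.read().format("org.apache.hudi").load("file:///tmp/tbl_t1/*");
    Dataset<Row> after = spark.read().format("org.apache.hudi").load("file:///tmp/tbl_t2/*");

    // Keys present at t1 but missing at t2 were hard-deleted in between.
    Dataset<Row> deleted = before.join(after,
        before.col("_hoodie_record_key").equalTo(after.col("_hoodie_record_key")),
        "left_anti");
    deleted.show();
  }
}
{code}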



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-476) Add a hudi-examples module

2019-12-30 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005496#comment-17005496
 ] 

Vinoth Chandar commented on HUDI-476:
-

I agree this is valid to do; I have accepted the issue. Can we flesh this out with at least 5-6 examples that we can showcase first, and agree on them before we proceed to implementation?

> Add a hudi-examples module
> --
>
> Key: HUDI-476
> URL: https://issues.apache.org/jira/browse/HUDI-476
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: dengziming
>Assignee: dengziming
>Priority: Major
>
> Add a hudi-examples module containing some example code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-476) Add a hudi-examples module

2019-12-30 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-476:

Status: Open  (was: New)

> Add a hudi-examples module
> --
>
> Key: HUDI-476
> URL: https://issues.apache.org/jira/browse/HUDI-476
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: dengziming
>Assignee: dengziming
>Priority: Major
>
> Add a hudi-examples module containing some example code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-446) Refactor the codes based on scala codestyle PublicMethodsHaveTypeChecker rule

2019-12-30 Thread Leping Huang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005480#comment-17005480
 ] 

Leping Huang commented on HUDI-446:
---

Hi,

OK

> Refactor the codes based on scala codestyle PublicMethodsHaveTypeChecker rule
> -
>
> Key: HUDI-446
> URL: https://issues.apache.org/jira/browse/HUDI-446
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: Leping Huang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-479) Eliminate use of guava if possible

2019-12-30 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-479:
---
Issue Type: Improvement  (was: Bug)

> Eliminate use of guava if possible
> --
>
> Key: HUDI-479
> URL: https://issues.apache.org/jira/browse/HUDI-479
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-479) Eliminate use of guava if possible

2019-12-30 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-479:
---
Status: In Progress  (was: Open)

> Eliminate use of guava if possible
> --
>
> Key: HUDI-479
> URL: https://issues.apache.org/jira/browse/HUDI-479
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] XuQianJin-Stars edited a comment on issue #1106: [HUDI-209] Implement JMX metrics reporter

2019-12-30 Thread GitBox
XuQianJin-Stars edited a comment on issue #1106: [HUDI-209] Implement JMX 
metrics reporter
URL: https://github.com/apache/incubator-hudi/pull/1106#issuecomment-569602971
 
 
   hi @vinothchandar @leesf  can you continue to review this PR? Thank you very 
much.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005392#comment-17005392
 ] 

Pratyaksh Sharma commented on HUDI-486:
---

Here is the working command - java -cp 
/tmp/jars/commons-io-2.6.jar:/tmp/jars/protobuf-java-2.5.0.jar:/tmp/jars/commons-cli-1.2.jar:/tmp/jars/htrace-core-3.1.0-incubating.jar:/tmp/jars/hadoop-hdfs-2.7.3.jar:/tmp/jars/hive-serde-2.3.1.jar:/tmp/jars/hive-common-2.3.1.jar:/tmp/jars/hive-service-2.3.1.jar:/tmp/jars/hive-service-rpc-2.3.1.jar:/tmp/jars/httpcore-4.3.2.jar:/tmp/jars/libthrift-0.12.0.jar:/tmp/jars/slf4j-api-1.7.16.jar:/tmp/jars/hadoop-auth-2.7.3.jar:/var/hoodie/ws/docker/demo/config/commons-lang-2.6.jar:/var/hoodie/ws/docker/demo/config/commons-configuration-1.6.jar:/var/hoodie/ws/docker/demo/config/commons-collections-3.2.2.jar:/var/hoodie/ws/docker/demo/config/guava-15.0.jar:/var/hoodie/ws/docker/demo/config/commons-logging-1.1.3.jar:/var/hoodie/ws/docker/demo/config/hadoop-common-2.7.3.jar:/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:/var/hoodie/ws/docker/demo/config/antlr-runtime-3.5.2.jar:$HUDI_UTILITIES_BUNDLE
 org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive --extractSQLFile 
/var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb default 
--sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
--fromCommitTime 0 --maxCommits 1

Options - 
 # Mention the above command somewhere in the docs
 # List the jars needed
 # Create a script similar to the one we have for HiveSyncTool.

We can choose any one of the above. [~vinoth] WDYT?

> Improve documentation for using HiveIncrementalPuller
> -
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> To use HiveIncrementalPuller, one needs a lot of jars on the classpath, and these jars are not listed anywhere. As a result, one has to keep adding jars to the classpath as each NoClassDefFoundError comes up during execution.
> We should list the jars needed so that it becomes easy for a first-time user to use this tool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005392#comment-17005392
 ] 

Pratyaksh Sharma edited comment on HUDI-486 at 12/30/19 3:23 PM:
-

Here is the working command - 

java -cp 
/tmp/jars/commons-io-2.6.jar:/tmp/jars/protobuf-java-2.5.0.jar:/tmp/jars/commons-cli-1.2.jar:/tmp/jars/htrace-core-3.1.0-incubating.jar:/tmp/jars/hadoop-hdfs-2.7.3.jar:/tmp/jars/hive-serde-2.3.1.jar:/tmp/jars/hive-common-2.3.1.jar:/tmp/jars/hive-service-2.3.1.jar:/tmp/jars/hive-service-rpc-2.3.1.jar:/tmp/jars/httpcore-4.3.2.jar:/tmp/jars/libthrift-0.12.0.jar:/tmp/jars/slf4j-api-1.7.16.jar:/tmp/jars/hadoop-auth-2.7.3.jar:/var/hoodie/ws/docker/demo/config/commons-lang-2.6.jar:/var/hoodie/ws/docker/demo/config/commons-configuration-1.6.jar:/var/hoodie/ws/docker/demo/config/commons-collections-3.2.2.jar:/var/hoodie/ws/docker/demo/config/guava-15.0.jar:/var/hoodie/ws/docker/demo/config/commons-logging-1.1.3.jar:/var/hoodie/ws/docker/demo/config/hadoop-common-2.7.3.jar:/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:/var/hoodie/ws/docker/demo/config/antlr-runtime-3.5.2.jar:$HUDI_UTILITIES_BUNDLE
 org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive --extractSQLFile 
/var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb default 
--sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
--fromCommitTime 0 --maxCommits 1

Options - 
 # Mention the above command somewhere in the docs
 # List the jars needed
 # Create a script similar to the one we have for HiveSyncTool.

We can choose any one of the above. [~vinoth] WDYT?


was (Author: pratyaksh):
Here is the working command - java -cp 
/tmp/jars/commons-io-2.6.jar:/tmp/jars/protobuf-java-2.5.0.jar:/tmp/jars/commons-cli-1.2.jar:/tmp/jars/htrace-core-3.1.0-incubating.jar:/tmp/jars/hadoop-hdfs-2.7.3.jar:/tmp/jars/hive-serde-2.3.1.jar:/tmp/jars/hive-common-2.3.1.jar:/tmp/jars/hive-service-2.3.1.jar:/tmp/jars/hive-service-rpc-2.3.1.jar:/tmp/jars/httpcore-4.3.2.jar:/tmp/jars/libthrift-0.12.0.jar:/tmp/jars/slf4j-api-1.7.16.jar:/tmp/jars/hadoop-auth-2.7.3.jar:/var/hoodie/ws/docker/demo/config/commons-lang-2.6.jar:/var/hoodie/ws/docker/demo/config/commons-configuration-1.6.jar:/var/hoodie/ws/docker/demo/config/commons-collections-3.2.2.jar:/var/hoodie/ws/docker/demo/config/guava-15.0.jar:/var/hoodie/ws/docker/demo/config/commons-logging-1.1.3.jar:/var/hoodie/ws/docker/demo/config/hadoop-common-2.7.3.jar:/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:/var/hoodie/ws/docker/demo/config/antlr-runtime-3.5.2.jar:$HUDI_UTILITIES_BUNDLE
 org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive --extractSQLFile 
/var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb default 
--sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
--fromCommitTime 0 --maxCommits 1

Options - 
 # Mention the above command somewhere
 # List down the jars needed
 # create a script similar to how we have for HiveSyncTool. 

We can choose either one of the above. [~vinoth] WDYT?

> Improve documentation for using HiveIncrementalPuller
> -
>
> Key: HUDI-486
> URL: https://issues.apache.org/jira/browse/HUDI-486
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> To use HiveIncrementalPuller, one needs a lot of jars on the classpath, and these jars are not listed anywhere. As a result, one has to keep adding jars to the classpath as each NoClassDefFoundError comes up during execution.
> We should list the jars needed so that it becomes easy for a first-time user to use this tool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-486) Improve documentation for using HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)
Pratyaksh Sharma created HUDI-486:
-

 Summary: Improve documentation for using HiveIncrementalPuller
 Key: HUDI-486
 URL: https://issues.apache.org/jira/browse/HUDI-486
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Incremental Pull
Reporter: Pratyaksh Sharma
Assignee: Pratyaksh Sharma


To use HiveIncrementalPuller, one needs a lot of jars on the classpath, and these jars are not listed anywhere. As a result, one has to keep adding jars to the classpath as each NoClassDefFoundError comes up during execution.

We should list the jars needed so that it becomes easy for a first-time user to use this tool.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-485) Check for where clause is wrong in HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)
Pratyaksh Sharma created HUDI-485:
-

 Summary: Check for where clause is wrong in HiveIncrementalPuller
 Key: HUDI-485
 URL: https://issues.apache.org/jira/browse/HUDI-485
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Incremental Pull, newbie
Reporter: Pratyaksh Sharma
Assignee: Pratyaksh Sharma


HiveIncrementalPuller checks the clause in incrementalSqlFile like this -> 

if (!incrementalSQL.contains("`_hoodie_commit_time` > '%targetBasePath'")) {
  LOG.info("Incremental SQL : " + incrementalSQL
      + " does not contain `_hoodie_commit_time` > %targetBasePath. Please add "
      + "this clause for incremental to work properly.");
  throw new HoodieIncrementalPullSQLException(
      "Incremental SQL does not have clause `_hoodie_commit_time` > '%targetBasePath', which "
          + "means its not pulling incrementally");
}

Basically, we are trying to add a placeholder here which is later replaced with config.fromCommitTime here - 

incrementalPullSQLtemplate.add("incrementalSQL", String.format(incrementalSQL, config.fromCommitTime));

Hence, the above check needs to be replaced with `_hoodie_commit_time` > %targetBasePath.
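
A minimal sketch of the corrected check as the description proposes it (the exception type is simplified here, and the SQL string is illustrative):

{code:java}
public class IncrementalSqlCheckSketch {
  public static void main(String[] args) {
    // Illustrative SQL carrying the placeholder the description proposes to check for.
    String incrementalSQL =
        "select * from db.tbl where `_hoodie_commit_time` > %targetBasePath";

    // Corrected check: the placeholder without the surrounding single quotes,
    // matching the template that is later filled with config.fromCommitTime.
    if (!incrementalSQL.contains("`_hoodie_commit_time` > %targetBasePath")) {
      throw new IllegalArgumentException(
          "Incremental SQL does not have clause `_hoodie_commit_time` > %targetBasePath, "
              + "which means it is not pulling incrementally");
    }
    System.out.println("Incremental SQL check passed.");
  }
}
{code}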



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-484) NPE in HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma reassigned HUDI-484:
-

Assignee: Pratyaksh Sharma

> NPE in HiveIncrementalPuller
> 
>
> Key: HUDI-484
> URL: https://issues.apache.org/jira/browse/HUDI-484
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screenshot 2019-12-30 at 4.43.51 PM.png
>
>
> When we try to use the HiveIncrementalPuller class to incrementally pull changes from Hive, it throws an NPE as it is unable to find IncrementalPull.sqltemplate in the bundled jar.
> The attached screenshot shows the exception, even though the jar does contain the template.
> Steps to reproduce - 
>  # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
>  # run cd docker && ./setup_demo.sh
>  # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks 
> -P
>  #  {{docker exec -it adhoc-2 /bin/bash}}
>  #  {{spark-submit --class 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
> org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts 
> --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
> stock_ticks_cow --props /var/demo/config/kafka-source.properties 
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
>  #  {{/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url 
> jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt 
> --base-path /user/hive/warehouse/stock_ticks_cow --database default --table 
> stock_ticks_cow}}
>  # java -cp 
> /var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:$HUDI_UTILITIES_BUNDLE
>  org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
> jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive 
> --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb 
> default --sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
> --fromCommitTime 0 --maxCommits 1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-484) NPE in HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005374#comment-17005374
 ] 

Pratyaksh Sharma commented on HUDI-484:
---

The above problem gets fixed by changing the line

String templateContent = FileIOUtils.readAsUTFString(this.getClass().getResourceAsStream("IncrementalPull.sqltemplate"));

to

String templateContent = FileIOUtils.readAsUTFString(this.getClass().getResourceAsStream("/IncrementalPull.sqltemplate"));
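
The difference is standard Java resource-lookup semantics: without a leading slash, Class.getResourceAsStream resolves relative to the class's package; with the slash, it resolves from the classpath root, where the bundled template lives. A small sketch (class name illustrative):

{code:java}
import java.io.InputStream;

public class ResourceLookupSketch {
  public static void main(String[] args) {
    // Without a leading slash: resolved relative to this class's package,
    // e.g. org/apache/hudi/utilities/IncrementalPull.sqltemplate in the jar.
    InputStream relative =
        ResourceLookupSketch.class.getResourceAsStream("IncrementalPull.sqltemplate");

    // With a leading slash: resolved from the classpath root,
    // i.e. IncrementalPull.sqltemplate at the top of the bundled jar.
    InputStream absolute =
        ResourceLookupSketch.class.getResourceAsStream("/IncrementalPull.sqltemplate");

    // getResourceAsStream returns null for a missing resource, which is what
    // later surfaced as the NPE described in this issue.
    System.out.println("relative found: " + (relative != null));
    System.out.println("absolute found: " + (absolute != null));
  }
}
{code}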

> NPE in HiveIncrementalPuller
> 
>
> Key: HUDI-484
> URL: https://issues.apache.org/jira/browse/HUDI-484
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screenshot 2019-12-30 at 4.43.51 PM.png
>
>
> When we try to use the HiveIncrementalPuller class to incrementally pull changes from Hive, it throws an NPE as it is unable to find IncrementalPull.sqltemplate in the bundled jar.
> The attached screenshot shows the exception, even though the jar does contain the template.
> Steps to reproduce - 
>  # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
>  # run cd docker && ./setup_demo.sh
>  # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks 
> -P
>  #  {{docker exec -it adhoc-2 /bin/bash}}
>  #  {{spark-submit --class 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
> org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts 
> --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
> stock_ticks_cow --props /var/demo/config/kafka-source.properties 
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
>  #  {{/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url 
> jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt 
> --base-path /user/hive/warehouse/stock_ticks_cow --database default --table 
> stock_ticks_cow}}
>  # java -cp 
> /var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:$HUDI_UTILITIES_BUNDLE
>  org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
> jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive 
> --extractSQLFile /var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb 
> default --sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
> --fromCommitTime 0 --maxCommits 1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] dengziming commented on a change in pull request #1158: [MINOR]: alter some wrong params which bring fatal exception

2019-12-30 Thread GitBox
dengziming commented on a change in pull request #1158: [MINOR]: alter some 
wrong params which bring fatal exception
URL: https://github.com/apache/incubator-hudi/pull/1158#discussion_r361977514
 
 

 ##
 File path: hudi-client/src/test/java/HoodieClientExample.java
 ##
 @@ -95,7 +95,7 @@ public void run() throws Exception {
     HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(tablePath)
         .withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA).withParallelism(2, 2).forTable(tableName)
         .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(IndexType.BLOOM).build())
-        .withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(2, 3).build()).build();
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(20, 30).build()).build();
 
 Review comment:
   if we set minToKeep=2, the app will exit with `java.lang.IllegalArgumentException: Increase hoodie.keep.min.commits=2 to be greater than hoodie.cleaner.commits.retained=10. Otherwise, there is risk of incremental pull missing data from few instants`.
   Better to just set them to DEFAULT_MIN_COMMITS_TO_KEEP and DEFAULT_MAX_COMMITS_TO_KEEP.
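
A minimal sketch of a configuration that passes the build() check, assuming the builder methods shown in the diff above (retainCommits is assumed to live on the same config builder) and making the retained-commits default of 10 explicit:

{code:java}
import org.apache.hudi.config.HoodieCompactionConfig;
import org.apache.hudi.config.HoodieWriteConfig;

public class ArchiveConfigSketch {
  public static void main(String[] args) {
    // min commits to keep (11) must exceed cleaner commits retained (10),
    // otherwise build() fails with the IllegalArgumentException quoted above.
    HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
        .withPath("file:///tmp/hudi_table") // illustrative
        .withCompactionConfig(HoodieCompactionConfig.newBuilder()
            .retainCommits(10)          // hoodie.cleaner.commits.retained
            .archiveCommitsWith(11, 30) // hoodie.keep.min.commits / hoodie.keep.max.commits
            .build())
        .build();
    System.out.println(cfg.getBasePath());
  }
}
{code}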


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] dengziming opened a new pull request #1158: [MINOR]: alter some wrong params which bring fatal exception

2019-12-30 Thread GitBox
dengziming opened a new pull request #1158: [MINOR]: alter some wrong params 
which bring fatal exception
URL: https://github.com/apache/incubator-hudi/pull/1158
 
 
   
   ## What is the purpose of the pull request
   
   the `HoodieCompactionConfig.build()` method will check arguments to ensure MIN_COMMITS_TO_KEEP_PROP > CLEANER_COMMITS_RETAINED_PROP, and DEFAULT_CLEANER_COMMITS_RETAINED = "10", so we need to set MIN_COMMITS_TO_KEEP_PROP to at least 11.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-471) hudi quickstart spark-shell local

2019-12-30 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005295#comment-17005295
 ] 

Pratyaksh Sharma commented on HUDI-471:
---

[~liujinhui] Where exactly did you find this script? Can you please share the 
link of the page? Thanks

> hudi quickstart spark-shell local
> -
>
> Key: HUDI-471
> URL: https://issues.apache.org/jira/browse/HUDI-471
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: liujinhui
>Priority: Minor
>  Labels: docement
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I think the following startup script is not complete; we need to specify the startup mode (master) explicitly.
> {{bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}
> Updated version:
> bin/spark-shell --master local --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> Writing it this way may be easier for newcomers to follow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-473) QuickstartUtils

2019-12-30 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005290#comment-17005290
 ] 

Pratyaksh Sharma commented on HUDI-473:
---

[~zhangpu-paul] Can you please elaborate on the steps to reproduce? Does the dataGen object refer to the HoodieTestDataGenerator.java class? Have you done any code refactoring where you reset numExistingKeys?

Also, the stack trace does not match the lines in the current master branch. Can you please update the stack trace after trying it on the current master branch?

> QuickstartUtils 
> 
>
> Key: HUDI-473
> URL: https://issues.apache.org/jira/browse/HUDI-473
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: zhangpu
>Priority: Minor
>  Labels: starter
>
>  First, one process calls dataGen.generateInserts to write the data; then another process calls dataGen.generateUpdates, which throws the following exception:
> Exception in thread "main" java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at org.apache.hudi.QuickstartUtils$DataGenerator.generateUpdates(QuickstartUtils.java:163)
> Is the design reasonable?
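
For context, a minimal sketch of the failure mode described above, assuming (per the comment on this issue) that the generator tracks inserted keys in memory, so a fresh process sees zero existing keys:

{code:java}
import java.util.Random;

public class BoundSketch {
  public static void main(String[] args) {
    int numExistingKeys = 0; // a fresh JVM has generated no inserts yet
    Random rand = new Random();
    // Random.nextInt(bound) requires bound > 0, hence
    // "java.lang.IllegalArgumentException: bound must be positive".
    int index = rand.nextInt(numExistingKeys);
    System.out.println(index);
  }
}
{code}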



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-484) NPE in HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-484:
--
Description: 
When we try to use the HiveIncrementalPuller class to incrementally pull changes from Hive, it throws an NPE as it is unable to find IncrementalPull.sqltemplate in the bundled jar.

The attached screenshot shows the exception, even though the jar does contain the template.

Steps to reproduce - 
 # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
 # run cd docker && ./setup_demo.sh
 # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
 #  {{docker exec -it adhoc-2 /bin/bash}}
 #  {{spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts 
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties 
--schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
 #  {{/var/hoodie/ws/hudi-hive/run_sync_tool.sh --jdbc-url 
jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt 
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table 
stock_ticks_cow}}
 # java -cp 
/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:$HUDI_UTILITIES_BUNDLE
 org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive --extractSQLFile 
/var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb default 
--sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
--fromCommitTime 0 --maxCommits 1

  was:
When we try to use HiveIncrementalPuller class to incrementally pull changes 
from hive, it throws NPE as it is unable to find IncrementalPull.sqltemplate in 
the bundled jar. 

Screenshot attached which shows the exception. 

The jar contains the template. 

Steps to reproduce - 
 # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
 # run cd docker && ./setup_demo.sh
 # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
 #  {{docker exec -it adhoc-2 /bin/bash}}
 #  {{spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties 
--schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
 #  {{/var/hoodie/ws/hudi-hive/run_sync_tool.sh  --jdbc-url 
jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt 
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table 
stock_ticks_cow}}
 # java -cp 
/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:$HUDI_UTILITIES_BUNDLE
 org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive --extractSQLFile 
/var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb default 
--sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
--fromCommitTime 0 --maxCommits 1


> NPE in HiveIncrementalPuller
> 
>
> Key: HUDI-484
> URL: https://issues.apache.org/jira/browse/HUDI-484
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Pratyaksh Sharma
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screenshot 2019-12-30 at 4.43.51 PM.png
>
>
> When we try to use the HiveIncrementalPuller class to incrementally pull changes from Hive, it throws an NPE as it is unable to find IncrementalPull.sqltemplate in the bundled jar.
> The attached screenshot shows the exception, even though the jar does contain the template.
> Steps to reproduce - 
>  # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
>  # run cd docker && ./setup_demo.sh
>  # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks 
> -P
>  #  {{docker exec -it adhoc-2 /bin/bash}}
>  #  {{spark-submit --class 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> $HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
> org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts 
> --target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
> stock_ticks_cow --props /var/demo/config/kafka-source.properties 
> --schemaprovider-class 
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
> 

[jira] [Created] (HUDI-484) NPE in HiveIncrementalPuller

2019-12-30 Thread Pratyaksh Sharma (Jira)
Pratyaksh Sharma created HUDI-484:
-

 Summary: NPE in HiveIncrementalPuller
 Key: HUDI-484
 URL: https://issues.apache.org/jira/browse/HUDI-484
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Incremental Pull
Reporter: Pratyaksh Sharma
 Fix For: 0.5.1
 Attachments: Screenshot 2019-12-30 at 4.43.51 PM.png

When we try to use the HiveIncrementalPuller class to incrementally pull changes from Hive, it throws an NPE as it is unable to find IncrementalPull.sqltemplate in the bundled jar.

The attached screenshot shows the exception, even though the jar does contain the template.

Steps to reproduce - 
 # copy hive-jdbc-2.3.1.jar, log4j-1.2.17.jar to docker/demo/config folder
 # run cd docker && ./setup_demo.sh
 # cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
 #  {{docker exec -it adhoc-2 /bin/bash}}
 #  {{spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
$HUDI_UTILITIES_BUNDLE --storage-type COPY_ON_WRITE --source-class 
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts  
--target-base-path /user/hive/warehouse/stock_ticks_cow --target-table 
stock_ticks_cow --props /var/demo/config/kafka-source.properties 
--schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider}}
 #  {{/var/hoodie/ws/hudi-hive/run_sync_tool.sh  --jdbc-url 
jdbc:hive2://hiveserver:1 --user hive --pass hive --partitioned-by dt 
--base-path /user/hive/warehouse/stock_ticks_cow --database default --table 
stock_ticks_cow}}
 # java -cp 
/var/hoodie/ws/docker/demo/config/hive-jdbc-2.3.1.jar:/var/hoodie/ws/docker/demo/config/log4j-1.2.17.jar:$HUDI_UTILITIES_BUNDLE
 org.apache.hudi.utilities.HiveIncrementalPuller --hiveUrl 
jdbc:hive2://hiveserver:1 --hiveUser hive --hivePass hive --extractSQLFile 
/var/hoodie/ws/docker/demo/config/incr_pull.txt --sourceDb default 
--sourceTable stock_ticks_cow --targetDb tmp --targetTable tempTable 
--fromCommitTime 0 --maxCommits 1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hmatu commented on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2019-12-30 Thread GitBox
hmatu commented on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-569622110
 
 
   @hddong can you explain why you added the operation type to HoodieCommitMetadata? Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-474) Delta Streamer is not able to read the commit files

2019-12-30 Thread Shahida Khan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005213#comment-17005213
 ] 

Shahida Khan commented on HUDI-474:
---

[~vbalaji]: I have shared the stack trace with you again.
Also, regarding your point about using the config "hoodie.timeline.layout.version": I have not used it, nor the old version.

> Delta Streamer is not able to read the commit files
> ---
>
> Key: HUDI-474
> URL: https://issues.apache.org/jira/browse/HUDI-474
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Shahida Khan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Gmail - Commit time issue in DeltaStreamer 
> (Real-Time).pdf
>
>
> DeltaStreamer is not able to read the correct commit files when the job is deployed in real time.
> Below is the stack trace:
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: Could not read commit details from hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>   at org.apache.hudi.utilities.deltastreamer.AbstractDeltaStreamerService.waitForShutdown(AbstractDeltaStreamerService.java:72)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:117)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:297)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
> Caused by: org.apache.hudi.exception.HoodieException: Could not read commit details from hdfs:/user/hive/warehouse/hudi.db/tbltest/.hoodie/.aux/20191226153400.clean.requested
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:411)
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] XuQianJin-Stars commented on a change in pull request #1152: [HUDI-454] Redo hudi-cli log statements using SLF4J

2019-12-30 Thread GitBox
XuQianJin-Stars commented on a change in pull request #1152: [HUDI-454] Redo 
hudi-cli log statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1152#discussion_r361928359
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/InputStreamConsumer.java
 ##
 @@ -46,7 +47,7 @@ public void run() {
 @@ -46,7 +47,7 @@ public void run() {
 LOG.info(line);
   }
 } catch (IOException ioe) {
-  LOG.severe(ioe.toString());
+  LOG.error(ioe.toString());
 
 Review comment:
   Yes, you are right; the `toString()` call can be removed.
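
A small sketch of the direction this review is converging on, assuming SLF4J: pass the throwable itself so the stack trace is preserved, with no toString() at all:

{code:java}
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogErrorSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LogErrorSketch.class);

  public static void main(String[] args) {
    try {
      throw new IOException("stream closed"); // stand-in for the real read loop
    } catch (IOException ioe) {
      // SLF4J logs the message plus the full stack trace; no toString() needed.
      LOG.error("Error while consuming input stream", ioe);
    }
  }
}
{code}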


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1152: [HUDI-454] Redo hudi-cli log statements using SLF4J

2019-12-30 Thread GitBox
yanghua commented on a change in pull request #1152: [HUDI-454] Redo hudi-cli 
log statements using SLF4J
URL: https://github.com/apache/incubator-hudi/pull/1152#discussion_r361926941
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/InputStreamConsumer.java
 ##
 @@ -46,7 +47,7 @@ public void run() {
 @@ -46,7 +47,7 @@ public void run() {
 LOG.info(line);
   }
 } catch (IOException ioe) {
-  LOG.severe(ioe.toString());
+  LOG.error(ioe.toString());
 
 Review comment:
   I have no doubt about `SEVERE -> ERROR`. What I mean is: can we remove the ~.toString()~ on ioe?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-477) Add HoodieClient Example code to hudi-examples

2019-12-30 Thread dengziming (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dengziming reassigned HUDI-477:
---

Assignee: dengziming

> Add HoodieClient Example code to hudi-examples
> --
>
> Key: HUDI-477
> URL: https://issues.apache.org/jira/browse/HUDI-477
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: starter
>
> The code in incubator-hudi/hudi-client/src/test/java/HoodieClientExample.java 
> could be reused, but it relies on 2 test classes (HoodieTestDataGenerator and 
> HoodieClientTestUtils), so it may be complex.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-453) Throw failed to archive commits error when writing data to MOR/COW table

2019-12-30 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005188#comment-17005188
 ] 

lamber-ken commented on HUDI-453:
-

Hi [~gurudatt], fixed at https://issues.apache.org/jira/browse/HUDI-453

> Throw failed to archive commits error when writing data to MOR/COW table
> 
>
> Key: HUDI-453
> URL: https://issues.apache.org/jira/browse/HUDI-453
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A "failed to archive commits" error is thrown when writing data to a table; here are the steps to reproduce.
> *1, Build from latest source*
> {code:java}
> mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
> {code}
> *2, Write Data*
> {code:java}
> export SPARK_HOME=/work/BigData/install/spark/spark-2.3.3-bin-hadoop2.6
> ${SPARK_HOME}/bin/spark-shell --jars `ls 
> packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar` 
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> import org.apache.spark.sql.SaveMode._
> var datas = List("{ \"name\": \"kenken\", \"ts\": 1574297893836, \"age\": 12, 
> \"location\": \"latitude\"}")
> val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
> df.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "10").
> option("hoodie.upsert.shuffle.parallelism", "10").
> option("hoodie.delete.shuffle.parallelism", "10").
> option("hoodie.bulkinsert.shuffle.parallelism", "10").
> option("hoodie.datasource.write.recordkey.field", "name").
> option("hoodie.datasource.write.partitionpath.field", "location").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.table.name", "hudi_mor_table").
> mode(Overwrite).
> save("file:///tmp/hudi_mor_table")
> {code}
> *3, Append Data*
> {code:java}
> df.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "10").
> option("hoodie.upsert.shuffle.parallelism", "10").
> option("hoodie.delete.shuffle.parallelism", "10").
> option("hoodie.bulkinsert.shuffle.parallelism", "10").
> option("hoodie.datasource.write.recordkey.field", "name").
> option("hoodie.datasource.write.partitionpath.field", "location").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.keep.max.commits", "5").
> option("hoodie.keep.min.commits", "4").
> option("hoodie.cleaner.commits.retained", "3").
> option("hoodie.table.name", "hudi_mor_table").
> mode(Append).
> save("file:///tmp/hudi_mor_table")
> {code}
> *4, Repeat the Append Data operation (above) about six times, and you will get the following stack trace*
> {code:java}
> 19/12/23 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, .commit file: 20191224004558.clean.requested
> java.io.IOException: Not an Avro data file
> at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> at org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
> at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
> at org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
> at org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at
>