[GitHub] [incubator-hudi] yanghua merged pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
yanghua merged pull request #1263: [MINOR] Update the javadoc of 
HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (924bf51 -> b6e2993)

2020-01-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 924bf51  [MINOR] Download KEYS file when validating release candidate 
(#1259)
 add b6e2993  [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles 
(#1263)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/common/table/HoodieTableMetaClient.java| 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)



[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1264: [HUDI-560] Remove legacy IdentityTransformer

2020-01-20 Thread GitBox
wangxianghu opened a new pull request #1264: [HUDI-560] Remove legacy 
IdentityTransformer
URL: https://github.com/apache/incubator-hudi/pull/1264
 
 
   ## What is the purpose of the pull request
   
   *Remove legacy IdentityTransformer*
   
   ## Brief change log
   
   *Remove legacy IdentityTransformer*
   
   ## Verify this pull request
   
   This pull request is a code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-560:

Labels: pull-request-available  (was: )

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
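[For context: a pass-through transformer simply returns its input. Below is a
minimal sketch of what such an {{IdentityTransformer}} amounts to, assuming the
Transformer interface shape of the hudi-utilities module at the time; the package
and import paths here are illustrative, not the exact legacy source.]

```java
import org.apache.hudi.common.util.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**
 * Sketch of a pass-through transformer: it returns the input rows unchanged,
 * which is why the legacy class adds nothing over omitting the transformer.
 */
public class IdentityTransformer implements Transformer {
  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    return rowDataset; // no-op: identity on the incoming dataset
  }
}
```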


[GitHub] [incubator-hudi] wangxianghu commented on issue #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
wangxianghu commented on issue #1263: [MINOR] Update the javadoc of 
HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#issuecomment-576540719
 
 
   @yanghua OK, thanks.




[jira] [Commented] (HUDI-561) hudi partition path config

2020-01-20 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019886#comment-17019886
 ] 

liujinhui commented on HUDI-561:


There is also a problem with using the transformer: it modifies the data 
written into Hudi, which is not very user-friendly.

> hudi partition path config
> --
>
> Key: HUDI-561
> URL: https://issues.apache.org/jira/browse/HUDI-561
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current hudi partition is in accordance with 
> hoodie.datasource.write.partitionpath.field = keyname
> example:
> keyname 2019/12/20
> Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
> yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
> So I want to add configuration:
> hoodie.datasource.write.partitionpath.source.format = yyyy-MM-dd HH:mm:ss
> hoodie.datasource.write.partitionpath.target.format = yyyy/MM/dd



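[To illustrate the proposed source/target format pair: the conversion boils down
to parsing the partition field with the source pattern and re-rendering it with
the target pattern. A minimal sketch using java.time follows; the helper class
and method names are hypothetical, not part of Hudi.]

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class PartitionPathFormatter {

  // Corresponds to the proposed hoodie.datasource.write.partitionpath.source.format
  private static final DateTimeFormatter SOURCE =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

  // Corresponds to the proposed hoodie.datasource.write.partitionpath.target.format
  private static final DateTimeFormatter TARGET =
      DateTimeFormatter.ofPattern("yyyy/MM/dd");

  /** Re-renders a timestamp field value as a partition path. */
  public static String toPartitionPath(String fieldValue) {
    return LocalDateTime.parse(fieldValue, SOURCE).format(TARGET);
  }

  public static void main(String[] args) {
    // "2019-12-20 10:15:30" -> "2019/12/20"
    System.out.println(toPartitionPath("2019-12-20 10:15:30"));
  }
}
```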


[GitHub] [incubator-hudi] lamber-ken closed pull request #1245: [MINOR] Replace Collection.size > 0 with Collection.isEmpty()

2020-01-20 Thread GitBox
lamber-ken closed pull request #1245: [MINOR] Replace Collection.size > 0 with 
Collection.isEmpty()
URL: https://github.com/apache/incubator-hudi/pull/1245
 
 
   


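[For context, the refactor named in this PR title replaces size-based emptiness
checks with the more idiomatic Collection#isEmpty(). A representative
before/after sketch, using a hypothetical `records` list rather than the PR's
actual diff:]

```java
// Before: emptiness expressed through a size comparison
if (records.size() > 0) {
  process(records);
}

// After: clearer intent, and O(1) for every collection type
if (!records.isEmpty()) {
  process(records);
}
```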


[GitHub] [incubator-hudi] wangxianghu commented on a change in pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
wangxianghu commented on a change in pull request #1263: [MINOR] Update the 
javadoc of HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#discussion_r368829172
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -381,7 +381,15 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
 return metaClient;
   }
 
-  // HELPER METHODS TO CREATE META FILE NAMES
+  /**
+   * Helper method to scan all hoodie-instant metafiles.
+   *
+   * @param fs Fs implementation for this table
+   * @param metaPath MetaPath where meta files are stored
+   * @param nameFilter NameFilter to filter meta files
 
 Review comment:
   @yanghua, thanks for your suggestion, I'll fix it.




[jira] [Updated] (HUDI-561) hudi partition path config

2020-01-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-561:
--
Description: 
The current hudi partition is in accordance with 
hoodie.datasource.write.partitionpath.field = keyname


example:

keyname 2019/12/20

Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
So I want to add configuration:

hoodie.datasource.write.partitionpath.source.format = yyyy-MM-dd HH:mm:ss
hoodie.datasource.write.partitionpath.target.format = yyyy/MM/dd

  was:
The current hudi partition is in accordance with 
hoodie.datasource.write.partitionpath.field = keyname


example:

keyname 2019/12/20

Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
So I want to add configuration:

hoodie.datasource.write.partitionpath.source.formate = yyyy-MM-dd HH:mm:ss
hoodie.datasource.write.partitionpath.target.formate = yyyy/MM/dd


> hudi partition path config
> --
>
> Key: HUDI-561
> URL: https://issues.apache.org/jira/browse/HUDI-561
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current hudi partition is in accordance with 
> hoodie.datasource.write.partitionpath.field = keyname
> example:
> keyname 2019/12/20
> Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
> yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
> So I want to add configuration:
> hoodie.datasource.write.partitionpath.source.format = yyyy-MM-dd HH:mm:ss
> hoodie.datasource.write.partitionpath.target.format = yyyy/MM/dd





[jira] [Commented] (HUDI-561) hudi partition path config

2020-01-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019882#comment-17019882
 ] 

vinoyang commented on HUDI-561:
---

[~liujinhui] Sounds reasonable. Although we can use a customized {{Transformer}} 
to do the conversion, it is still not very convenient. With a format config 
option, it would be easier for us to handle different types of dates and times. 

> hudi partition path config
> --
>
> Key: HUDI-561
> URL: https://issues.apache.org/jira/browse/HUDI-561
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current hudi partition is in accordance with 
> hoodie.datasource.write.partitionpath.field = keyname
> example:
> keyname 2019/12/20
> Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
> yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
> So I want to add configuration:
> hoodie.datasource.write.partitionpath.source.formate = yyyy-MM-dd HH:mm:ss
> hoodie.datasource.write.partitionpath.target.formate = yyyy/MM/dd





[GitHub] [incubator-hudi] wangxianghu commented on a change in pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
wangxianghu commented on a change in pull request #1263: [MINOR] Update the 
javadoc of HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#discussion_r368829172
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -381,7 +381,15 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
 return metaClient;
   }
 
-  // HELPER METHODS TO CREATE META FILE NAMES
+  /**
+   * Helper method to scan all hoodie-instant metafiles.
+   *
+   * @param fs Fs implementation for this table
+   * @param metaPath MetaPath where meta files are stored
+   * @param nameFilter NameFilter to filter meta files
 
 Review comment:
   @yanghua, thanks for your suggestion, I'll fix it.




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
yanghua commented on a change in pull request #1263: [MINOR] Update the javadoc 
of HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#discussion_r368827950
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -381,7 +381,15 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
 return metaClient;
   }
 
-  // HELPER METHODS TO CREATE META FILE NAMES
+  /**
+   * Helper method to scan all hoodie-instant metafiles.
+   *
+   * @param fs Fs implementation for this table
+   * @param metaPath MetaPath where meta files are stored
+   * @param nameFilter NameFilter to filter meta files
 
 Review comment:
   I'd suggest changing `Fs`, `MetaPath` and `NameFilter` to `The file system`, 
`The meta path` and `The name filter`
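
   Applied to the snippet above, the javadoc might then read as follows (a
sketch of the suggested wording, not the merged text):

```java
/**
 * Helper method to scan all hoodie-instant metafiles.
 *
 * @param fs The file system implementation for this table
 * @param metaPath The meta path where meta files are stored
 * @param nameFilter The name filter to filter meta files
 */
```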




[jira] [Updated] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-560:
-
Issue Type: Improvement  (was: Wish)

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?





[jira] [Updated] (HUDI-532) Add java doc for hudi test suite test classes

2020-01-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-532:
-
Status: Open  (was: New)

> Add java doc for hudi test suite test classes
> -
>
> Key: HUDI-532
> URL: https://issues.apache.org/jira/browse/HUDI-532
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, the test classes (under the test/java dir) have no javadocs. We 
> should add docs for those classes.





[jira] [Updated] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-560:
-
Status: In Progress  (was: Open)

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?





[jira] [Updated] (HUDI-531) Add java doc for hudi test suite general classes

2020-01-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-531:
-
Status: Open  (was: New)

> Add java doc for hudi test suite general classes
> 
>
> Key: HUDI-531
> URL: https://issues.apache.org/jira/browse/HUDI-531
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, the general classes (under the src/main dir) have no javadocs. We 
> should add docs for those classes.





[jira] [Updated] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-560:
-
Status: Open  (was: New)

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?





[jira] [Commented] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019871#comment-17019871
 ] 

wangxianghu commented on HUDI-560:
--

[~yanghua] Copy that.

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?





[GitHub] [incubator-hudi] wangxianghu commented on a change in pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
wangxianghu commented on a change in pull request #1263: [MINOR] Update the 
javadoc of HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#discussion_r368824428
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -381,7 +381,15 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
 return metaClient;
   }
 
-  // HELPER METHODS TO CREATE META FILE NAMES
+  /**
+   * Helper methods to create meta file names.
 
 Review comment:
   @yanghua  PTAL




[GitHub] [incubator-hudi] dengziming removed a comment on issue #1151: [WIP] [HUDI-476] Add hudi-examples module

2020-01-20 Thread GitBox
dengziming removed a comment on issue #1151: [WIP] [HUDI-476] Add hudi-examples 
module
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-576130421
 
 
   Hi @vinothchandar, I have added the DeltaStreamExample.
   I ran `mvn test -B` successfully locally, but the Travis CI build failed 
with:
   ```
   [ERROR] Failed to execute goal on project hudi-examples: Could not resolve 
dependencies for project org.apache.hudi:hudi-examples:jar:0.5.1-SNAPSHOT: The 
following artifacts could not be resolved: 
org.apache.hudi:hudi-utilities:jar:0.5.1-SNAPSHOT, 
org.apache.hudi:hudi-spark:jar:0.5.1-SNAPSHOT: Failure to find 
org.apache.hudi:hudi-utilities:jar:0.5.1-SNAPSHOT in 
https://oss.sonatype.org/content/repositories/snapshots/ was cached in the 
local repository, resolution will not be reattempted until the update interval 
of sonatype-snapshots has elapsed or updates are forced -> [Help 1]
   ```
   I searched for this error and found it could be solved by deleting the files 
cached in the local repository, but I don't have the privilege there. Could you 
help me solve this problem?
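
   [For readers hitting the same error: Maven caches the failed snapshot lookup
in the local repository, so forcing a snapshot update or purging the cached
entries usually clears it. Two standard Maven options, not specific to this CI
setup:]

```sh
# Force re-checking of snapshot artifacts despite the cached failure
mvn -U clean install

# Or purge the cached org.apache.hudi entries from the local repository
mvn dependency:purge-local-repository -DmanualInclude=org.apache.hudi
```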




[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1151: [WIP] [HUDI-476] Add hudi-examples module

2020-01-20 Thread GitBox
pratyakshsharma commented on a change in pull request #1151: [WIP] [HUDI-476] 
Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#discussion_r368608157
 
 

 ##
 File path: 
hudi-examples/src/main/java/org/apache/hudi/examples/common/HoodieExampleDataGenerator.java
 ##
 @@ -0,0 +1,216 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.examples.common;
+
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.HoodieAvroUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import java.util.stream.Stream;
+
+
+/**
+ * Class to be used to generate test data.
+ */
+public class HoodieExampleDataGenerator<T extends HoodieRecordPayload<T>> {
 
 Review comment:
   Can you see if we can do away with duplicate code here by extending 
HoodieTestDataGenerator?
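
   [Illustrating the suggestion: if the example generator extended the existing
test generator, the duplicated generation logic could collapse to overrides
only. A rough sketch; the superclass constructor and overridable members are
assumptions, not taken from the Hudi source.]

```java
/**
 * Sketch: inherit record/key generation from the existing test data generator
 * and override only the example-specific pieces (e.g. the Avro schema used).
 */
public class HoodieExampleDataGenerator extends HoodieTestDataGenerator {
  public HoodieExampleDataGenerator(String[] partitionPaths) {
    super(partitionPaths); // assumed constructor, shown for shape only
  }
}
```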




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1260: [WIP] [HUDI-510] Update 
site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368818803
 
 

 ##
 File path: docs/_docs/1_3_use_cases.md
 ##
 @@ -20,7 +20,7 @@ or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
 It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org) , Hudi 
helps __enforces a minimum file size on HDFS__, which improves NameNode health 
by solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
 
 Review comment:
   Good catch, `https://kafka.apache.org` is better.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368818614
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
 
 Review comment:
   > Thanks @lamber-ken, I will be adding a follow-up PR to add more details 
around compaction and deltastreamer. I will address these comments as part of 
the follow-up PR.
   
   You're welcome  




[GitHub] [incubator-hudi] leesf commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
leesf commented on issue #1260: [WIP] [HUDI-510] Update site documentation in 
sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576524810
 
 
   > @leesf / @yanghua can you please help review this PR. Also, this might be 
needed in the corresponding cn pages too. Need your help there as well. Thanks!
   
   Hi @bhasudha. Please go ahead; we will make a follow-up PR for the cn pages.




[incubator-hudi] branch asf-site updated: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 39db1ae  [HUDI-403] Adds guidelines on deployment/upgrading
39db1ae is described below

commit 39db1aedbb9cb4533c58f869d5a940fbc1a3e5d2
Author: vinothchandar 
AuthorDate: Mon Jan 20 18:01:19 2020 -0800

[HUDI-403] Adds guidelines on deployment/upgrading

 - Moved "Adminsitering" page to "Deployment"
 - Still need to add information on deltastreamer modes/compaction
---
 docs/_data/navigation.yml  |   6 +-
 docs/_docs/2_2_writing_data.md |   2 +-
 ...{2_6_admin_guide.cn.md => 2_6_deployment.cn.md} |   2 +-
 .../{2_6_admin_guide.md => 2_6_deployment.md}  | 173 -
 4 files changed, 107 insertions(+), 76 deletions(-)

diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
index d2826dd..a3510f4 100644
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -38,8 +38,8 @@ docs:
 url: /docs/configurations.html
   - title: "Performance"
 url: /docs/performance.html
-  - title: "Administering"
-url: /docs/admin_guide.html
+  - title: "Deployment"
+url: /docs/deployment.html
   - title: INFO
 children:
   - title: "Docs Versions"
@@ -86,7 +86,7 @@ cn_docs:
   - title: "性能"
 url: /cn/docs/performance.html
   - title: "管理"
-url: /cn/docs/admin_guide.html
+url: /cn/docs/deployment.html
   - title: 其他信息
 children:
   - title: "文档版本"
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index c0f5184..832daa6 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -8,7 +8,7 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
 In this section, we will cover ways to ingest new changes from external 
sources or even other Hudi datasets using the [DeltaStreamer](#deltastreamer) 
tool, as well as 
-speeding up large Spark jobs via upserts using the [Hudi 
datasource](#datasource-writer). Such datasets can then be 
[queried](querying_data.html) using various query engines.
+speeding up large Spark jobs via upserts using the [Hudi 
datasource](#datasource-writer). Such datasets can then be 
[queried](/docs/querying_data.html) using various query engines.
 
 
 ## Write Operations
diff --git a/docs/_docs/2_6_admin_guide.cn.md b/docs/_docs/2_6_deployment.cn.md
similarity index 99%
rename from docs/_docs/2_6_admin_guide.cn.md
rename to docs/_docs/2_6_deployment.cn.md
index ce055f0..c555b54 100644
--- a/docs/_docs/2_6_admin_guide.cn.md
+++ b/docs/_docs/2_6_deployment.cn.md
@@ -1,7 +1,7 @@
 ---
 title: 管理 Hudi Pipelines
 keywords: hudi, administration, operation, devops
-permalink: /cn/docs/admin_guide.html
+permalink: /cn/docs/deployment.html
 summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
diff --git a/docs/_docs/2_6_admin_guide.md b/docs/_docs/2_6_deployment.md
similarity index 73%
rename from docs/_docs/2_6_admin_guide.md
rename to docs/_docs/2_6_deployment.md
index 6990f50..295f8e8 100644
--- a/docs/_docs/2_6_admin_guide.md
+++ b/docs/_docs/2_6_deployment.md
@@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
+ - [Migrating to Hudi](#migrating) : How to migrate your existing tables to 
Apache Hudi.
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or 
deeper introspection
+ - [Monitoring](#monitoring) : Tracking metrics from your hudi tables using 
popular tools.
+ - [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving 
issues in production.
+ 
+## Deploying
 
-This section provides a glimpse into each of these, with some general guidance 
on 

[GitHub] [incubator-hudi] bvaradar merged pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
bvaradar merged pull request #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261
 
 
   




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
bvaradar commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368812781
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
 
 Review comment:
   Thanks @lamber-ken, I will be adding a follow-up PR to add more details 
around compaction and deltastreamer. I will address these comments as part of 
the follow-up PR.




[incubator-hudi] branch asf-site updated (4c3cf71 -> 5a23459)

2020-01-20 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 4c3cf71  [HUDI-557] Additional work for supporting multiple version 
docs (#1250)
 add 5a23459  [MINOR] Upper toc_label (#1262)

No new revisions were added by this update.

Summary of changes:
 docs/_data/ui-text.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1262: [MINOR] Upper toc_label

2020-01-20 Thread GitBox
vinothchandar merged pull request #1262: [MINOR] Upper toc_label
URL: https://github.com/apache/incubator-hudi/pull/1262
 
 
   




[jira] [Commented] (HUDI-561) hudi partition path config

2020-01-20 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019841#comment-17019841
 ] 

liujinhui commented on HUDI-561:


[~yanghua] [~vinoth]

> hudi partition path config
> --
>
> Key: HUDI-561
> URL: https://issues.apache.org/jira/browse/HUDI-561
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current hudi partition is in accordance with 
> hoodie.datasource.write.partitionpath.field = keyname
> example:
> keyname 2019/12/20
> Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
> yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
> So I want to add configuration:
> hoodie.datasource.write.partitionpath.source.formate = yyyy-MM-dd HH:mm:ss
> hoodie.datasource.write.partitionpath.target.formate = yyyy/MM/dd





[jira] [Created] (HUDI-561) hudi partition path config

2020-01-20 Thread liujinhui (Jira)
liujinhui created HUDI-561:
--

 Summary: hudi partition path config
 Key: HUDI-561
 URL: https://issues.apache.org/jira/browse/HUDI-561
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: liujinhui
Assignee: liujinhui
 Fix For: 0.5.1


The current hudi partition is in accordance with 
hoodie.datasource.write.partitionpath.field = keyname


example:

keyname 2019/12/20

Usually the time format may be yyyy-MM-dd HH:mm:ss or another format;
yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
So I want to add configuration:

hoodie.datasource.write.partitionpath.source.formate = yyyy-MM-dd HH:mm:ss
hoodie.datasource.write.partitionpath.target.formate = yyyy/MM/dd





[GitHub] [incubator-hudi] wangxianghu commented on a change in pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
wangxianghu commented on a change in pull request #1263: [MINOR] Update the 
javadoc of HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#discussion_r368800615
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -381,7 +381,15 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
 return metaClient;
   }
 
-  // HELPER METHODS TO CREATE META FILE NAMES
+  /**
+   * Helper methods to create meta file names.
 
 Review comment:
   @yanghua Ok




[jira] [Commented] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019839#comment-17019839
 ] 

vinoyang commented on HUDI-560:
---

[~vbalaji] OK, thanks. Now [~wangxianghu] you can start this work. Please go 
ahead.

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?





[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
yanghua commented on a change in pull request #1263: [MINOR] Update the javadoc 
of HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263#discussion_r368799864
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -381,7 +381,15 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
 return metaClient;
   }
 
-  // HELPER METHODS TO CREATE META FILE NAMES
+  /**
+   * Helper methods to create meta file names.
 
 Review comment:
   `methods` -> `method`. IMO, the original comment is incorrect. You should 
correct the java doc based on the logic of this method.




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #165

2020-01-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.04 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4:
bin
boot
conf
lib
LICENSE
NOTICE
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark_2.11[jar]
[INFO] hudi-utilities_2.11[jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle_2.11 [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle_2.11

[jira] [Commented] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019838#comment-17019838
 ] 

Balaji Varadarajan commented on HUDI-560:
-

[~yanghua] : This is not needed. Please go ahead and remove it. I had earlier 
planned to use it in unit tests but ended up writing a different transformer. 

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} is not used anywhere in the Hudi
> codebase, and it seems to be just a pass-through transformer. Can we
> remove it?





[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1263: [MINOR] Update the javadoc of HoodieTableMetaClient#scanFiles

2020-01-20 Thread GitBox
wangxianghu opened a new pull request #1263: [MINOR] Update the javadoc of 
HoodieTableMetaClient#scanFiles
URL: https://github.com/apache/incubator-hudi/pull/1263
 
 
   ## What is the purpose of the pull request
   
   *Update the javadoc of HoodieTableMetaClient#scanFiles*
   
   ## Brief change log
   
   *Update the javadoc of HoodieTableMetaClient#scanFiles*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368786695
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
+ - [Migrating to Hudi](#migrating) : How to migrate your existing tables to 
Apache Hudi.
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or 
deeper introspection
+ - [Monitoring](#monitoring) : Tracking metrics from your hudi tables using 
popular tools.
+ - [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving 
issues in production.
+ 
+## Deploying
 
-This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
+All in all, Hudi deploys with no long running servers or additional 
infrastructure cost to your data lake. In fact, Hudi pioneered this model of 
building a transactional distributed storage layer
+using existing infrastructure and its heartening to see other systems adopting 
similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer 
or custom Spark datasource jobs), deployed per standard Apache Spark 
[recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache 
Spark or Presto and hence no additional infrastructure is necessary. 
 
-## Admin CLI
 
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
-A hudi dataset resides on DFS, in a location referred to as the **basePath** 
and we would need this location in order to connect to a Hudi dataset.
-Hudi library effectively manages this dataset internally, using .hoodie 
subfolder to track all metadata
+## Upgrading 
+
+New Hudi releases are listed on the [releases page](/releases), with detailed 
notes which list all the changes, with highlights in each release. 
+At the end of the day, Hudi is a storage system and with that comes a lot of 
responsibilities, which we take seriously. 
+
+As general guidelines, 
+
+ - We strive to keep all changes backwards compatible (i.e. new code can read 
old data/timeline files), and when we cannot, we will provide upgrade/downgrade 
tools via the CLI
+ - We cannot always guarantee forward compatibility (i.e old code being able 
to read data/timeline files written by a greater version). This is generally 
the norm, since no new features can be built otherwise.
+   However any large such changes, will be turned off by default, for smooth 
transition to newer release. After a few releases and once enough users deem 
the feature stable in production, we will flip the defaults in a subsequent 
release.
+ - Always upgrade the query bundles (mr-bundle, presto-bundle, spark-bundle) 
first and then upgrade the writers (deltastreamer, spark jobs using 
datasource). This often provides the best experience and it's easy to fix 
+   any issues by rolling forward/back the writer code (which typically you 
might have more control over)
+ - With large, feature rich releases we recommend migrating slowly, by first 
testing in staging environments and running your own tests. Upgrading Hudi is 
no different than upgrading any database system.
+
+Note that release notes can override this information with specific 
instructions, applicable on case-by-case basis.
+
+## Migrating
+
+Currently migrating to Hudi can be done using two approaches 
 
 Review comment:
   Hi, missing `.` at the end of the statement.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on issue #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#issuecomment-576500540
 
 
   Hello, I synced the changes to https://lamber-ken.github.io/docs/deployment.html 
, which should help with reviewing.




[GitHub] [incubator-hudi] lamber-ken removed a comment on issue #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken removed a comment on issue #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#issuecomment-576493665
 
 
   Hello, I synced the changes to https://lamber-ken.github.io/docs/deployment.html 
, which should help with reviewing.




[jira] [Comment Edited] (HUDI-84) Benchmark write/read paths on Hudi vs non-Hudi datasets

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-84?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019830#comment-17019830
 ] 

Vinoth Chandar edited comment on HUDI-84 at 1/21/20 3:16 AM:
-

Simple command to reproduce the Spark df/rdd conversion issue:

 cc [~uditme] [~nishith29] [~vbalaji], if we can find some simple solution for 
this, that would be great. This affects only the datasource write path 
(deltastreamer/rdd are fine).
{code:java}
val df = spark.read.parquet("file:///tmp/hudi-benchmark/input/*/*.parquet")// 
some input data
df.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")

val schema = df.schema
val encoder = 
org.apache.spark.sql.catalyst.encoders.RowEncoder.apply(schema).resolveAndBind()
val df2 = spark.createDataFrame(df.queryExecution.toRdd.map(encoder.fromRow), 
schema)
df2.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")
 {code}


was (Author: vc):
Simple command to reproduce the Spark df/rdd conversion issue:

 
{code:java}
val df = spark.read.parquet("file:///tmp/hudi-benchmark/input/*/*.parquet")// 
some input data
df.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")

val schema = df.schema
val encoder = 
org.apache.spark.sql.catalyst.encoders.RowEncoder.apply(schema).resolveAndBind()
val df2 = spark.createDataFrame(df.queryExecution.toRdd.map(encoder.fromRow), 
schema)
df2.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")
 {code}

> Benchmark write/read paths on Hudi vs non-Hudi datasets
> ---
>
> Key: HUDI-84
> URL: https://issues.apache.org/jira/browse/HUDI-84
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Attachments: df-toRdd-write.pdf, df-write-stage.pdf
>
>
> * Index performance
>  * SparkSQL 
> (https://github.com/apache/incubator-hudi/issues/588#issuecomment-468055059)
>  * Query planning Planning 
>  * Bulk_insert, log ingest
>  * upsert, database change log. 
>  
>  





[jira] [Comment Edited] (HUDI-84) Benchmark write/read paths on Hudi vs non-Hudi datasets

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-84?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019830#comment-17019830
 ] 

Vinoth Chandar edited comment on HUDI-84 at 1/21/20 3:16 AM:
-

Simple command to reproduce the spark df/rdd conversion issue 

 cc [~uditme] [~nishith29] [~vbalaji] if we can find some simple solution for 
this, that would be great. This affects only the datasource write path 
(deltastreamer/rdd are fine). Uploaded files with the stage UI; you can see 
the additional compute overhead.
{code:java}
val df = spark.read.parquet("file:///tmp/hudi-benchmark/input/*/*.parquet")// 
some input data
df.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")

val schema = df.schema
val encoder = 
org.apache.spark.sql.catalyst.encoders.RowEncoder.apply(schema).resolveAndBind()
val df2 = spark.createDataFrame(df.queryExecution.toRdd.map(encoder.fromRow), 
schema)
df2.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")
 {code}


was (Author: vc):
Simple command to reproduce the spark df/rdd conversion issue 

 cc [~uditme] [~nishith29] [~vbalaji] if we can find some simple solution for 
this, that would be great. this affects only the datasource write path 
(deltastreamer/rdd are fine) 
{code:java}
val df = spark.read.parquet("file:///tmp/hudi-benchmark/input/*/*.parquet")// 
some input data
df.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")

val schema = df.schema
val encoder = 
org.apache.spark.sql.catalyst.encoders.RowEncoder.apply(schema).resolveAndBind()
val df2 = spark.createDataFrame(df.queryExecution.toRdd.map(encoder.fromRow), 
schema)
df2.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")
 {code}

> Benchmark write/read paths on Hudi vs non-Hudi datasets
> ---
>
> Key: HUDI-84
> URL: https://issues.apache.org/jira/browse/HUDI-84
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Attachments: df-toRdd-write.pdf, df-write-stage.pdf
>
>
> * Index performance
>  * SparkSQL 
> (https://github.com/apache/incubator-hudi/issues/588#issuecomment-468055059)
>  * Query planning Planning 
>  * Bulk_insert, log ingest
>  * upsert, database change log. 
>  
>  





[jira] [Updated] (HUDI-84) Benchmark write/read paths on Hudi vs non-Hudi datasets

2020-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-84?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-84:
---
Attachment: df-write-stage.pdf
df-toRdd-write.pdf

> Benchmark write/read paths on Hudi vs non-Hudi datasets
> ---
>
> Key: HUDI-84
> URL: https://issues.apache.org/jira/browse/HUDI-84
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Attachments: df-toRdd-write.pdf, df-write-stage.pdf
>
>
> * Index performance
>  * SparkSQL 
> (https://github.com/apache/incubator-hudi/issues/588#issuecomment-468055059)
>  * Query planning Planning 
>  * Bulk_insert, log ingest
>  * upsert, database change log. 
>  
>  





[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1262: [MINOR] Upper toc_label

2020-01-20 Thread GitBox
lamber-ken opened a new pull request #1262: [MINOR] Upper toc_label
URL: https://github.com/apache/incubator-hudi/pull/1262
 
 
   ## What is the purpose of the pull request
   
   Uppercase the toc_label.
   
   ## Brief change log
   
 - *Uppercase the toc_label.*
   
   ## Verify this pull request
   
   This pull request is a rework of the web docs.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   
   
   
![image](https://user-images.githubusercontent.com/20113411/72772586-41a28e80-3c3f-11ea-850b-4ae53324b541.png)
   
   
   
   




[jira] [Commented] (HUDI-84) Benchmark write/read paths on Hudi vs non-Hudi datasets

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-84?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019830#comment-17019830
 ] 

Vinoth Chandar commented on HUDI-84:


Simple command to reproduce the spark df/rdd conversion issue 

 
{code:java}
val df = spark.read.parquet("file:///tmp/hudi-benchmark/input/*/*.parquet")// 
some input data
df.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")

val schema = df.schema
val encoder = 
org.apache.spark.sql.catalyst.encoders.RowEncoder.apply(schema).resolveAndBind()
val df2 = spark.createDataFrame(df.queryExecution.toRdd.map(encoder.fromRow), 
schema)
df2.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-write")
 {code}
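
A minimal timing harness (a sketch; it assumes the {{df}} and {{df2}} defined in the snippet above, and the output paths are placeholders) to quantify the cost of the round trip:
{code:java}
// Time a block and report elapsed seconds.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

timed("direct df write") {
  df.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-direct")
}
timed("df -> rdd -> df round-trip write") {
  df2.write.format("parquet").mode("overwrite").save("file:///tmp/parquet-roundtrip")
}
{code}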

> Benchmark write/read paths on Hudi vs non-Hudi datasets
> ---
>
> Key: HUDI-84
> URL: https://issues.apache.org/jira/browse/HUDI-84
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Attachments: df-toRdd-write.pdf, df-write-stage.pdf
>
>
> * Index performance
>  * SparkSQL 
> (https://github.com/apache/incubator-hudi/issues/588#issuecomment-468055059)
>  * Query planning Planning 
>  * Bulk_insert, log ingest
>  * upsert, database change log. 
>  
>  





[GitHub] [incubator-hudi] dengziming commented on a change in pull request #1151: [WIP] [HUDI-476] Add hudi-examples module

2020-01-20 Thread GitBox
dengziming commented on a change in pull request #1151: [WIP] [HUDI-476] Add 
hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#discussion_r368794985
 
 

 ##
 File path: 
hudi-examples/src/main/java/org/apache/hudi/examples/spark/HoodieWriteClientExample.java
 ##
 @@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.examples.spark;
+
+import org.apache.hudi.HoodieWriteClient;
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.examples.common.HoodieExampleDataGenerator;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+
+/**
+ * Simple examples of #{@link HoodieWriteClient}.
+ *
+ * To run this example, you should
+ *   1. For running in IDE, set VM options `-Dspark.master=local[2]`
+ *   2. For running in shell, using `spark-submit`
+ *
+ * Usage: HoodieWriteClientExample <tablePath> <tableName>
+ * <tablePath> and <tableName> describe root path of hudi and table name
+ * for example, `HoodieWriteClientExample file:///tmp/hoodie/sample-table 
hoodie_rt`
+ */
+public class HoodieWriteClientExample {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieWriteClientExample.class);
+
+  private static String tableType = HoodieTableType.COPY_ON_WRITE.name();
+
+  public static void main(String[] args) throws Exception {
+if (args.length < 2) {
+  System.err.println("Usage: HoodieWriteClientExample <tablePath> <tableName>");
+  System.exit(1);
+}
+String tablePath = args[0];
+String tableName = args[1];
+SparkConf sparkConf = new SparkConf().setAppName("hoodie-client-example");
+sparkConf.set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer");
+sparkConf.set("spark.kryoserializer.buffer.max", "512m");
+sparkConf.set("spark.some.config.option", "some-value");
 
 Review comment:
   You are right, thank you; fixed.




[jira] [Assigned] (HUDI-84) Benchmark write/read paths on Hudi vs non-Hudi datasets

2020-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-84?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-84:
--

Assignee: Vinoth Chandar  (was: Ethan Guo)

> Benchmark write/read paths on Hudi vs non-Hudi datasets
> ---
>
> Key: HUDI-84
> URL: https://issues.apache.org/jira/browse/HUDI-84
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
>
> * Index performance
>  * SparkSQL 
> (https://github.com/apache/incubator-hudi/issues/588#issuecomment-468055059)
>  * Query planning Planning 
>  * Bulk_insert, log ingest
>  * upsert, database change log. 
>  
>  





[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019821#comment-17019821
 ] 

vinoyang commented on HUDI-538:
---

bq. A good analogy is to think of the first integration with Flink as similar 
to hudi-spark: you can write Spark programs consuming any Spark datasource 
today and write out Hudi datasets, without using DeltaStreamer, right? 

Right. OK, let's discuss {{DeltaStreamer}} after we have the {{hudi-flink}} module.

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.





[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368789255
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
 
 Review comment:
   Hi, missing `.` at the end of the statement.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368789212
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
 
 Review comment:
   Hi, missing `.` at the end of the statement.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368788488
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
+ - [Migrating to Hudi](#migrating) : How to migrate your existing tables to 
Apache Hudi.
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or 
deeper introspection
+ - [Monitoring](#monitoring) : Tracking metrics from your hudi tables using 
popular tools.
+ - [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving 
issues in production.
+ 
+## Deploying
 
-This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
+All in all, Hudi deploys with no long running servers or additional 
infrastructure cost to your data lake. In fact, Hudi pioneered this model of 
building a transactional distributed storage layer
+using existing infrastructure and it's heartening to see other systems adopting 
similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer 
or custom Spark datasource jobs), deployed per standard Apache Spark 
[recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache 
Spark or Presto and hence no additional infrastructure is necessary. 
 
-## Admin CLI
 
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
-A hudi dataset resides on DFS, in a location referred to as the **basePath** 
and we would need this location in order to connect to a Hudi dataset.
-Hudi library effectively manages this dataset internally, using .hoodie 
subfolder to track all metadata
+## Upgrading 
+
+New Hudi releases are listed on the [releases page](/releases), with detailed 
notes which list all the changes, with highlights in each release. 
+At the end of the day, Hudi is a storage system and with that comes a lot of 
responsibilities, which we take seriously. 
+
+As general guidelines, 
+
+ - We strive to keep all changes backwards compatible (i.e. new code can read 
old data/timeline files), and when we cannot, we will provide upgrade/downgrade 
tools via the CLI.
+ - We cannot always guarantee forward compatibility (i.e. old code being able 
to read data/timeline files written by a greater version). This is generally 
the norm, since no new features can be built otherwise.
+   However, any such large changes will be turned off by default, for a smooth 
transition to the newer release. After a few releases and once enough users deem 
the feature stable in production, we will flip the defaults in a subsequent 
release.
+ - Always upgrade the query bundles (mr-bundle, presto-bundle, spark-bundle) 
first and then upgrade the writers (deltastreamer, spark jobs using 
datasource). This often provides the best experience and makes it easy to fix 
+   any issues by rolling forward/back the writer code (which typically you 
might have more control over).
+ - With large, feature-rich releases we recommend migrating slowly, by first 
testing in staging environments and running your own tests. Upgrading Hudi is 
no different than upgrading any database system.
+
+Note that release notes can override this information with specific 
instructions, applicable on a case-by-case basis.
+
+## Migrating
+
+Currently migrating to Hudi can be done using two approaches 
+
+- **Convert newer partitions to Hudi** : This model is suitable for large 
event tables (e.g: click streams, ad impressions), which also typically receive 
writes for the last few days alone. You can convert the last 
+   N partitions to Hudi and proceed writing as if it were a Hudi table to 
begin with. The Hudi query side code is able to correctly handle both hudi and 
non-hudi data partitions.
+- **Full conversion 

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken edited a comment on issue #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#issuecomment-576493665
 
 
   Hello, I synced the changes to https://lamber-ken.github.io/docs/deployment.html, 
which should be helpful for reviewing.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on issue #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#issuecomment-576493665
 
 
   Hello, I synced the changes to 
https://lamber-ken.github.io/docs/deployment.html, which should be helpful for reviewing.




[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019818#comment-17019818
 ] 

Vinoth Chandar commented on HUDI-538:
-

>Otherwise, where would the records to be written come from?

Flink should have existing sources already, e.g. Kafka, Pulsar; those would 
continue to work. A good analogy is to think of the first integration with 
Flink as similar to hudi-spark: you can write Spark programs consuming any 
Spark datasource today and write out Hudi datasets, without using 
DeltaStreamer, right? 
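
As an illustration, such a plain Spark program might look like the sketch below (the input path, field names and table name are placeholder assumptions, not from this thread):

{code:java}
// Consume an arbitrary Spark datasource and write out a Hudi dataset, no DeltaStreamer involved.
val inputDF = spark.read.json("file:///tmp/source-events")  // any Spark source works here
inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.table.name", "hudi_events")
  .mode("append")
  .save("file:///tmp/hudi_events")
{code}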

 

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.





[jira] [Commented] (HUDI-560) Remove legacy IdentityTransformer

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019816#comment-17019816
 ] 

Vinoth Chandar commented on HUDI-560:
-

[~vbalaji] may have more context on this actually 
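
For reference, a pass-through transformer is trivial to re-create against the hudi-utilities {{Transformer}} contract, so removing the unused class loses nothing. A sketch (rendered in Scala for brevity; the actual class is Java, and the imports assume the 0.5.x package layout):

{code:java}
import org.apache.hudi.common.util.TypedProperties
import org.apache.hudi.utilities.transform.Transformer
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{Dataset, Row, SparkSession}

class PassThroughTransformer extends Transformer {
  // Identity: hand the incoming rows back unchanged.
  override def apply(jsc: JavaSparkContext, sparkSession: SparkSession,
                     rowDataset: Dataset[Row], properties: TypedProperties): Dataset[Row] =
    rowDataset
}
{code}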

> Remove legacy IdentityTransformer
> -
>
> Key: HUDI-560
> URL: https://issues.apache.org/jira/browse/HUDI-560
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>
> Currently, {{IdentityTransformer}} has not been used anywhere in Hudi 
> codebase. And it seems it's just like a  pass-through transformer. Can we 
> remove it?





[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019814#comment-17019814
 ] 

vinoyang commented on HUDI-538:
---

bq. Initially, the DeltaStreamer will not work over Flink, and that may be 
okay? 

{{DeltaStreamer}} may not support Flink in the first phase. However, there 
should be some sources that support Flink (i.e., Flink connectors to upstream 
systems) to read records and then write to Hudi. Otherwise, where would the 
records to be written come from?

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.





[jira] [Comment Edited] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-20 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019814#comment-17019814
 ] 

vinoyang edited comment on HUDI-538 at 1/21/20 2:32 AM:


bq. Initially, the DeltaStreamer will not work over Flink, and that may be 
okay? 

[~vinoth] {{DeltaStreamer}} may not support Flink in the first phase. However, 
there should be some sources that support Flink (i.e., Flink connectors to 
upstream systems) to read records and then write to Hudi. Otherwise, where 
would the records to be written come from?


was (Author: yanghua):
bg. Initially, the DeltaStreamer will not work over Flink, and that may be 
okay? 

{{DeltaStreamer}} may not support Flink in the first phase. However, there 
should be some sources support Flink(let Flink connector with the upstream 
system) to read records then write to Hudi. Otherwise, where the records wait 
to be writen come from?

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.





[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368786695
 
 

 ##
 File path: docs/_docs/2_6_deployment.md
 ##
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed 
and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, 
guidelines and general best-practices
+ - [Migrating to Hudi](#migrating) : How to migrate your existing tables to 
Apache Hudi.
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or 
deeper introspection
+ - [Monitoring](#monitoring) : Tracking metrics from your hudi tables using 
popular tools.
+ - [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving 
issues in production.
+ 
+## Deploying
 
-This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
+All in all, Hudi deploys with no long running servers or additional 
infrastructure cost to your data lake. In fact, Hudi pioneered this model of 
building a transactional distributed storage layer
+using existing infrastructure and it's heartening to see other systems adopting 
similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer 
or custom Spark datasource jobs), deployed per standard Apache Spark 
[recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache 
Spark or Presto and hence no additional infrastructure is necessary. 
 
-## Admin CLI
 
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
-A hudi dataset resides on DFS, in a location referred to as the **basePath** 
and we would need this location in order to connect to a Hudi dataset.
-Hudi library effectively manages this dataset internally, using .hoodie 
subfolder to track all metadata
+## Upgrading 
+
+New Hudi releases are listed on the [releases page](/releases), with detailed 
notes which list all the changes, with highlights in each release. 
+At the end of the day, Hudi is a storage system and with that comes a lot of 
responsibilities, which we take seriously. 
+
+As general guidelines, 
+
+ - We strive to keep all changes backwards compatible (i.e. new code can read 
old data/timeline files), and when we cannot, we will provide upgrade/downgrade 
tools via the CLI.
+ - We cannot always guarantee forward compatibility (i.e. old code being able 
to read data/timeline files written by a greater version). This is generally 
the norm, since no new features can be built otherwise.
+   However, any such large changes will be turned off by default, for a smooth 
transition to the newer release. After a few releases and once enough users deem 
the feature stable in production, we will flip the defaults in a subsequent 
release.
+ - Always upgrade the query bundles (mr-bundle, presto-bundle, spark-bundle) 
first and then upgrade the writers (deltastreamer, spark jobs using 
datasource). This often provides the best experience and makes it easy to fix 
+   any issues by rolling forward/back the writer code (which typically you 
might have more control over).
+ - With large, feature-rich releases we recommend migrating slowly, by first 
testing in staging environments and running your own tests. Upgrading Hudi is 
no different than upgrading any database system.
+
+Note that release notes can override this information with specific 
instructions, applicable on a case-by-case basis.
+
+## Migrating
+
+Currently migrating to Hudi can be done using two approaches 
 
 Review comment:
   Hi, missing `.` at the end of the statement.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

2020-01-20 Thread GitBox
yihua commented on a change in pull request #1246: [HUDI-552] Fix the schema 
mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246#discussion_r368785363
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
 ##
 @@ -620,6 +636,62 @@ public void testDistributedTestDataSource() {
 Assert.assertEquals(1000, c);
   }
 
+  private static void prepareParquetDFSFiles(int numRecords) throws 
IOException {
+String path = PARQUET_SOURCE_ROOT + "/1.parquet";
+HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+Helpers.saveParquetToDFS(Helpers.toGenericRecords(
+dataGenerator.generateInserts("000", numRecords), dataGenerator), new 
Path(path));
+  }
+
+  private void prepareParquetDFSSource(boolean useSchemaProvider, boolean 
hasTransformer) throws IOException {
+// Properties used for testing delta-streamer with Parquet source
+TypedProperties parquetProps = new TypedProperties();
+parquetProps.setProperty("include", "base.properties");
+parquetProps.setProperty("hoodie.datasource.write.recordkey.field", 
"_row_key");
+parquetProps.setProperty("hoodie.datasource.write.partitionpath.field", 
"not_there");
+if (useSchemaProvider) {
+  
parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file",
 dfsBasePath + "/source.avsc");
+  if (hasTransformer) {
+
parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file",
 dfsBasePath + "/target.avsc");
 
 Review comment:
   Good catch.  I've found and fixed that in #1165.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784217
 
 

 ##
 File path: docs/_docs/2_2_writing_data.md
 ##
 @@ -156,41 +157,31 @@ inputDF.write()
 
 ## Syncing to Hive
 
-Both tools above support syncing of the dataset's latest schema to Hive 
metastore, such that queries can pick up new columns and partitions.
+Both tools above support syncing of the table's latest schema to Hive 
metastore, such that queries can pick up new columns and partitions.
In case it's preferable to run this from the command line or in an independent JVM, 
Hudi provides a `HiveSyncTool`, which can be invoked as below, 
-once you have built the hudi-hive module.
+once you have built the hudi-hive module. The following shows how to sync the 
table written above via the Datasource Writer to the Hive metastore.
+
+```java
+cd hudi-hive
./run_sync_tool.sh  --jdbc-url jdbc:hive2:\/\/hiveserver:10000 --user hive 
--pass hive --partitioned-by partition --base-path <basePath> --database 
default --table <tableName>
+```
+
+Starting with Hudi 0.5.1, the read optimized version of merge-on-read 
tables is suffixed '_ro' by default. For backwards compatibility with older 
Hudi versions, 
+an optional HiveSyncConfig, `--skip-ro-suffix`, has been provided to turn off 
the '_ro' suffixing if desired. Explore other hive sync options using the following 
command:
 
 ```java
 cd hudi-hive
 ./run_sync_tool.sh
  [hudi-hive]$ ./run_sync_tool.sh --help
-Usage:  [options]
-  Options:
-  * --base-path
-   Basepath of Hudi dataset to sync
-  * --database
-   name of the target database in Hive
---help, -h
-   Default: false
-  * --jdbc-url
-   Hive jdbc connect url
-  * --use-jdbc
-   Whether to use jdbc connection or hive metastore (via thrift)
-  * --pass
-   Hive password
-  * --table
-   name of the target table in Hive
-  * --user
-   Hive username
 ```
 
 ## Deletes 
 
-Hudi supports implementing two types of deletes on data stored in Hudi 
datasets, by enabling the user to specify a different record payload 
implementation. 
+Hudi supports implementing two types of deletes on data stored in Hudi tables, 
by enabling the user to specify a different record payload implementation. 
 
 Review comment:
   Let's link to the delete blog from here?
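
   For readers of this thread, a minimal deletion sketch via the datasource (assumptions: a `deleteDF` DataFrame holding the keys to delete, the field names, and the target path; option keys per the 0.5.x datasource):

```scala
// Delete records by upserting them with an empty payload.
deleteDF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.payload.class",
          "org.apache.hudi.EmptyHoodieRecordPayload")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.table.name", "hudi_trips")
  .mode("append")
  .save("/tmp/hudi_trips")
```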




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783925
 
 

 ##
 File path: docs/_docs/2_1_concepts.md
 ##
 @@ -1,37 +1,37 @@
 ---
 title: "Concepts"
-keywords: hudi, design, storage, views, timeline
+keywords: hudi, design, table, queries, timeline
 permalink: /docs/concepts.html
 summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over datasets on DFS
+Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over hadoop compatible storages
 
- * Upsert (how do I change the dataset?)
- * Incremental pull   (how do I fetch data that changed?)
+ * Update/Delete Records  (how do I change records in a table?)
+ * Change Streams (how do I fetch data that changed?)
 
 Review comment:
   how do I fetch `records` that changed?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783760
 
 

 ##
 File path: docs/_docs/1_3_use_cases.md
 ##
 @@ -20,7 +20,7 @@ or [complicated handcrafted merge 
workflows](http://hortonworks.com/blog/four-st
 For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / 
[Voldemort](http://www.project-voldemort.com/voldemort/) / 
[HBase](https://hbase.apache.org/), even moderately big installations store 
billions of rows.
 It goes without saying that __full bulk loads are simply infeasible__ and more 
efficient approaches are needed if ingestion is to keep up with the typically 
high update volumes.
 
-Even for immutable data sources like [Kafka](kafka.apache.org) , Hudi helps 
__enforces a minimum file size on HDFS__, which improves NameNode health by 
solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since typically 
its higher volume (eg: click streams) and if not managed well, can cause 
serious damage to your Hadoop cluster.
+Even for immutable data sources like [Kafka](http://kafka.apache.org), Hudi 
helps __enforce a minimum file size on HDFS__, which improves NameNode health 
by solving one of the [age old problems in Hadoop 
land](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) in a 
holistic way. This is all the more important for event streams, since they are 
typically higher volume (e.g. click streams) and, if not managed well, can cause 
serious damage to your Hadoop cluster.
 
 Review comment:
   Ah, good catch.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784404
 
 

 ##
 File path: docs/_docs/2_3_querying_data.md
 ##
 @@ -1,47 +1,52 @@
 ---
-title: Querying Hudi Datasets
+title: Querying Hudi Tables
 keywords: hudi, hive, spark, sql, presto
 permalink: /docs/querying_data.html
 summary: In this page, we go over how to enable SQL queries on Hudi built 
tables.
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 
logical views on top, as explained [before](/docs/concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines 
like Hive, Spark and Presto.
+Conceptually, Hudi stores data physically once on DFS, while providing 3 
different ways of querying, as explained 
[before](/docs/concepts.html#query-types). 
+Once the table is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
+bundle has been provided, the table can be queried by popular query engines 
like Hive, Spark and Presto.
 
-Specifically, there are two Hive tables named off [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+Specifically, following Hive tables are registered based off [table 
name](/docs/configurations.html#TABLE_NAME_OPT_KEY) 
+and [table type](/docs/configurations.html#TABLE_TYPE_OPT_KEY) passed during 
write.   
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+If `table name = hudi_trips` and `table type = COPY_ON_WRITE`, then we get: 
+ - `hudi_trips` supports snapshot querying and incremental querying of the 
table backed by `HoodieParquetInputFormat`, exposing purely columnar data.
+
+
+If `table name = hudi_trips` and `table type = MERGE_ON_READ`, then we get:
+ - `hudi_trips_rt` supports snapshot querying and incremental querying 
(providing near-real time data) of the table  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_trips_ro` supports read optimized querying of the table backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
+ 
 
 As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi 
datasets can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
+is `incremental pulls` (to obtain a change stream/log from a table). Hudi 
tables can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
 since a specified instant time. This, together with upserts, are particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi dataset. Incremental view is 
realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the dataset. 
+joined with other tables (tables/dimensions), to [write out 
deltas](/docs/writing_data.html) to a target Hudi table. Incremental view is 
realized by querying one of the tables above, 
+with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the table. 
 
-In sections, below we will discuss in detail how to access all the 3 views on 
each query engine.
+In the sections below, we will discuss how to access these query types from 
different query engines.
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the 
HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
+In order for Hive to recognize Hudi tables and query correctly, the 
HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
 in its [aux jars 
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
 This will ensure the input format 
 classes with its dependencies are available for query planning & execution. 
 
-### Read Optimized table
+### Read optimized querying
 In addition to setup above, for beeline cli access, the 

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368783683
 
 

 ##
 File path: docs/_docs/1_2_structure.md
 ##
 @@ -6,16 +6,16 @@ summary: "Hudi brings stream processing to big data, 
providing fresh data while
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical 
datasets over DFS 
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
 or cloud stores) and provides three logical views for query access.
+Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical 
tables over DFS 
([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
 or cloud stores) and provides three types of querying.
 
- * **Read Optimized View** - Provides excellent query performance on pure 
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
- * **Incremental View** - Provides a change stream out of the dataset to feed 
downstream jobs/ETLs.
- * **Near-Real time Table** - Provides queries on real-time data, using a 
combination of columnar & row based storage (e.g Parquet + 
[Avro](http://avro.apache.org/docs/current/mr.html))
+ * **Read Optimized querying** - Provides excellent query performance on pure 
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
 
 Review comment:
   just `Query` and not `querying`? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1260: [WIP] [HUDI-510] 
Update site documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#discussion_r368784018
 
 

 ##
 File path: docs/_docs/2_1_concepts.md
 ##
 @@ -53,69 +53,70 @@ With the help of the timeline, an incremental query 
attempting to get all new da
 only the changed files without say scanning all the time buckets > 07:00.
 
 ## File management
-Hudi organizes a datasets into a directory structure under a `basepath` on 
DFS. Dataset is broken up into partitions, which are folders containing data 
files for that partition,
+Hudi organizes a table into a directory structure under a `basepath` on DFS. 
Table is broken up into partitions, which are folders containing data files for 
that partition,
 very similar to Hive tables. Each partition is uniquely identified by its 
`partitionpath`, which is relative to the basepath.
 
 Within each partition, files are organized into `file groups`, uniquely 
identified by a `file id`. Each file group contains several
-`file slices`, where each slice contains a base columnar file (`*.parquet`) 
produced at a certain commit/compaction instant time,
+`file slices`, where each slice contains a base file (`*.parquet`) produced at 
a certain commit/compaction instant time,
  along with set of log files (`*.log.*`) that contain inserts/updates to the 
base file since the base file was produced. 
 Hudi adopts a MVCC design, where compaction action merges logs and base files 
to produce new file slices and cleaning action gets rid of 
 unused/older file slices to reclaim space on DFS. 
 
-Hudi provides efficient upserts, by mapping a given hoodie key (record key + 
partition path) consistently to a file group, via an indexing mechanism. 
+## Index
+Hudi provides efficient upserts, by mapping a given hoodie key (record key + 
partition path) consistently to a file id, via an indexing mechanism. 
 This mapping between record key and file group/file id, never changes once the 
first version of a record has been written to a file. In short, the 
 mapped file group contains all versions of a group of records.
 
-## Storage Types & Views
-Hudi storage types define how data is indexed & laid out on the DFS and how 
the above primitives and timeline activities are implemented on top of such 
organization (i.e how data is written). 
-In turn, `views` define how the underlying data is exposed to the queries (i.e 
how data is read). 
+## Table Types & Querying
 
 Review comment:
   and Queries (instead of Querying)? 




[jira] [Commented] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019805#comment-17019805
 ] 

Vinoth Chandar commented on HUDI-403:
-

[~vbalaji] I have covered everything except deltastreamer and compaction 
deployment. Can you take a stab at this? The idea is to give users enough 
easy instructions to schedule compactions as they wish to. 

> Publish a deployment guide talking about deployment options, upgrading etc
> --
>
> Key: HUDI-403
> URL: https://issues.apache.org/jira/browse/HUDI-403
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Things to cover 
>  # Upgrade readers first, Upgrade writers next, Principles of compatibility 
> followed
>  # DeltaStreamer Deployment models
>  # Scheduling Compactions.
>  





[GitHub] [incubator-hudi] vinothchandar commented on issue #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
vinothchandar commented on issue #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#issuecomment-576488099
 
 
   @bvaradar can you please review and merge?




[jira] [Assigned] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2020-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-403:
---

Assignee: Balaji Varadarajan  (was: Vinoth Chandar)

> Publish a deployment guide talking about deployment options, upgrading etc
> --
>
> Key: HUDI-403
> URL: https://issues.apache.org/jira/browse/HUDI-403
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Things to cover 
>  # Upgrade readers first, Upgrade writers next, Principles of compatibility 
> followed
>  # DeltaStreamer Deployment models
>  # Scheduling Compactions.
>  





[GitHub] [incubator-hudi] vinothchandar opened a new pull request #1261: [HUDI-403] Adds guidelines on deployment/upgrading

2020-01-20 Thread GitBox
vinothchandar opened a new pull request #1261: [HUDI-403] Adds guidelines on 
deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261
 
 
- Moved "Adminsitering" page to "Deployment"
- Still need to add information on deltastreamer modes/compaction
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2020-01-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-403:

Labels: pull-request-available  (was: )

> Publish a deployment guide talking about deployment options, upgrading etc
> --
>
> Key: HUDI-403
> URL: https://issues.apache.org/jira/browse/HUDI-403
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> Things to cover 
>  # Upgrade readers first, Upgrade writers next, Principles of compatibility 
> followed
>  # DeltaStreamer Deployment models
>  # Scheduling Compactions.
>  





[GitHub] [incubator-hudi] bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation 
in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576481756
 
 
   @leesf / @yanghua can you please help review this PR. Also, this might be 
needed in the corresponding cn pages too. Need your help there as well. Thanks!




[incubator-hudi] branch release-0.5.1 updated: [MINOR] Download KEYS file when validating release candidate (#1259)

2020-01-20 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch release-0.5.1
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/release-0.5.1 by this push:
 new bb58be1  [MINOR] Download KEYS file when validating release candidate 
(#1259)
bb58be1 is described below

commit bb58be1c56c6a98ae443557251fa0d023553f6d3
Author: Balaji Varadarajan 
AuthorDate: Mon Jan 20 17:20:47 2020 -0800

[MINOR] Download KEYS file when validating release candidate (#1259)
---
 scripts/release/validate_staged_release.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/release/validate_staged_release.sh 
b/scripts/release/validate_staged_release.sh
index 31eff95..429047b 100755
--- a/scripts/release/validate_staged_release.sh
+++ b/scripts/release/validate_staged_release.sh
@@ -84,6 +84,9 @@ echo "Checking Checksum of Source Release"
 diff -u hudi-${RELEASE_VERSION}-incubating-rc${RC_NUM}.src.tgz.sha512 
got.sha512 
 echo -e "\t\tChecksum Check of Source Release - [OK]\n"
 
+# Download KEYS file
+curl https://dist.apache.org/repos/dist/release/incubator/hudi/KEYS > ../KEYS
+
 # GPG Check
 echo "Checking Signature"
 (bash -c "gpg --import ../KEYS $REDIRECT" && bash -c "gpg --verify 
hudi-${RELEASE_VERSION}-incubating-rc${RC_NUM}.src.tgz.asc 
hudi-${RELEASE_VERSION}-incubating-rc${RC_NUM}.src.tgz $REDIRECT" && echo -e 
"\t\tSignature Check - [OK]\n") || (echo -e "\t\tSignature Check - [FAILED] - 
Run with --verbose to get details\n" && exit -1)



[GitHub] [incubator-hudi] bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
bhasudha commented on issue #1260: [WIP] [HUDI-510] Update site documentation 
in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260#issuecomment-576480047
 
 
   @vinothchandar I kept it as WIP as I am still working on other changes such 
as scala version, quickstart fix etc. But wanted to send out a PR so the 
renaming part can be reviewed.




[jira] [Updated] (HUDI-510) Update site documentation in sync with cWiki

2020-01-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-510:

Labels: pull-request-available  (was: )

> Update site documentation in sync with cWiki
> 
>
> Key: HUDI-510
> URL: https://issues.apache.org/jira/browse/HUDI-510
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>






[GitHub] [incubator-hudi] bhasudha opened a new pull request #1260: [WIP] [HUDI-510] Update site documentation in sync with cWiki

2020-01-20 Thread GitBox
bhasudha opened a new pull request #1260: [WIP] [HUDI-510] Update site 
documentation in sync with cWiki
URL: https://github.com/apache/incubator-hudi/pull/1260
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1233: [HUDI-335] Improvements to DiskBasedMap used by ExternalSpillableMap,…

2020-01-20 Thread GitBox
bvaradar commented on a change in pull request #1233: [HUDI-335] Improvements 
to DiskBasedMap used by ExternalSpillableMap,…
URL: https://github.com/apache/incubator-hudi/pull/1233#discussion_r368775609
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.log4j.Logger;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.nio.ByteBuffer;
+
+/**
+ * Use a private buffer for the read/write/seek operations of the 
RandomAccessFile
 
 Review comment:
   @vinothchandar @nbalajee : I think we should still retain the LICENSE part 
and the reference in this file to be conservative as this implementation was 
based on the Cassandra source code and the structure is essentially the same. It is 
better to err on the side of caution here. Let me know if you strongly think 
otherwise (cc @n3nash ).




[incubator-hudi] branch master updated (6e59c1c -> 924bf51)

2020-01-20 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 6e59c1c  Moving to 0.5.2-SNAPSHOT on master branch.
 add 924bf51  [MINOR] Download KEYS file when validating release candidate 
(#1259)

No new revisions were added by this update.

Summary of changes:
 scripts/release/validate_staged_release.sh | 3 +++
 1 file changed, 3 insertions(+)



[GitHub] [incubator-hudi] leesf merged pull request #1259: [MINOR] Download KEYS file when validating release candidate

2020-01-20 Thread GitBox
leesf merged pull request #1259: [MINOR] Download KEYS file when validating 
release candidate
URL: https://github.com/apache/incubator-hudi/pull/1259
 
 
   




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

2020-01-20 Thread GitBox
nsivabalan commented on a change in pull request #1246: [HUDI-552] Fix the 
schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246#discussion_r368772209
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java
 ##
 @@ -620,6 +636,62 @@ public void testDistributedTestDataSource() {
 Assert.assertEquals(1000, c);
   }
 
+  private static void prepareParquetDFSFiles(int numRecords) throws 
IOException {
+String path = PARQUET_SOURCE_ROOT + "/1.parquet";
+HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+Helpers.saveParquetToDFS(Helpers.toGenericRecords(
+dataGenerator.generateInserts("000", numRecords), dataGenerator), new 
Path(path));
+  }
+
+  private void prepareParquetDFSSource(boolean useSchemaProvider, boolean 
hasTransformer) throws IOException {
+// Properties used for testing delta-streamer with Parquet source
+TypedProperties parquetProps = new TypedProperties();
+parquetProps.setProperty("include", "base.properties");
+parquetProps.setProperty("hoodie.datasource.write.recordkey.field", 
"_row_key");
+parquetProps.setProperty("hoodie.datasource.write.partitionpath.field", 
"not_there");
+if (useSchemaProvider) {
+  
parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file",
 dfsBasePath + "/source.avsc");
+  if (hasTransformer) {
+
parquetProps.setProperty("hoodie.deltastreamer.schemaprovider.source.schema.file",
 dfsBasePath + "/target.avsc");
 
 Review comment:
   is the key to this property right? Isn't ".target.schema.file" ?




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1233: [HUDI-335] Improvements to DiskBasedMap used by ExternalSpillableMap,…

2020-01-20 Thread GitBox
vinothchandar commented on a change in pull request #1233: [HUDI-335] 
Improvements to DiskBasedMap used by ExternalSpillableMap,…
URL: https://github.com/apache/incubator-hudi/pull/1233#discussion_r368751754
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.log4j.Logger;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.nio.ByteBuffer;
+
+/**
+ * Use a private buffer for the read/write/seek operations of the 
RandomAccessFile
 
 Review comment:
   Thanks! 
   @bvaradar @n3nash should we then back out the license change? I am 
confused. 




[GitHub] [incubator-hudi] nbalajee commented on a change in pull request #1233: [HUDI-335] Improvements to DiskBasedMap used by ExternalSpillableMap,…

2020-01-20 Thread GitBox
nbalajee commented on a change in pull request #1233: [HUDI-335] Improvements 
to DiskBasedMap used by ExternalSpillableMap,…
URL: https://github.com/apache/incubator-hudi/pull/1233#discussion_r368729156
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.log4j.Logger;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.nio.ByteBuffer;
+
+/**
+ * Use a private buffer for the read/write/seek operations of the 
RandomAccessFile
 
 Review comment:
   Yes Vinoth.  This is our own implementation of BufferedRandomAccessFile.




[GitHub] [incubator-hudi] bvaradar commented on issue #1259: [MINOR] Download KEYS file when validating release candidate

2020-01-20 Thread GitBox
bvaradar commented on issue #1259: [MINOR] Download KEYS file when validating 
release candidate
URL: https://github.com/apache/incubator-hudi/pull/1259#issuecomment-576419074
 
 
   @leesf : You would need this fix before you send the email to dev@ for 
voting. Kindly check if release validation works and succeeds before the voting 
email. 




[GitHub] [incubator-hudi] bvaradar opened a new pull request #1259: [MINOR] Download KEYS file when validating release candidate

2020-01-20 Thread GitBox
bvaradar opened a new pull request #1259: [MINOR] Download KEYS file when 
validating release candidate
URL: https://github.com/apache/incubator-hudi/pull/1259
 
 
   As KEYS file is no longer present in repo, here is the change in release 
validation script to download it 




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

2020-01-20 Thread GitBox
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce 
ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368711970
 
 

 ##
 File path: hudi-cli/src/main/scala/org/apache/hudi/cli/SparkHelpers.scala
 ##
 @@ -43,7 +43,7 @@ object SparkHelpers {
 val schema: Schema = sourceRecords.get(0).getSchema
 val filter: BloomFilter = 
BloomFilterFactory.createBloomFilter(HoodieIndexConfig.DEFAULT_BLOOM_FILTER_NUM_ENTRIES.toInt,
 HoodieIndexConfig.DEFAULT_BLOOM_FILTER_FPP.toDouble,
   
HoodieIndexConfig.DEFAULT_HOODIE_BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES.toInt, 
HoodieIndexConfig.DEFAULT_BLOOM_INDEX_FILTER_TYPE);
-val writeSupport: HoodieAvroWriteSupport = new HoodieAvroWriteSupport(new 
AvroSchemaConverter().convert(schema), schema, filter)
+val writeSupport: HoodieAvroWriteSupport = new HoodieAvroWriteSupport(new 
AvroSchemaConverter().convert(schema), schema, filter, 
java.lang.Boolean.valueOf(HoodieIndexConfig.BLOOM_INDEX_ENABLE_COMPRESSION))
 
 Review comment:
   replace Boolean.valueOf() with Boolean.parseBoolean().
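   
   A minimal, self-contained Java sketch of the distinction behind this 
   suggestion (and the related Boolean.toString() one elsewhere on this PR); 
   this is plain JDK behavior, not Hudi code:
   
   // Plain-JDK sketch: Boolean.valueOf(String) returns a boxed
   // java.lang.Boolean, while Boolean.parseBoolean(String) returns the
   // primitive boolean directly, so no unboxing happens when the result
   // feeds a primitive field or argument.
   public class BooleanParsingSketch {
     public static void main(String[] args) {
       String prop = "true";
   
       Boolean boxed = Boolean.valueOf(prop);          // boxed object result
       boolean primitive = Boolean.parseBoolean(prop); // primitive, no boxing
   
       // For the reverse direction, String.valueOf(boolean) simply delegates
       // to Boolean.toString(boolean); calling the latter is more direct.
       System.out.println(Boolean.toString(primitive) + " / " + boxed);
     }
   }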




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

2020-01-20 Thread GitBox
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce 
ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368711264
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##
 @@ -166,6 +168,11 @@ public Builder bloomIndexBucketizedChecking(boolean 
bucketizedChecking) {
   return this;
 }
 
+public Builder bloomIndexEnableCompression(boolean enableCompression) {
+  props.setProperty(BLOOM_INDEX_ENABLE_COMPRESSION, 
String.valueOf(enableCompression));
 
 Review comment:
   just call Boolean.toString() instead of String.valueOf()




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

2020-01-20 Thread GitBox
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce 
ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368710257
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
 ##
 @@ -149,13 +150,26 @@ public static BloomFilter 
readBloomFilterFromParquetMetadata(Configuration confi
 readParquetFooter(configuration, false, parquetFilePath,
 HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
 HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY,
-HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE);
+HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE,
+HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED,
+HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_COMPRESSION_TYPE);
 String footerVal = 
footerVals.get(HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
 if (null == footerVal) {
   // We use old style key "com.uber.hoodie.bloomfilter"
   footerVal = 
footerVals.get(HoodieAvroWriteSupport.OLD_HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
 }
 BloomFilter toReturn = null;
+boolean isCompressed = false;
+if 
(footerVals.containsKey(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED))
 {
+  isCompressed = 
Boolean.valueOf(footerVals.get(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_IS_COMPRESSED));
+  if (isCompressed) {
+String compressionType = 
footerVals.get(HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_COMPRESSION_TYPE);
+
Preconditions.checkArgument(compressionType.equals(GzipCompressionUtils.TYPE),
 
 Review comment:
   this can be replaced with ValidationUtils.checkArgument() once the PR# 1159 
has been merged 
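   
   For readers without PR# 1159 at hand, a hypothetical, minimal shape of 
   such a checkArgument helper (the actual ValidationUtils in that PR may 
   differ in details):
   
   // Hypothetical minimal form of a Preconditions-style helper; names and
   // behavior below are illustrative, not the merged implementation.
   public final class ValidationUtilsSketch {
   
     private ValidationUtilsSketch() {
     }
   
     public static void checkArgument(boolean condition, String message) {
       if (!condition) {
         throw new IllegalArgumentException(message);
       }
     }
   
     public static void main(String[] args) {
       // Mirrors the usage pattern in the snippet under review:
       String compressionType = "GZIP";
       checkArgument("GZIP".equals(compressionType),
           "Only GZIP compression is supported for stored bloom filters");
     }
   }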




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1253: [HUDI-558] Introduce ability to compress bloom filters while storing in parquet

2020-01-20 Thread GitBox
smarthi commented on a change in pull request #1253: [HUDI-558] Introduce 
ability to compress bloom filters while storing in parquet
URL: https://github.com/apache/incubator-hudi/pull/1253#discussion_r368708574
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -318,6 +318,10 @@ public double getBloomFilterFPP() {
 return 
Double.parseDouble(props.getProperty(HoodieIndexConfig.BLOOM_FILTER_FPP));
   }
 
+  public boolean isBloomFilterCompressionEnabled() {
+return 
Boolean.valueOf(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_ENABLE_COMPRESSION));
 
 Review comment:
   use Boolean.parseBoolean() instead?




[jira] [Updated] (HUDI-76) CSV Source support for Hudi Delta Streamer

2020-01-20 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-76:
--
Fix Version/s: (was: 0.5.1)
   0.6.0

> CSV Source support for Hudi Delta Streamer
> --
>
> Key: HUDI-76
> URL: https://issues.apache.org/jira/browse/HUDI-76
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Incremental Pull
>Reporter: Balaji Varadarajan
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> DeltaStreamer does not have support to pull CSV data from sources (hdfs log 
> files/kafka). This ticket is to provide support for CSV sources.





[jira] [Updated] (HUDI-76) CSV Source support for Hudi Delta Streamer

2020-01-20 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-76:
--
Priority: Major  (was: Minor)

> CSV Source support for Hudi Delta Streamer
> --
>
> Key: HUDI-76
> URL: https://issues.apache.org/jira/browse/HUDI-76
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Incremental Pull
>Reporter: Balaji Varadarajan
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> DeltaStreamer does not have support to pull CSV data from sources (hdfs log 
> files/kafka). This ticket is to provide support for CSV sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-76) CSV Source support for Hudi Delta Streamer

2020-01-20 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-76:
--
Fix Version/s: (was: 0.6.0)
   0.5.2

> CSV Source support for Hudi Delta Streamer
> --
>
> Key: HUDI-76
> URL: https://issues.apache.org/jira/browse/HUDI-76
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Incremental Pull
>Reporter: Balaji Varadarajan
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> DeltaStreamer does not have support to pull CSV data from sources (hdfs log 
> files/kafka). This ticket is to provide support for CSV sources.





[incubator-hudi] branch master updated: Moving to 0.5.2-SNAPSHOT on master branch.

2020-01-20 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6e59c1c  Moving to 0.5.2-SNAPSHOT on master branch.
6e59c1c is described below

commit 6e59c1c77798e53f74f581856a3609239f50a877
Author: leesf <490081...@qq.com>
AuthorDate: Mon Jan 20 18:57:26 2020 +0800

Moving to 0.5.2-SNAPSHOT on master branch.
---
 docker/hoodie/hadoop/base/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml | 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml| 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml| 2 +-
 docker/hoodie/hadoop/namenode/pom.xml | 2 +-
 docker/hoodie/hadoop/pom.xml  | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml  | 2 +-
 hudi-cli/pom.xml  | 2 +-
 hudi-client/pom.xml   | 2 +-
 hudi-common/pom.xml   | 2 +-
 hudi-hadoop-mr/pom.xml| 2 +-
 hudi-hive/pom.xml | 2 +-
 hudi-integ-test/pom.xml   | 2 +-
 hudi-spark/pom.xml| 2 +-
 hudi-timeline-service/pom.xml | 2 +-
 hudi-utilities/pom.xml| 2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml   | 2 +-
 packaging/hudi-hive-bundle/pom.xml| 2 +-
 packaging/hudi-presto-bundle/pom.xml  | 2 +-
 packaging/hudi-spark-bundle/pom.xml   | 2 +-
 packaging/hudi-timeline-server-bundle/pom.xml | 2 +-
 packaging/hudi-utilities-bundle/pom.xml   | 2 +-
 pom.xml   | 2 +-
 27 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/docker/hoodie/hadoop/base/pom.xml b/docker/hoodie/hadoop/base/pom.xml
index 972da32..0cbd377 100644
--- a/docker/hoodie/hadoop/base/pom.xml
+++ b/docker/hoodie/hadoop/base/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/datanode/pom.xml b/docker/hoodie/hadoop/datanode/pom.xml
index fa52b12..034aebe 100644
--- a/docker/hoodie/hadoop/datanode/pom.xml
+++ b/docker/hoodie/hadoop/datanode/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/historyserver/pom.xml b/docker/hoodie/hadoop/historyserver/pom.xml
index c84abc0..b41ca5c 100644
--- a/docker/hoodie/hadoop/historyserver/pom.xml
+++ b/docker/hoodie/hadoop/historyserver/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/hive_base/pom.xml b/docker/hoodie/hadoop/hive_base/pom.xml
index 5af64d0..7e5db2e 100644
--- a/docker/hoodie/hadoop/hive_base/pom.xml
+++ b/docker/hoodie/hadoop/hive_base/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/namenode/pom.xml b/docker/hoodie/hadoop/namenode/pom.xml
index 06deb12..c35ff45 100644
--- a/docker/hoodie/hadoop/namenode/pom.xml
+++ b/docker/hoodie/hadoop/namenode/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/pom.xml b/docker/hoodie/hadoop/pom.xml
index 84380f7..e2d0482 100644
--- a/docker/hoodie/hadoop/pom.xml
+++ b/docker/hoodie/hadoop/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
     <relativePath>../../../pom.xml</relativePath>
   </parent>
   <modelVersion>4.0.0</modelVersion>
diff --git a/docker/hoodie/hadoop/prestobase/pom.xml b/docker/hoodie/hadoop/prestobase/pom.xml
index f9b6180..4a1b7fa 100644
--- a/docker/hoodie/hadoop/prestobase/pom.xml
+++ b/docker/hoodie/hadoop/prestobase/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/spark_base/pom.xml b/docker/hoodie/hadoop/spark_base/pom.xml
index ee1b2f2..e9a4d5a 100644
--- a/docker/hoodie/hadoop/spark_base/pom.xml
+++ b/docker/hoodie/hadoop/spark_base/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.2-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/sparkadhoc/pom.xml b/docker/hoodie/hadoop/sparkadhoc/pom.xml
index bb6ebb0..1e008e5 100644
--- a/docker/hoodie/hadoop/sparkadhoc/pom.xml
+++ b/docker/hoodie/hadoop/sparkadhoc/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
- 

[GitHub] [incubator-hudi] bvaradar merged pull request #1257: Moving to 0.5.2-SNAPSHOT on master branch.

2020-01-20 Thread GitBox
bvaradar merged pull request #1257: Moving to 0.5.2-SNAPSHOT on master branch.
URL: https://github.com/apache/incubator-hudi/pull/1257
 
 
   




[incubator-hudi] branch release-0.5.1 updated: Preparing for Release 0.5.1-incubating-rc1

2020-01-20 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch release-0.5.1
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/release-0.5.1 by this push:
 new 94ed9b9  Preparing for Release 0.5.1-incubating-rc1
94ed9b9 is described below

commit 94ed9b9122b664d16b633cc36bfd161fa0d117c5
Author: leesf <490081...@qq.com>
AuthorDate: Mon Jan 20 19:17:20 2020 +0800

Preparing for Release 0.5.1-incubating-rc1
---
 docker/hoodie/hadoop/base/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml | 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml| 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml| 2 +-
 docker/hoodie/hadoop/namenode/pom.xml | 2 +-
 docker/hoodie/hadoop/pom.xml  | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml  | 2 +-
 hudi-cli/pom.xml  | 2 +-
 hudi-client/pom.xml   | 2 +-
 hudi-common/pom.xml   | 2 +-
 hudi-hadoop-mr/pom.xml| 2 +-
 hudi-hive/pom.xml | 2 +-
 hudi-integ-test/pom.xml   | 2 +-
 hudi-spark/pom.xml| 2 +-
 hudi-timeline-service/pom.xml | 2 +-
 hudi-utilities/pom.xml| 2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml   | 2 +-
 packaging/hudi-hive-bundle/pom.xml| 2 +-
 packaging/hudi-presto-bundle/pom.xml  | 2 +-
 packaging/hudi-spark-bundle/pom.xml   | 2 +-
 packaging/hudi-timeline-server-bundle/pom.xml | 2 +-
 packaging/hudi-utilities-bundle/pom.xml   | 2 +-
 pom.xml   | 2 +-
 27 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/docker/hoodie/hadoop/base/pom.xml b/docker/hoodie/hadoop/base/pom.xml
index 972da32..8b6c421 100644
--- a/docker/hoodie/hadoop/base/pom.xml
+++ b/docker/hoodie/hadoop/base/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/datanode/pom.xml b/docker/hoodie/hadoop/datanode/pom.xml
index fa52b12..8b6710a 100644
--- a/docker/hoodie/hadoop/datanode/pom.xml
+++ b/docker/hoodie/hadoop/datanode/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/historyserver/pom.xml b/docker/hoodie/hadoop/historyserver/pom.xml
index c84abc0..1d22df5 100644
--- a/docker/hoodie/hadoop/historyserver/pom.xml
+++ b/docker/hoodie/hadoop/historyserver/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/hive_base/pom.xml b/docker/hoodie/hadoop/hive_base/pom.xml
index 5af64d0..f83d2ad 100644
--- a/docker/hoodie/hadoop/hive_base/pom.xml
+++ b/docker/hoodie/hadoop/hive_base/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/namenode/pom.xml b/docker/hoodie/hadoop/namenode/pom.xml
index 06deb12..49b23e6 100644
--- a/docker/hoodie/hadoop/namenode/pom.xml
+++ b/docker/hoodie/hadoop/namenode/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/pom.xml b/docker/hoodie/hadoop/pom.xml
index 84380f7..afe6693 100644
--- a/docker/hoodie/hadoop/pom.xml
+++ b/docker/hoodie/hadoop/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
     <relativePath>../../../pom.xml</relativePath>
   </parent>
   <modelVersion>4.0.0</modelVersion>
diff --git a/docker/hoodie/hadoop/prestobase/pom.xml b/docker/hoodie/hadoop/prestobase/pom.xml
index f9b6180..6eb9d27 100644
--- a/docker/hoodie/hadoop/prestobase/pom.xml
+++ b/docker/hoodie/hadoop/prestobase/pom.xml
@@ -22,7 +22,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/spark_base/pom.xml b/docker/hoodie/hadoop/spark_base/pom.xml
index ee1b2f2..1cd6d77 100644
--- a/docker/hoodie/hadoop/spark_base/pom.xml
+++ b/docker/hoodie/hadoop/spark_base/pom.xml
@@ -19,7 +19,7 @@
   <parent>
     <artifactId>hudi-hadoop-docker</artifactId>
     <groupId>org.apache.hudi</groupId>
-    <version>0.5.1-SNAPSHOT</version>
+    <version>0.5.1-incubating-rc1</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <packaging>pom</packaging>
diff --git a/docker/hoodie/hadoop/sparkadhoc/pom.xml b/docker/hoodie/hadoop/sparkadhoc/pom.xml
index bb6ebb0..8edb652 100644
--- a/docker/hoodie/hadoop/sparkadhoc/pom.xml
+++ b/docker/hoodie/hadoop/sparkadhoc/pom.xml
@@ 

[GitHub] [incubator-hudi] bvaradar merged pull request #1258: Preparing for Release 0.5.1-incubating-rc1

2020-01-20 Thread GitBox
bvaradar merged pull request #1258: Preparing for Release 0.5.1-incubating-rc1
URL: https://github.com/apache/incubator-hudi/pull/1258
 
 
   




[GitHub] [incubator-hudi] vinothchandar commented on issue #1149: [WIP] [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-01-20 Thread GitBox
vinothchandar commented on issue #1149: [WIP] [HUDI-472] Introduce 
configurations and new modes of sorting for bulk_insert
URL: https://github.com/apache/incubator-hudi/pull/1149#issuecomment-576387868
 
 
   @umehrot2 fyi 




[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019555#comment-17019555
 ] 

Vinoth Chandar commented on HUDI-538:
-

[~yanghua] the sources in hudi-utilities are simply Hoodie's own abstractions 
for DeltaStreamer. Initially, the DeltaStreamer will not work over Flink, and 
that may be okay? 

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.





[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1151: [WIP] [HUDI-476] Add hudi-examples module

2020-01-20 Thread GitBox
pratyakshsharma commented on a change in pull request #1151: [WIP] [HUDI-476] 
Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#discussion_r368602223
 
 

 ##
 File path: 
hudi-examples/src/main/java/org/apache/hudi/examples/spark/HoodieWriteClientExample.java
 ##
 @@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.examples.spark;
+
+import org.apache.hudi.HoodieWriteClient;
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.examples.common.HoodieExampleDataGenerator;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+
+/**
+ * Simple examples of #{@link HoodieWriteClient}.
+ *
+ * To run this example, you should
+ *   1. For running in IDE, set VM options `-Dspark.master=local[2]`
+ *   2. For running in shell, using `spark-submit`
+ *
+ * Usage: HoodieWriteClientExample <tablePath> <tableName>
+ *  <tablePath> and <tableName> describe root path of hudi and table name
+ * for example, `HoodieWriteClientExample file:///tmp/hoodie/sample-table 
hoodie_rt`
+ */
+public class HoodieWriteClientExample {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieWriteClientExample.class);
+
+  private static String tableType = HoodieTableType.COPY_ON_WRITE.name();
+
+  public static void main(String[] args) throws Exception {
+if (args.length < 2) {
+  System.err.println("Usage: HoodieWriteClientExample  
");
+  System.exit(1);
+}
+String tablePath = args[0];
+String tableName = args[1];
+SparkConf sparkConf = new SparkConf().setAppName("hoodie-client-example");
+sparkConf.set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer");
+sparkConf.set("spark.kryoserializer.buffer.max", "512m");
+sparkConf.set("spark.some.config.option", "some-value");
 
 Review comment:
   can we expose a function to get sparkConf in some utility class in this 
module? I see this is duplicate code in every class.  @dengziming 
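   
   A sketch of the kind of shared helper being requested; the class and 
   method names here are hypothetical, not from the PR:
   
   import org.apache.spark.SparkConf;
   
   // Hypothetical utility consolidating the SparkConf boilerplate repeated
   // across the example classes; names and settings are illustrative only.
   public final class HoodieExampleSparkConfSketch {
   
     private HoodieExampleSparkConfSketch() {
     }
   
     public static SparkConf defaultConf(String appName) {
       return new SparkConf()
           .setAppName(appName)
           .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .set("spark.kryoserializer.buffer.max", "512m");
     }
   }
   
   Each example's main method could then start with a single 
   HoodieExampleSparkConfSketch.defaultConf("hoodie-client-example") call.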




[jira] [Updated] (HUDI-539) No FileSystem for scheme: abfss

2020-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-539:

Fix Version/s: 0.6.0

> No FileSystem for scheme: abfss
> ---
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Priority: Major
> Fix For: 0.6.0
>
>
> Hi,
>  I'm trying to use hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file 
> schemes. The issue I'm facing is that in {{HoodieROTablePathFilter}} it tries 
> to get a file path passing in a blank hadoop configuration. This manifests as 
> {{java.io.IOException: No FileSystem for scheme: abfss}} because it doesn't 
> have any of the configuration in the environment.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> Stacktrace
> java.io.IOException: No FileSystem for scheme: abfss
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
> at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349)
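To make the failure mode concrete: a FileSystem built from a blank Hadoop 
Configuration loses the fs.* bindings the job is actually running with, so 
schemes registered only at runtime (abfss here) cannot be resolved. Below is 
a minimal sketch of one way a path filter can pick up the caller's 
configuration instead; it illustrates the general remedy, not the actual 
Hudi fix:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch only: extending Configured lets the framework inject the live
// Hadoop configuration, instead of the filter constructing an empty one.
public class ConfiguredPathFilterSketch extends Configured implements PathFilter {

  @Override
  public boolean accept(Path path) {
    try {
      // Fall back to a fresh Configuration only when nothing was injected;
      // with the injected conf, the fs.abfss.* settings remain visible here.
      Configuration conf = getConf() != null ? getConf() : new Configuration();
      FileSystem fs = path.getFileSystem(conf);
      return fs.exists(path); // placeholder predicate, not Hudi's logic
    } catch (IOException e) {
      throw new RuntimeException("Failed to resolve filesystem for " + path, e);
    }
  }
}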





[jira] [Assigned] (HUDI-539) No FileSystem for scheme: abfss

2020-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-539:
---

Assignee: Vinoth Chandar

> No FileSystem for scheme: abfss
> ---
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Hi,
>  I'm trying to use hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file 
> schemes. The issue I'm facing is that in {{HoodieROTablePathFilter}} it tries 
> to get a file path passing in a blank hadoop configuration. This manifests as 
> {{java.io.IOException: No FileSystem for scheme: abfss}} because it doesn't 
> have any of the configuration in the environment.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> Stacktrace
> java.io.IOException: No FileSystem for scheme: abfss
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
> at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349)




