[jira] [Commented] (HUDI-127) [Good to do] Tidy up cWiki

2020-01-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009483#comment-17009483
 ] 

Vinoth Chandar commented on HUDI-127:
-

With the 
[https://cwiki.apache.org/confluence/display/HUDI/Design+And+Architecture] 
page, we are in good shape here. The cWiki has no partial pages at the moment. 

> [Good to do] Tidy up cWiki
> --
>
> Key: HUDI-127
> URL: https://issues.apache.org/jira/browse/HUDI-127
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.5.1
>
>
> * More blog content
>  * Finished technical docs
>  * People page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-127) [Good to do] Tidy up cWiki

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-127.
---
Resolution: Fixed

> [Good to do] Tidy up cWiki
> --
>
> Key: HUDI-127
> URL: https://issues.apache.org/jira/browse/HUDI-127
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.5.1
>
>
> * More blog content
>  * Finished technical docs
>  * People page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-127) [Good to do] Tidy up cWiki

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-127:

Status: In Progress  (was: Open)

> [Good to do] Tidy up cWiki
> --
>
> Key: HUDI-127
> URL: https://issues.apache.org/jira/browse/HUDI-127
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.5.1
>
>
> * More blog content
>  * Finished technical docs
>  * People page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571486297
 
 
   @bvaradar @lamber-ken I am still getting this error after pulling the latest 
changes and testing. I am using the Hudi Spark data source.
   
   ```
   20/01/07 12:37:42 ERROR io.HoodieCommitArchiveLog: Failed to archive 
commits, .commit file: 20191229022905.clean.requested
   java.io.IOException: Not an Avro data file
at 
org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
at 
org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:291)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:250)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:123)
at 
org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:485)
at 
org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
at 
org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
at 
org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
at 
org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1191: [HUDI-503] Add hudi test suite documentation into the README file of the test suite module

2020-01-07 Thread GitBox
yanghua commented on issue #1191: [HUDI-503] Add hudi test suite documentation 
into the README file of the test suite module
URL: https://github.com/apache/incubator-hudi/pull/1191#issuecomment-571541880
 
 
   cc @n3nash 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1114: [HUDI-438] Merge duplicated code fragment

2020-01-07 Thread GitBox
hddong commented on issue #1114: [HUDI-438] Merge duplicated code fragment
URL: https://github.com/apache/incubator-hudi/pull/1114#issuecomment-571518957
 
 
   @leesf @nsivabalan thanks for your review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] Panxing4game edited a comment on issue #1189: [HUDI-376]: AWS Glue dependency issue for EMR 5.28.0

2020-01-07 Thread GitBox
Panxing4game edited a comment on issue #1189: [HUDI-376]: AWS Glue dependency 
issue for EMR 5.28.0
URL: https://github.com/apache/incubator-hudi/pull/1189#issuecomment-571518371
 
 
   > Thanks for opening this PR @Panxing4game ! You might only need to modify 
s3_filesystem.md and s3_filesystem.cn.md, and no need to modify the html file.
   
   Oh... My bad...
   @leesf 
I just checked the site documentation and updated the PR.
   Thanks for the review!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] Panxing4game commented on issue #1189: [HUDI-376]: AWS Glue dependency issue for EMR 5.28.0

2020-01-07 Thread GitBox
Panxing4game commented on issue #1189: [HUDI-376]: AWS Glue dependency issue 
for EMR 5.28.0
URL: https://github.com/apache/incubator-hudi/pull/1189#issuecomment-571518371
 
 
   > Thanks for opening this PR @Panxing4game ! You might only need to modify 
s3_filesystem.md and s3_filesystem.cn.md, and no need to modify the html file.
   
   Oh... My bad... I just checked the site documentation and updated the PR.
   Thanks for the review!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] Panxing4game commented on issue #1189: [HUDI-376]: AWS Glue dependency issue for EMR 5.28.0

2020-01-07 Thread GitBox
Panxing4game commented on issue #1189: [HUDI-376]: AWS Glue dependency issue 
for EMR 5.28.0
URL: https://github.com/apache/incubator-hudi/pull/1189#issuecomment-571520147
 
 
   I think I could translate this s3 cn.md page to Chinese as well. 
   Maybe in another ticket :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571496082
 
 
   @lamber-ken  I ran the Spark job after 
[HUDI-308](https://github.com/apache/incubator-hudi/pull/1009) was merged; it created 
many 0-byte files in the .hoodie folder and started failing. I rebuilt Hudi after 
rebasing with master, but it is still not able to read those 0-byte files.
   
   Below is the screenshot of the files in .hoodie folder
   
   ![Screenshot 2020-01-07 at 2 20 57 
PM](https://user-images.githubusercontent.com/5235591/71881820-1b69f280-3159-11ea-99b3-687286652f05.png)
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-07 Thread GitBox
hddong commented on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-571521331
 
 
   @bvaradar Operation type is stored in the Avro objects when archiving, but 
there is an error here: it throws `ClassCastException: 
org.apache.avro.generic.GenericData$EnumSymbol cannot be cast to 
org.apache.hudi.avro.model.WriteOperationType` in the `deepCopy` block of 'show 
archived commit stats'. Can you give any suggestions when you're free?
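   For reference, a common way around this kind of generic-vs-specific Avro enum mismatch is to convert through the symbol's name instead of casting; a minimal sketch is below (the class and the `operationType` field name are illustrative, not the PR's actual code):
   ```
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.avro.model.WriteOperationType;

public class OperationTypeReadSketch {
  // Records read back with a generic datum reader carry enums as
  // GenericData.EnumSymbol, so deepCopy's direct cast fails; converting
  // through the symbol's string name avoids the ClassCastException.
  public static WriteOperationType readOperationType(GenericRecord archivedRecord) {
    Object value = archivedRecord.get("operationType"); // placeholder field name
    if (value == null) {
      return null;
    }
    if (value instanceof GenericData.EnumSymbol) {
      return WriteOperationType.valueOf(value.toString());
    }
    return (WriteOperationType) value;
  }
}
   ```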


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2020-01-07 Thread GitBox
yanghua commented on issue #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-571537763
 
 
   @n3nash OK, will try to review the whole test suite again to see if I can 
find some issues.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571490358
 
 
   > @bvaradar @lamber-ken I am still getting this error after pulling the 
latest changes and testing. I am using the Hudi Spark data source.
   > 
   > ```
   > 20/01/07 12:37:42 ERROR io.HoodieCommitArchiveLog: Failed to archive 
commits, .commit file: 20191229022905.clean.requested
   > java.io.IOException: Not an Avro data file
   >at 
org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
   >at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
   >at 
org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
   >at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:291)
   >at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:250)
   >at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:123)
   >at 
org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:485)
   >at 
org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
   >at 
org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
   >at 
org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
   >at 
org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
   >at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
   >at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
   >at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   >at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   >at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   >at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   >at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   >at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   >at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   >at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   >at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   >at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   >at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   >at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   > ```
   
   Hello @Guru107, some detailed info is needed (a small sketch of what that listing should surface follows below):
   1. Please share the output of `ll -a .hoodie`.
   2. Detailed steps to reproduce.
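   For illustration, the kind of thing that listing should surface is zero-byte instant files under `.hoodie`; a small local-filesystem sketch is below (on HDFS the equivalent check would be over `hdfs dfs -ls .hoodie` output):
   ```
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListEmptyInstants {
  public static void main(String[] args) throws IOException {
    // Print any zero-byte instant files (e.g. *.clean.requested) in .hoodie;
    // these are the files the archival step later fails to read as Avro.
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get(".hoodie"))) {
      for (Path p : stream) {
        if (Files.isRegularFile(p) && Files.size(p) == 0) {
          System.out.println("empty instant file: " + p.getFileName());
        }
      }
    }
  }
}
   ```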


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571499096
 
 
   > @lamber-ken I ran the Spark job after 
[HUDI-308](https://github.com/apache/incubator-hudi/pull/1009) was merged; it created 
many 0-byte files in the .hoodie folder and started failing. I rebuilt Hudi after 
rebasing with master, but it is still not able to read those 0-byte files.
   > 
   > Below is the screenshot of the files in .hoodie folder
   > 
   > ![Screenshot 2020-01-07 at 2 20 57 
PM](https://user-images.githubusercontent.com/5235591/71881820-1b69f280-3159-11ea-99b3-687286652f05.png)
   
   Hi @Guru107, can you check out the master branch directly, instead of using git 
rebase?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1191: [HUDI-503] Add hudi test suite documentation into the README file of the test suite module

2020-01-07 Thread GitBox
yanghua commented on a change in pull request #1191: [HUDI-503] Add hudi test 
suite documentation into the README file of the test suite module
URL: https://github.com/apache/incubator-hudi/pull/1191#discussion_r363690393
 
 

 ##
 File path: hudi-test-suite/README.md
 ##
 @@ -0,0 +1,291 @@
+
+
+This page describes in detail how to run end-to-end tests on a Hudi dataset, which 
helps improve our confidence 
+in a release as well as perform large-scale performance benchmarks.  
+
+# Objectives
+
+1. Test with different versions of core libraries and components such as 
`hdfs`, `parquet`, `spark`, 
+`hive` and `avro`.
+2. Generate different types of workloads across different dimensions such as 
`payload size`, `number of updates`, 
+`number of inserts`, `number of partitions`
+3. Perform multiple types of operations such as `insert`, `bulk_insert`, 
`upsert`, `compact`, `query`
+4. Support custom post process actions and validations
+
+# High Level Design
+
+The Hudi test suite runs as a long-running Spark job. The suite is divided 
into the following high-level components: 
+
+## Workload Generation
+
+This component does the work of generating the workload; `inserts`, `upserts` 
etc.
+
+## Workload Scheduling
+
+Depending on the type of workload generated, data is either ingested into the 
target hudi 
+dataset or the corresponding workload operation is executed. For example 
compaction does not necessarily need a workload
+to be generated/ingested but can require an execution.
+
+## Other actions/operations
+
+The test suite supports different types of operations besides ingestion such 
as Hive Query execution, Clean action etc.
+
+# Usage instructions
+
+
+## Entry class to the test suite
+
+```
+org.apache.hudi.bench.job.HudiTestSuiteJob.java - Entry Point of the hudi test 
suite job. This 
+class wraps all the functionalities required to run a configurable integration 
suite.
+```
+
+## Configurations required to run the job
+```
+org.apache.hudi.bench.job.HudiTestSuiteConfig - Config class that drives the 
behavior of the 
+integration test suite. This class extends from 
com.uber.hoodie.utilities.DeltaStreamerConfig. Look at 
+link#HudiDeltaStreamer page to learn about all the available configs 
applicable to your test suite.
+```
+
+## Generating a custom Workload Pattern
+```
+There are 2 ways to generate a workload pattern
+1. Programmatically
+Choose to write up the entire DAG of operations programmatically; take a look 
at the WorkflowDagGenerator class.
+Once you're ready with the DAG you want to execute, simply pass the class name 
as follows
+spark-submit
+...
+...
+--class org.apache.hudi.bench.job.HudiTestSuiteJob 
+--workload-generator-classname 
org.apache.hudi.bench.dag.scheduler.
+...
+2. YAML file
+Choose to write up the entire DAG of operations in YAML, take a look at 
complex-workload-dag-cow.yaml or 
+complex-workload-dag-mor.yaml.
+Once you're ready with the DAG you want to execute, simply pass the yaml file 
path as follows
+spark-submit
+...
+...
+--class org.apache.hudi.bench.job.HudiTestSuiteJob 
+--workload-yaml-path /path/to/your-workflow-dag.yaml
+...
+```
 
 Review comment:
   Thanks for your suggestion. I have addressed it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-508) Standardize on using "Table" instead of "Dataset"

2020-01-07 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-508:
---

 Summary: Standardize on using "Table" instead of "Dataset"
 Key: HUDI-508
 URL: https://issues.apache.org/jira/browse/HUDI-508
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Code Cleanup
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-508) Standardize on using "Table" instead of "Dataset"

2020-01-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-508:

Labels: pull-request-available  (was: )

> Standardize on using "Table" instead of "Dataset"
> -
>
> Key: HUDI-508
> URL: https://issues.apache.org/jira/browse/HUDI-508
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-509) Rename "views" into "query types" according to cWiki

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-509:

Status: Open  (was: New)

> Rename "views" into "query types" according to cWiki
> 
>
> Key: HUDI-509
> URL: https://issues.apache.org/jira/browse/HUDI-509
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-510) Update site documentation in sync with cWiki

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-510:

Status: Open  (was: New)

> Update site documentation in sync with cWiki
> 
>
> Key: HUDI-510
> URL: https://issues.apache.org/jira/browse/HUDI-510
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-508) Standardize on using "Table" instead of "Dataset"

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-508:

Status: Open  (was: New)

> Standardize on using "Table" instead of "Dataset"
> -
>
> Key: HUDI-508
> URL: https://issues.apache.org/jira/browse/HUDI-508
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-510) Update site documentation in sync with cWiki

2020-01-07 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-510:
---

 Summary: Update site documentation in sync with cWiki
 Key: HUDI-510
 URL: https://issues.apache.org/jira/browse/HUDI-510
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Docs
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar opened a new pull request #1197: [WIP] [HUDI-508] Standardizing on "Table" instead of "Dataset" across code

2020-01-07 Thread GitBox
vinothchandar opened a new pull request #1197: [WIP] [HUDI-508] Standardizing 
on "Table" instead of "Dataset" across code
URL: https://github.com/apache/incubator-hudi/pull/1197
 
 
- Docs were talking about storage types before; cWiki moved to "Table"
- Most of the code already has HoodieTable, HoodieTableMetaClient - correct 
naming
- Replacing/renaming uses of "dataset" across code/comments
- A few usages in comments and the use of Spark SQL Dataset remain untouched
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
- Docs were talking about storage types before; cWiki moved to "Table"
- Most of the code already has HoodieTable, HoodieTableMetaClient - correct 
naming
- Replacing/renaming uses of "dataset" across code/comments
- A few usages in comments and the use of Spark SQL Dataset remain untouched
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-507) Support \t split hdfs source

2020-01-07 Thread liujinhui (Jira)
liujinhui created HUDI-507:
--

 Summary: Support \t split hdfs source
 Key: HUDI-507
 URL: https://issues.apache.org/jira/browse/HUDI-507
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Utilities
Reporter: liujinhui
 Fix For: 0.5.1


Hi Hudi,

 

The current Hudi data source does not support splitting HDFS file data with a \t 
separator. I want to implement this and contribute it to the community.
The main change is the addition of a TextDFSSource class to provide this support.
The new logic is: split the HDFS data according to the delimiter, then 
map it to the source.avsc schema.

 

Other separator formats could also be supported as an extension.
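A rough sketch of the splitting idea (hypothetical; the class name, positional field mapping, and string-only values are placeholders, not the actual TextDFSSource implementation):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class DelimitedLineToRecordSketch {
  // Split a delimited line and map the values positionally onto the target
  // Avro schema (e.g. the schema loaded from source.avsc).
  public static GenericRecord toRecord(String line, String delimiter, Schema schema) {
    String[] values = line.split(delimiter, -1); // -1 keeps trailing empty fields
    GenericRecord record = new GenericData.Record(schema);
    for (int i = 0; i < schema.getFields().size() && i < values.length; i++) {
      record.put(schema.getFields().get(i).name(), values[i]);
    }
    return record;
  }
}
{code}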

thanks,

liujh

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-509) Rename "views" into "query types" according to cWiki

2020-01-07 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-509:
---

 Summary: Rename "views" into "query types" according to cWiki
 Key: HUDI-509
 URL: https://issues.apache.org/jira/browse/HUDI-509
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Code Cleanup
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1197: [HUDI-508] Standardizing on "Table" instead of "Dataset" across code

2020-01-07 Thread GitBox
vinothchandar commented on issue #1197: [HUDI-508] Standardizing on "Table" 
instead of "Dataset" across code
URL: https://github.com/apache/incubator-hudi/pull/1197#issuecomment-571693255
 
 
   @n3nash can you please review


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1159: [WIP][HUDI-479] Eliminate or Minimize use of Guava if possible

2020-01-07 Thread GitBox
vinothchandar commented on a change in pull request #1159: [WIP][HUDI-479] 
Eliminate or Minimize use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#discussion_r363820025
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/ValidationUtils.java
 ##
 @@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+/*
+ * Simple utility to test validation conditions
+ */
+public class ValidationUtils {
+
+  /**
+   * Ensures the truth of an expression.
+   */
+  public static void checkArgument(final boolean expression) {
+if (!expression) {
+  throw new IllegalArgumentException();
+}
+  }
+
+  /**
+   * Ensures the truth of an expression, throwing the custom errorMessage 
otherwise.
+   */
+  public static void checkArgument(final boolean expression, final String 
errorMessage) {
+if (!expression) {
+  throw new IllegalArgumentException(errorMessage);
+}
+  }
+
+  /**
+   * Ensures the truth of an expression involving the state of the calling 
instance, but not
+   * involving any parameters to the calling method.
+   *
+   * @param expression a boolean expression
+   * @throws IllegalStateException if {@code expression} is false
+   */
+  public static void checkState(final boolean expression) {
 
 Review comment:
   This seems the same as checkArgument? 
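   For context, the usual distinction (as the javadoc above hints) is the exception type; a sketch of what the body presumably looks like, mirroring Guava's contract rather than this PR's exact code:
   ```
  public static void checkState(final boolean expression) {
    if (!expression) {
      // IllegalStateException rather than IllegalArgumentException: the
      // object's state is invalid, not an argument passed by the caller.
      throw new IllegalStateException();
    }
  }
   ```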


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1159: [WIP][HUDI-479] Eliminate or Minimize use of Guava if possible

2020-01-07 Thread GitBox
vinothchandar commented on a change in pull request #1159: [WIP][HUDI-479] 
Eliminate or Minimize use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#discussion_r363820681
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
 ##
 @@ -67,4 +69,45 @@ public static String toHexString(byte[] bytes) {
   public static boolean isNullOrEmpty(String str) {
 return str == null || str.length() == 0;
   }
+
+  /**
+   * Returns the given string if it is non-null; the empty string otherwise.
+   *
+   * @param string the string to test and possibly return
+   * @return {@code string} itself if it is non-null; {@code ""} if it is null
+   */
+  public static String nullToEmpty(@Nullable String string) {
+return string == null ? "" : string;
+  }
+
+  /**
+   * Returns the given string if it is nonempty; {@code null} otherwise.
+   *
+   * @param string the string to test and possibly return
+   * @return {@code string} itself if it is nonempty; {@code null} if it is 
empty or null
+   */
+  public static @Nullable String emptyToNull(@Nullable String string) {
+return stringIsNullOrEmpty(string) ? null : string;
+  }
+
+  public static boolean stringIsNullOrEmpty(@Nullable String string) {
 
 Review comment:
   just `isNullOrEmpty()` for brevity?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1159: [WIP][HUDI-479] Eliminate or Minimize use of Guava if possible

2020-01-07 Thread GitBox
vinothchandar commented on a change in pull request #1159: [WIP][HUDI-479] 
Eliminate or Minimize use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#discussion_r363821619
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java
 ##
 @@ -67,4 +69,45 @@ public static String toHexString(byte[] bytes) {
   public static boolean isNullOrEmpty(String str) {
 return str == null || str.length() == 0;
   }
+
+  /**
+   * Returns the given string if it is non-null; the empty string otherwise.
+   *
+   * @param string the string to test and possibly return
+   * @return {@code string} itself if it is non-null; {@code ""} if it is null
+   */
+  public static String nullToEmpty(@Nullable String string) {
+return string == null ? "" : string;
+  }
+
+  /**
+   * Returns the given string if it is nonempty; {@code null} otherwise.
+   *
+   * @param string the string to test and possibly return
+   * @return {@code string} itself if it is nonempty; {@code null} if it is 
empty or null
+   */
+  public static @Nullable String emptyToNull(@Nullable String string) {
+return stringIsNullOrEmpty(string) ? null : string;
+  }
+
+  public static boolean stringIsNullOrEmpty(@Nullable String string) {
+return string == null || string.isEmpty();
+  }
+
+  /**
+   * Convert a Signed Hash to Hex.
+   * @param bytes - hashed bytes
+   * @return Hex representation of hash
+   */
+  public static String bytesToHexString(final byte[] bytes) {
 
 Review comment:
   Realized that this class needs a unit test.. :/ 
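   A minimal sketch of such a test (assuming JUnit 4 and only the helpers visible in this diff; `TestStringUtils` is an illustrative name):
   ```
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.apache.hudi.common.util.StringUtils;
import org.junit.Test;

public class TestStringUtils {

  @Test
  public void testNullToEmpty() {
    assertEquals("", StringUtils.nullToEmpty(null));
    assertEquals("hudi", StringUtils.nullToEmpty("hudi"));
  }

  @Test
  public void testEmptyToNull() {
    assertNull(StringUtils.emptyToNull(""));
    assertNull(StringUtils.emptyToNull(null));
    assertEquals("hudi", StringUtils.emptyToNull("hudi"));
  }
}
   ```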


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1195: [HUDI-319] Add a new maven profile to generate unified Javadoc for all Java and Scala classes

2020-01-07 Thread GitBox
vinothchandar commented on a change in pull request #1195: [HUDI-319] Add a new 
maven profile to generate unified Javadoc for all Java and Scala classes
URL: https://github.com/apache/incubator-hudi/pull/1195#discussion_r363823740
 
 

 ##
 File path: pom.xml
 ##
 @@ -938,6 +940,99 @@
 
org.apache.hudi.
   
 
+
+  unijavadoc
 
 Review comment:
   rename to just `javadocs` ? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1195: [HUDI-319] Add a new maven profile to generate unified Javadoc for all Java and Scala classes

2020-01-07 Thread GitBox
vinothchandar commented on issue #1195: [HUDI-319] Add a new maven profile to 
generate unified Javadoc for all Java and Scala classes
URL: https://github.com/apache/incubator-hudi/pull/1195#issuecomment-571655264
 
 
   We can merge once we resolve the naming... Also, please add a line to the 
`README` building section on how to generate the javadocs using the 
profile.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1197: [HUDI-508] Standardizing on "Table" instead of "Dataset" across code

2020-01-07 Thread GitBox
n3nash commented on a change in pull request #1197: [HUDI-508] Standardizing on 
"Table" instead of "Dataset" across code
URL: https://github.com/apache/incubator-hudi/pull/1197#discussion_r363901294
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHDFSParquetImporter.java
 ##
 @@ -85,7 +85,7 @@ public static void cleanupClass() throws Exception {
* Test successful data import with retries.
*/
   @Test
-  public void testDatasetImportWithRetries() throws Exception {
+  public void testImportWithRetries() throws Exception {
 
 Review comment:
   testTableImportWithRetries ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

2020-01-07 Thread GitBox
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to 
delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363910472
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
   // Convert to RDD[HoodieKey]
   val keyGenerator = 
DataSourceUtils.createKeyGenerator(toProperties(parameters))
   val genericRecords: RDD[GenericRecord] = 
AvroConversionUtils.createRdd(df, structName, nameSpace)
-  val hoodieKeysToDelete = genericRecords.map(gr => 
keyGenerator.getKey(gr)).toJavaRDD()
+  val hoodieKeysToDelete = if 
(HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP)))
 {
 
 Review comment:
   I think we should just create an entirely new keyGenerator, since tables with a 
GlobalIndex can still be partitioned; all we are doing here is allowing just the 
record_keys to be passed in order to delete those records.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

2020-01-07 Thread GitBox
bschell commented on a change in pull request #1194: [HUDI-326] Add support to 
delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363869991
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
   // Convert to RDD[HoodieKey]
   val keyGenerator = 
DataSourceUtils.createKeyGenerator(toProperties(parameters))
   val genericRecords: RDD[GenericRecord] = 
AvroConversionUtils.createRdd(df, structName, nameSpace)
-  val hoodieKeysToDelete = genericRecords.map(gr => 
keyGenerator.getKey(gr)).toJavaRDD()
+  val hoodieKeysToDelete = if 
(HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP)))
 {
 
 Review comment:
   @n3nash thanks for bringing that keyGenerator to my attention. That better 
serves my use case. Do you think extending it into a "GlobalDeleteKeyGenerator" 
would be a valid change instead?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

2020-01-07 Thread GitBox
bschell commented on a change in pull request #1194: [HUDI-326] Add support to 
delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363869991
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
   // Convert to RDD[HoodieKey]
   val keyGenerator = 
DataSourceUtils.createKeyGenerator(toProperties(parameters))
   val genericRecords: RDD[GenericRecord] = 
AvroConversionUtils.createRdd(df, structName, nameSpace)
-  val hoodieKeysToDelete = genericRecords.map(gr => 
keyGenerator.getKey(gr)).toJavaRDD()
+  val hoodieKeysToDelete = if 
(HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP)))
 {
 
 Review comment:
   @n3nash thanks for bringing that keyGenerator to my attention. That better 
serves my use case. Do you think extending it into a "GlobalDeleteKeyGenerator" 
would be a valid change instead? Or possibly creating an entirely new 
keyGenerator for global deletes instead, because the nonpartitionedKeyGenerator 
would not work if you are using the complex key generator.
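   A standalone sketch of the key-construction idea being discussed (not wired into the real `KeyGenerator` hierarchy; the field handling is a placeholder): for a global-index delete only the record key matters, so the partition path can be left empty.
   ```
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieKey;

public class GlobalDeleteKeySketch {
  // Build a HoodieKey from record-key fields only; the partition path is left
  // empty because a global index does not need it to locate the record.
  public static HoodieKey keyForDelete(GenericRecord record, String... recordKeyFields) {
    StringBuilder recordKey = new StringBuilder();
    for (String field : recordKeyFields) {
      if (recordKey.length() > 0) {
        recordKey.append(",");
      }
      recordKey.append(field).append(":").append(record.get(field));
    }
    return new HoodieKey(recordKey.toString(), "");
  }
}
   ```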


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2020-01-07 Thread GitBox
umehrot2 commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, 
migrate to spark-avro library instead of databricks-avro, add support for 
Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-571730182
 
 
   > @umehrot2 are you still driving this? We would like to merge this asap, 
giving us enough time for the next release to be cut..
   > 
   > cc @leesf
   @vinothchandar just got back from my time off a couple of days ago. Let me 
catch up on this PR and try to get it merged soon. When are we targeting to cut 
the next release?
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #1175: [HUDI-495] Update deprecated HBase API

2020-01-07 Thread GitBox
zhedoubushishi commented on a change in pull request #1175: [HUDI-495] Update 
deprecated HBase API
URL: https://github.com/apache/incubator-hudi/pull/1175#discussion_r363919247
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -287,13 +289,10 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   hbaseConnection = getHBaseConnection();
 }
   }
-  HTable hTable = null;
-  try {
-hTable = (HTable) 
hbaseConnection.getTable(TableName.valueOf(tableName));
+  try (BufferedMutator mutator = 
hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
 
 Review comment:
   Yes I think so. 
   The HBase API doc shows:
   ```
   flushCommits()
   Deprecated. 
   as of 1.0.0. Replaced by BufferedMutator.flush()
   ```
   Also HTable is not threadsafe. 
(https://issues.apache.org/jira/browse/HBASE-17361)
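   For reference, the replacement pattern in a self-contained form (illustrative names; the try-with-resources closes the mutator, which also flushes pending mutations):
   ```
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;

public class BufferedMutatorSketch {
  // Write a batch of Puts via BufferedMutator instead of the deprecated
  // HTable.flushCommits(); flush() forces out the buffered mutations.
  public static void writeBatch(Connection conn, String tableName, List<Put> puts) throws IOException {
    try (BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf(tableName))) {
      for (Put put : puts) {
        mutator.mutate(put);
      }
      mutator.flush();
    }
  }
}
   ```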


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-01-07 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010031#comment-17010031
 ] 

Yanjia Gary Li commented on HUDI-494:
-

[~vinoth] Thanks for the feedback. The code snippets were prepared by 
[~lamber-ken], and the dataset that has this issue was partitioned by 
year/month/day/hour. The behavior I observed was that the path 
*/.hoodie/.temp/20200101/year=2020/month=1/day=1/hour=00* has a ton of 
files. 

For my dataset, I calculate the parallelism based on the input data size. I set 
*bulkInsertParallelism = inputSizeInMB / 100*, which was 6 for my 6TB 
dataset. 

The *upsertParallelism* was *10*, based on the input size, when I ran this upsert 
job. 

I will reproduce this once I get the chance and provide more details. 
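For illustration only, the sizing heuristic above works out roughly like this (the numbers are illustrative, not the actual job's values):

{code:java}
public class ParallelismSizing {
  public static void main(String[] args) {
    long inputSizeInMB = 6L * 1000 * 1000;            // ~6 TB of input, illustrative
    long bulkInsertParallelism = inputSizeInMB / 100; // target ~100 MB per task
    System.out.println(bulkInsertParallelism);        // prints 60000
  }
}
{code}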

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Vinoth Chandar
>Priority: Major
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writing the files into 
> HDFS. It seems like related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with 10 parallelisms. 
> I am seeing a huge amount of 0 byte files being written into .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes less than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> somewhere trigger the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #1175: [HUDI-495] Update deprecated HBase API

2020-01-07 Thread GitBox
zhedoubushishi commented on a change in pull request #1175: [HUDI-495] Update 
deprecated HBase API
URL: https://github.com/apache/incubator-hudi/pull/1175#discussion_r363919247
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
 ##
 @@ -287,13 +289,10 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
   hbaseConnection = getHBaseConnection();
 }
   }
-  HTable hTable = null;
-  try {
-hTable = (HTable) 
hbaseConnection.getTable(TableName.valueOf(tableName));
+  try (BufferedMutator mutator = 
hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
 
 Review comment:
   Yes I think so. 
   The HTable API doc shows:
   ```
   flushCommits()
   Deprecated. 
   as of 1.0.0. Replaced by BufferedMutator.flush()
   ```
   Also HTable is not threadsafe. 
(https://issues.apache.org/jira/browse/HBASE-17361)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1185: [HUDI-500] Use enum method to replace switch case

2020-01-07 Thread GitBox
bvaradar commented on issue #1185: [HUDI-500] Use enum method to replace switch 
case
URL: https://github.com/apache/incubator-hudi/pull/1185#issuecomment-571760972
 
 
   Thanks @dengziming. I agree with @vinothchandar to keep abstraction layers 
smaller and well defined to preserve orthogonality. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571765113
 
 
   @Guru107: Can you double-confirm that the change ( 
https://github.com/apache/incubator-hudi/pull/1128/files ) is included in the 
run which created the clean.requested files?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1197: [HUDI-508] Standardizing on "Table" instead of "Dataset" across code

2020-01-07 Thread GitBox
vinothchandar commented on a change in pull request #1197: [HUDI-508] 
Standardizing on "Table" instead of "Dataset" across code
URL: https://github.com/apache/incubator-hudi/pull/1197#discussion_r363944418
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHDFSParquetImporter.java
 ##
 @@ -85,7 +85,7 @@ public static void cleanupClass() throws Exception {
* Test successful data import with retries.
*/
   @Test
-  public void testDatasetImportWithRetries() throws Exception {
+  public void testImportWithRetries() throws Exception {
 
 Review comment:
   I found that redundant.. So removed it :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types

2020-01-07 Thread GitBox
vinothchandar commented on issue #1005: [HUDI-91][HUDI-12]Migrate to spark 
2.4.4, migrate to spark-avro library instead of databricks-avro, add support 
for Decimal/Date types
URL: https://github.com/apache/incubator-hudi/pull/1005#issuecomment-571764759
 
 
   My suggestion is to freeze code by the 15th, test the RC for a week, and cut one 
by late January / the first week of February. @leesf is the release manager though, so he can 
share plans.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1197: [HUDI-508] Standardizing on "Table" instead of "Dataset" across code

2020-01-07 Thread GitBox
vinothchandar commented on issue #1197: [HUDI-508] Standardizing on "Table" 
instead of "Dataset" across code
URL: https://github.com/apache/incubator-hudi/pull/1197#issuecomment-571766001
 
 
   Yes, there are separate sub-tasks under 
https://jira.apache.org/jira/browse/HUDI-334 
   Views => queries is next. Will do docs once with these; hopefully we can 
retire the old site as well by then, so I can just update once.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (8306f74 -> 9706f65)

2020-01-07 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 8306f74  [HUDI-417] Refactor HoodieWriteClient so that commit logic 
can be shareable by both bootstrap and normal write operations (#1166)
 add 9706f65  [HUDI-508] Standardizing on "Table" instead of "Dataset" 
across code (#1197)

No new revisions were added by this update.

Summary of changes:
 .../main/java/org/apache/hudi/cli/HoodieCLI.java   |   4 +-
 .../java/org/apache/hudi/cli/HoodiePrompt.java |   2 +-
 .../apache/hudi/cli/commands/CommitsCommand.java   |   8 +-
 .../hudi/cli/commands/CompactionCommand.java   |   6 +-
 .../cli/commands/HDFSParquetImportCommand.java |  12 +-
 .../apache/hudi/cli/commands/RepairsCommand.java   |   2 +-
 .../{DatasetsCommand.java => TableCommand.java}|  20 +--
 .../java/org/apache/hudi/HoodieReadClient.java |   6 +-
 .../java/org/apache/hudi/HoodieWriteClient.java|  20 +--
 .../apache/hudi/index/bloom/HoodieBloomIndex.java  |   2 +-
 .../hudi/index/bloom/HoodieGlobalBloomIndex.java   |   4 +-
 .../io/compact/HoodieRealtimeTableCompactor.java   |   4 +-
 .../BoundedPartitionAwareCompactionStrategy.java   |   2 +-
 hudi-client/src/test/java/HoodieClientExample.java |   2 +-
 .../apache/hudi/common/HoodieClientTestUtils.java  |   6 +-
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |   6 +-
 .../apache/hudi/table/TestMergeOnReadTable.java|   2 +-
 .../hudi/common/model/HoodieAvroPayload.java   |   2 +-
 .../hudi/common/model/HoodieRecordPayload.java |   2 +-
 .../common/model/HoodieRollingStatMetadata.java|   2 +-
 .../hudi/common/table/HoodieTableConfig.java   |   8 +-
 .../hudi/common/table/HoodieTableMetaClient.java   |  20 +--
 .../apache/hudi/common/table/HoodieTimeline.java   |   2 +-
 .../table/timeline/HoodieActiveTimeline.java   |   2 +-
 .../table/timeline/HoodieArchivedTimeline.java |   2 +-
 .../hudi/common/table/timeline/HoodieInstant.java  |   2 +-
 .../table/view/AbstractTableFileSystemView.java|   2 +-
 .../common/table/view/FileSystemViewManager.java   |  22 +--
 .../view/RemoteHoodieTableFileSystemView.java  |   4 +-
 .../table/view/RocksDbBasedFileSystemView.java |   4 +-
 .../hudi/common/util/RocksDBSchemaHelper.java  |   2 +-
 .../apache/hudi/exception/HoodieIOException.java   |   2 +-
 ...etException.java => InvalidTableException.java} |   8 +-
 ...dException.java => TableNotFoundException.java} |  20 +--
 .../apache/hudi/common/model/HoodieTestUtils.java  |   2 +-
 .../table/view/TestIncrementalFSViewSync.java  |   2 +-
 .../hudi/hadoop/HoodieParquetInputFormat.java  |  10 +-
 .../hudi/hadoop/HoodieROTablePathFilter.java   |  10 +-
 .../realtime/HoodieParquetRealtimeInputFormat.java |   2 +-
 .../apache/hudi/hadoop/InputFormatTestUtil.java|   8 +-
 .../apache/hudi/hadoop/TestHoodieInputFormat.java  |  10 +-
 .../hudi/hadoop/TestRecordReaderValueIterator.java |   2 +-
 .../realtime/TestHoodieRealtimeRecordReader.java   |  10 +-
 .../java/org/apache/hudi/hive/HiveSyncConfig.java  |   2 +-
 .../java/org/apache/hudi/hive/HiveSyncTool.java|  10 +-
 .../org/apache/hudi/hive/HoodieHiveClient.java |  18 +--
 .../java/org/apache/hudi/hive/util/SchemaUtil.java |   2 +-
 .../org/apache/hudi/hive/TestHiveSyncTool.java |  28 ++--
 .../test/java/org/apache/hudi/hive/TestUtil.java   |   4 +-
 .../main/java/org/apache/hudi/DataSourceUtils.java |   6 +-
 .../org/apache/hudi/HoodieDataSourceHelpers.java   |   2 +-
 .../org/apache/hudi/payload/AWSDmsAvroPayload.java |   2 +-
 .../scala/org/apache/hudi/DataSourceOptions.scala  |   6 +-
 .../main/scala/org/apache/hudi/DefaultSource.scala |   2 +-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  10 +-
 .../org/apache/hudi/IncrementalRelation.scala  |   4 +-
 hudi-spark/src/test/java/HoodieJavaApp.java|   2 +-
 .../src/test/java/HoodieJavaStreamingApp.java  |   2 +-
 .../timeline/service/FileSystemViewHandler.java|   6 +-
 .../service/handlers/FileSliceHandler.java |   2 +-
 .../apache/hudi/utilities/HDFSParquetImporter.java |   8 +-
 .../hudi/utilities/HiveIncrementalPuller.java  |   2 +-
 .../org/apache/hudi/utilities/HoodieCleaner.java   |   2 +-
 .../hudi/utilities/HoodieCompactionAdminTool.java  |   2 +-
 .../org/apache/hudi/utilities/HoodieCompactor.java |   2 +-
 .../org/apache/hudi/utilities/UtilHelpers.java |   2 +-
 .../adhoc/UpgradePayloadFromUberToApache.java  |   2 +-
 .../hudi/utilities/deltastreamer/DeltaSync.java|   6 +-
 .../deltastreamer/HoodieDeltaStreamer.java |  10 +-
 .../schema/NullTargetSchemaRegistryProvider.java   |   2 +-
 .../hudi/utilities/schema/SchemaProvider.java  |   2 +-
 .../hudi/utilities/TestHDFSParquetImporter.java|   2 +-
 .../hudi/utilities/TestHoodieDeltaStreamer.java| 176 ++---
 73 files changed, 298 

[GitHub] [incubator-hudi] vinothchandar merged pull request #1197: [HUDI-508] Standardizing on "Table" instead of "Dataset" across code

2020-01-07 Thread GitBox
vinothchandar merged pull request #1197: [HUDI-508] Standardizing on "Table" 
instead of "Dataset" across code
URL: https://github.com/apache/incubator-hudi/pull/1197
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-397) Normalize log print statement

2020-01-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010096#comment-17010096
 ] 

Nishith Agarwal commented on HUDI-397:
--

[~yanghua] I agree, this was introduced by me during tests to be able to grep 
logs easily while debugging. Let's refactor this to the usual logging, without 
the "–" 

> Normalize log print statement
> -
>
> Key: HUDI-397
> URL: https://issues.apache.org/jira/browse/HUDI-397
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Priority: Major
>
> In the test suite module, there are many logging statements that look like 
> this pattern:
> {code:java}
> log.info(String.format("- inserting input data %s 
> --", this.getName()));
> {code}
> IMO, it's not a good design. We need to refactor it.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-433) Improve the way log block magic header is identified when a corrupt block is encountered #416

2020-01-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010098#comment-17010098
 ] 

Nishith Agarwal commented on HUDI-433:
--

Yes, we should close this one.

> Improve the way log block magic header is identified when a corrupt block is 
> encountered #416
> -
>
> Key: HUDI-433
> URL: https://issues.apache.org/jira/browse/HUDI-433
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> h1. Improve the way log block magic header is identified when a corrupt block 
> is encountered #416



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-433) Improve the way log block magic header is identified when a corrupt block is encountered #416

2020-01-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010098#comment-17010098
 ] 

Nishith Agarwal edited comment on HUDI-433 at 1/7/20 9:20 PM:
--

Yes, we should close this one. For some reason, I'm unable to open, close or 
perform any action on Jira tickets, not sure why


was (Author: nagarwal):
Yes, we should close this one.

> Improve the way log block magic header is identified when a corrupt block is 
> encountered #416
> -
>
> Key: HUDI-433
> URL: https://issues.apache.org/jira/browse/HUDI-433
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> h1. Improve the way log block magic header is identified when a corrupt block 
> is encountered #416



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-433) Improve the way log block magic header is identified when a corrupt block is encountered #416

2020-01-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010098#comment-17010098
 ] 

Nishith Agarwal edited comment on HUDI-433 at 1/7/20 9:20 PM:
--

[~vinoth]  Yes, we should close this one. For some reason, I'm unable to move 
status, close or perform any action on Jira tickets, not sure why


was (Author: nagarwal):
Yes, we should close this one. For some reason, I'm unable to move status, 
close or perform any action on Jira tickets, not sure why

> Improve the way log block magic header is identified when a corrupt block is 
> encountered #416
> -
>
> Key: HUDI-433
> URL: https://issues.apache.org/jira/browse/HUDI-433
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> h1. Improve the way log block magic header is identified when a corrupt block 
> is encountered #416



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-41) Get rid of special casing Global Index for MOR rollback #394

2020-01-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010103#comment-17010103
 ] 

Nishith Agarwal commented on HUDI-41:
-

We do have some special casing 
[https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java#L429]
 here, will need to revisit before closing.

> Get rid of special casing Global Index for MOR rollback #394
> 
>
> Key: HUDI-41
> URL: https://issues.apache.org/jira/browse/HUDI-41
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Major
>
> https://github.com/uber/hudi/issues/394



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] zhedoubushishi commented on issue #1175: [HUDI-495] Update deprecated HBase API

2020-01-07 Thread GitBox
zhedoubushishi commented on issue #1175: [HUDI-495] Update deprecated HBase API
URL: https://github.com/apache/incubator-hudi/pull/1175#issuecomment-571788946
 
 
   I think in our case the doMutations method will always flush right after 
calling mutator.mutate(...). So each time doMutations is called, the buffer 
should already be empty. Therefore, if the mutations array is empty, we can 
just return without any flush operation.
   
   That's why I rewrote it like this:
   ```
   if (mutations.isEmpty()) {
     return;
   }
   ```
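   
   A minimal self-contained sketch of that pattern, assuming the standard HBase `BufferedMutator` API (the method and class names here are illustrative, not the exact ones in the Hudi index code):
   ```java
   import java.io.IOException;
   import java.util.List;

   import org.apache.hadoop.hbase.client.BufferedMutator;
   import org.apache.hadoop.hbase.client.Mutation;

   public class HBaseMutationExample {

     // The buffer is flushed right after every mutate() call below, so when the
     // incoming batch is empty there is nothing buffered and we can return early.
     static void doMutations(BufferedMutator mutator, List<Mutation> mutations) throws IOException {
       if (mutations.isEmpty()) {
         return;
       }
       mutator.mutate(mutations);
       mutator.flush();
     }
   }
   ```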


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1122: [HUDI-29]: Support hudi COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table current rows

2020-01-07 Thread GitBox
bvaradar commented on a change in pull request #1122: [HUDI-29]: Support hudi 
COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table 
current rows
URL: https://github.com/apache/incubator-hudi/pull/1122#discussion_r363992460
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -196,7 +199,58 @@ public Configuration getConf() {
 // ParquetInputFormat.setFilterPredicate(job, predicate);
 // clearOutExistingPredicate(job);
 // }
-return super.getRecordReader(split, job, reporter);
+
+final Path finalPath = ((FileSplit) split).getPath();
+FileSystem fileSystem = finalPath.getFileSystem(conf);
+FileStatus curFileStatus = fileSystem.getFileStatus(finalPath);
+
+HoodieTableMetaClient metadata;
+try {
+  metadata = getTableMetaClient(finalPath.getFileSystem(conf),
+  curFileStatus.getPath().getParent());
+} catch (DatasetNotFoundException | InvalidDatasetException e) {
+  LOG.info("Handling a non-hoodie path " + curFileStatus.getPath());
+  return super.getRecordReader(split, job, reporter);
+}
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("Hoodie Metadata initialized with completed commit Ts as :" + 
metadata);
+}
+String tableName = metadata.getTableConfig().getTableName();
+String mode = HoodieHiveUtil.readMode(Job.getInstance(job), tableName);
+
+if (HoodieHiveUtil.INCREMENTAL_SCAN_MODE.equals(mode)) {
+  return super.getRecordReader(split, job, reporter);
+} else {
+  List<String> partitions = FSUtils.getAllFoldersWithPartitionMetaFile(metadata.getFs(), metadata.getBasePath());
 
 Review comment:
   @cdmikechen : For each input file split, we are essentially listing all 
partitions. At the minimum, we should only list the partition where the input 
split is. You can get the relative partition path from (1) basePath and (2) 
fullPath of the input split and use FileSystemView. Even then, this would be 
slow and resource intensive. The better solution would be to use consolidated 
metadata, but that is not available yet. Is it possible to enable this new 
code path only for Compute Statistics, but keep regular select queries going 
through the original path?
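   
   As a hedged sketch of the "derive the relative partition path from (1) basePath and (2) the split path" suggestion (names here are illustrative; the real code would go through Hudi's FSUtils/FileSystemView):
   ```java
   import org.apache.hadoop.fs.Path;

   public class PartitionPathExample {

     // e.g. basePath = /warehouse/tbl, splitPath = /warehouse/tbl/2019/01/02/file.parquet
     // returns "2019/01/02"; returns "" for a non-partitioned table.
     static String relativePartitionPath(Path basePath, Path splitPath) {
       String base = Path.getPathWithoutSchemeAndAuthority(basePath).toString();
       String parent = Path.getPathWithoutSchemeAndAuthority(splitPath.getParent()).toString();
       if (parent.equals(base)) {
         return "";
       }
       // Assumes the split really lives under the base path, as it does for Hudi tables.
       return parent.substring(base.length() + 1);
     }
   }
   ```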


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf merged pull request #1189: [HUDI-376]: AWS Glue dependency issue for EMR 5.28.0

2020-01-07 Thread GitBox
leesf merged pull request #1189: [HUDI-376]: AWS Glue dependency issue for EMR 
5.28.0
URL: https://github.com/apache/incubator-hudi/pull/1189
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-376]: AWS Glue dependency issue for EMR 5.28.0 (#1189)

2020-01-07 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 6433e14  [HUDI-376]: AWS Glue dependency issue for EMR 5.28.0 (#1189)
6433e14 is described below

commit 6433e14bcd6e8036bdeeba9a2e8af273589511e9
Author: Xing Pan 
AuthorDate: Wed Jan 8 08:01:14 2020 +0800

[HUDI-376]: AWS Glue dependency issue for EMR 5.28.0 (#1189)
---
 content/s3_hoodie.html   | 1 -
 docs/s3_filesystem.cn.md | 5 +
 docs/s3_filesystem.md| 5 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/content/s3_hoodie.html b/content/s3_hoodie.html
index 8f64a4b..3bdec85 100644
--- a/content/s3_hoodie.html
+++ b/content/s3_hoodie.html
@@ -412,7 +412,6 @@ export 
HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
   org.apache.hadoop:hadoop-aws:2.7.3
 
 
-
 
 
 
diff --git a/docs/s3_filesystem.cn.md b/docs/s3_filesystem.cn.md
index 45dcbab..a848857 100644
--- a/docs/s3_filesystem.cn.md
+++ b/docs/s3_filesystem.cn.md
@@ -75,3 +75,8 @@ AWS hadoop libraries to add to our classpath
 
  - com.amazonaws:aws-java-sdk:1.10.34
  - org.apache.hadoop:hadoop-aws:2.7.3
+
+AWS glue data libraries are needed if AWS glue data is used
+
+ - com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
+ - com.amazonaws:aws-java-sdk-glue:1.11.475
diff --git a/docs/s3_filesystem.md b/docs/s3_filesystem.md
index 45dcbab..a848857 100644
--- a/docs/s3_filesystem.md
+++ b/docs/s3_filesystem.md
@@ -75,3 +75,8 @@ AWS hadoop libraries to add to our classpath
 
  - com.amazonaws:aws-java-sdk:1.10.34
  - org.apache.hadoop:hadoop-aws:2.7.3
+
+AWS glue data libraries are needed if AWS glue data is used
+
+ - com.amazonaws.glue:aws-glue-datacatalog-hive2-client:1.11.0
+ - com.amazonaws:aws-java-sdk-glue:1.11.475



[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1198: [MINOR] Remove old jekyll config file

2020-01-07 Thread GitBox
lamber-ken opened a new pull request #1198: [MINOR] Remove old jekyll config 
file
URL: https://github.com/apache/incubator-hudi/pull/1198
 
 
   ## What is the purpose of the pull request
   
   Remove the old Jekyll config file; `_config.yml` is unused in the master branch.
   
   ## Brief change log
   
 - *Remove old jekyll config file*
   
   ## Verify this pull request
   
   This pull request is code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (HUDI-444) Refactor the codes based on scala codestyle NullChecker rule

2020-01-07 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-444.
-
Resolution: Fixed

Fixed at master 313fab5fd1ef715f98a123d0e09f6010daacab68

> Refactor the codes based on scala codestyle NullChecker rule
> 
>
> Key: HUDI-444
> URL: https://issues.apache.org/jira/browse/HUDI-444
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Refactor the codes based on scala codestyle NullChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-417) Refactor HoodieWriteClient so that commit logic can be shareable by both bootstrap and normal write operations

2020-01-07 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010298#comment-17010298
 ] 

Nicholas Jiang commented on HUDI-417:
-

[~vbalaji] OK. I will try the Spark datasource.

> Refactor HoodieWriteClient so that commit logic can be shareable by both 
> bootstrap and normal write operations
> --
>
> Key: HUDI-417
> URL: https://issues.apache.org/jira/browse/HUDI-417
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> Basic Code Changes are present in the fork : 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap]
>  
> The current implementation of HoodieBootstrapClient has duplicate code for 
> committing bootstrap. 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-client/src/main/java/org/apache/hudi/bootstrap/HoodieBootstrapClient.java]
>  
>  
> We can have an independent PR which would move these commit functionality 
> from HoodieWriteClient to a new base class AbstractHoodieWriteClient which 
> HoodieBootstrapClient can inherit.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-472:

Fix Version/s: (was: 0.5.1)
   0.5.2

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: He ZongPing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-483) Fix unit test for Archiving to reflect empty instant files for requested commit/deltacommits

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-483:

Status: Open  (was: New)

> Fix unit test for Archiving to reflect empty instant files for requested 
> commit/deltacommits
> 
>
> Key: HUDI-483
> URL: https://issues.apache.org/jira/browse/HUDI-483
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Balaji Varadarajan
>Priority: Minor
> Fix For: 0.5.1
>
>
> This came up during review:
> [https://github.com/apache/incubator-hudi/pull/1128#discussion_r361734393]
> HoodieTestDataGenerator.createCommitFile() creates requested files with 
> proper commit metadata for test data generation. It needs to create an empty 
> file to reflect reality.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-439) Fix HoodieSparkSqlWriter wrt code refactoring

2020-01-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-439.

Resolution: Duplicate

HUDI-438

> Fix HoodieSparkSqlWriter wrt code refactoring
> -
>
> Key: HUDI-439
> URL: https://issues.apache.org/jira/browse/HUDI-439
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.1
>
>
> HoodieSparkSqlWriter has some common code paths for writes and deletes. When 
> I added support for deletes, it wasn't easy to share those code paths 
> because HoodieWriteClient has a generic type in Java, and Scala expects the 
> type to be declared. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] Guru107 edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
Guru107 edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571910084
 
 
   @bvaradar .aux folder is empty. Will there be any issue if I delete those 
empty files and let subsequent runs create proper clean files?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571919614
 
 
   @Guru107 : My bad, it should be empty with the new format. In this case, 
just deleting the empty .clean.requested files should work. 
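   
   For anyone doing that one-time cleanup, a hedged sketch using the Hadoop FileSystem API (the table path below is hypothetical; back up the .hoodie folder before running anything like this):
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class CleanRequestedCleanup {
     public static void main(String[] args) throws Exception {
       // Hypothetical table location; point this at your table's .hoodie folder.
       Path metaPath = new Path("hdfs:///tmp/hudi_table/.hoodie");
       FileSystem fs = metaPath.getFileSystem(new Configuration());
       for (FileStatus status : fs.listStatus(metaPath)) {
         // Only remove the zero-length *.clean.requested files left behind by the earlier bug.
         if (status.getPath().getName().endsWith(".clean.requested") && status.getLen() == 0) {
           System.out.println("Deleting empty instant file: " + status.getPath());
           fs.delete(status.getPath(), false);
         }
       }
     }
   }
   ```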


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
lamber-ken commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571919780
 
 
   > @bvaradar Yes, it is part of the code, I checked it. I think the problem 
came because, I ran the job when the bug was introduced, so it created many 
empty files. After the bug was fixed, I assume it only fixed by adding content 
back to `clean.requested` files (new `clean.requested` files are not empty), 
but there is no code to handle the empty files that were created due to the bug.
   
   You're right, this pr did not handle the empty files which were created due 
to that bug. Because most people haven't been affected by HUDI-308. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
lamber-ken edited a comment on issue #1128: [HUDI-453] Fix throw failed to 
archive commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571919780
 
 
   > @bvaradar Yes, it is part of the code, I checked it. I think the problem 
came because, I ran the job when the bug was introduced, so it created many 
empty files. After the bug was fixed, I assume it only fixed by adding content 
back to `clean.requested` files (new `clean.requested` files are not empty), 
but there is no code to handle the empty files that were created due to the bug.
   
   You're right, this pr did not handle the empty files which were created due 
to that bug, because most people haven't been affected by HUDI-308. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong edited a comment on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-07 Thread GitBox
hddong edited a comment on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-571921938
 
 
   > @bvaradar Operation type is stored in the avro objects when archiving, but 
there are a error here, it throw `ClassCastException: 
org.apache.avro.generic.GenericData$EnumSymbol cannot be cast to 
org.apache.hudi.avro.model.WriteOperationType` with `deepCopy` block in 'show 
archived commit stats' . Can you give any suggestion when you're free.
   
   @bvaradar I found it cause by 
[AVRO-1676](https://issues.apache.org/jira/browse/AVRO-1676) and fixed in 
AVRO-1.8.0. So can i roll back `operateType ` to String type?  It may cause 
other compatibility problem If upgrade avro.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-07 Thread GitBox
hddong commented on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-571921938
 
 
   > @bvaradar Operation type is stored in the avro objects when archiving, but 
there are a error here, it throw `ClassCastException: 
org.apache.avro.generic.GenericData$EnumSymbol cannot be cast to 
org.apache.hudi.avro.model.WriteOperationType` with `deepCopy` block in 'show 
archived commit stats' . Can you give any suggestion when you're free.
   
   @bvaradar I found it cause by AVRO-1676 and fixed in AVRO-1.8.0. So can i 
roll back `operateType ` to String type?  It may cause other compatibility 
problem If upgrade avro.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571910084
 
 
   @bvaradar .aux folder is empty


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571920527
 
 
   @Guru107 : The reason why there is no tooling to seamlessly fix this is 
because this was a bug that manifested in a non-release (master) branch as part 
of a PR merge and the fix is already merged. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong edited a comment on issue #1157: [HUDI-332]Add operation type (insert/upsert/bulkinsert/delete) to HoodieCommitMetadata

2020-01-07 Thread GitBox
hddong edited a comment on issue #1157: [HUDI-332]Add operation type 
(insert/upsert/bulkinsert/delete) to HoodieCommitMetadata
URL: https://github.com/apache/incubator-hudi/pull/1157#issuecomment-571921938
 
 
   > @bvaradar Operation type is stored in the avro objects when archiving, but 
there are a error here, it throw `ClassCastException: 
org.apache.avro.generic.GenericData$EnumSymbol cannot be cast to 
org.apache.hudi.avro.model.WriteOperationType` with `deepCopy` block in 'show 
archived commit stats' . Can you give any suggestion when you're free.
   
   @bvaradar I found this is caused by 
[AVRO-1676](https://issues.apache.org/jira/browse/AVRO-1676) and was fixed in 
AVRO-1.8.0. So can I roll back `operationType` to String type? Upgrading Avro 
may cause other compatibility problems.
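   
   A minimal sketch of that String-based fallback, reading the archived field defensively from the generic record (the field name and record shape are assumptions for illustration):
   ```java
   import org.apache.avro.generic.GenericRecord;

   public class OperationTypeReadExample {

     // Treat the archived field as a plain string instead of the generated
     // WriteOperationType enum, so older Avro versions (pre-1.8.0, see AVRO-1676)
     // never attempt the enum cast that fails in deepCopy().
     static String readOperationType(GenericRecord commitMetadata) {
       Object raw = commitMetadata.get("operationType");
       return raw == null ? null : raw.toString();
     }
   }
   ```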


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1122: [HUDI-29]: Support hudi COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table current rows

2020-01-07 Thread GitBox
bhasudha commented on a change in pull request #1122: [HUDI-29]: Support hudi 
COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table 
current rows
URL: https://github.com/apache/incubator-hudi/pull/1122#discussion_r364017471
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -196,7 +199,58 @@ public Configuration getConf() {
 // ParquetInputFormat.setFilterPredicate(job, predicate);
 // clearOutExistingPredicate(job);
 // }
-return super.getRecordReader(split, job, reporter);
+
+final Path finalPath = ((FileSplit) split).getPath();
+FileSystem fileSystem = finalPath.getFileSystem(conf);
+FileStatus curFileStatus = fileSystem.getFileStatus(finalPath);
+
+HoodieTableMetaClient metadata;
+try {
+  metadata = getTableMetaClient(finalPath.getFileSystem(conf),
+  curFileStatus.getPath().getParent());
+} catch (DatasetNotFoundException | InvalidDatasetException e) {
+  LOG.info("Handling a non-hoodie path " + curFileStatus.getPath());
+  return super.getRecordReader(split, job, reporter);
+}
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("Hoodie Metadata initialized with completed commit Ts as :" + 
metadata);
+}
+String tableName = metadata.getTableConfig().getTableName();
+String mode = HoodieHiveUtil.readMode(Job.getInstance(job), tableName);
+
+if (HoodieHiveUtil.INCREMENTAL_SCAN_MODE.equals(mode)) {
+  return super.getRecordReader(split, job, reporter);
+} else {
+  List<String> partitions = FSUtils.getAllFoldersWithPartitionMetaFile(metadata.getFs(), metadata.getBasePath());
 
 Review comment:
   I agree with @bvaradar. Storing some stats either in the partition metadata 
file or in consolidated metadata is the better approach. @cdmikechen, have you 
already explored what it takes to make this work only for the Compute 
Statistics code path? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-500) Use enum method to replace switch case in HoodieTableMetaClient

2020-01-07 Thread dengziming (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010238#comment-17010238
 ] 

dengziming commented on HUDI-500:
-

This is an invalid change; could someone help close it? Thank you.

> Use enum method to replace switch case in HoodieTableMetaClient
> ---
>
> Key: HUDI-500
> URL: https://issues.apache.org/jira/browse/HUDI-500
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: dengziming
>Assignee: dengziming
>Priority: Minor
>  Labels: pull-request-available, refactor, starter
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> 4 methods in HoodieTableMetaClient (maybe more) use `switch 
> (HoodieTableType)`, which can be substituted with enum method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-507) Support \ t split hdfs source

2020-01-07 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010247#comment-17010247
 ] 

liujinhui commented on HUDI-507:


[~vinoth]  _Please give me contributor permission. I sent an email before, but 
it was not processed. Thanks_

> Support \ t split hdfs source
> -
>
> Key: HUDI-507
> URL: https://issues.apache.org/jira/browse/HUDI-507
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liujinhui
>Priority: Minor
> Fix For: 0.5.1
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> hi,hudi
>  
> Current Hudi data source does not support HDFS file data splitting with \ t 
> separator
>  I want to complete it and contribute to the community.
>  The main change is the addition of the TextDFSSource class to provide 
> support.
>  The specific new logic is: split the hdfs data according to the delimiter, 
> and then map it to the source.avsc pattern
>  
> Or do some other symbol format as an extension
> thanks,
> liujh
>  
> [~vinoth]   Please help with suggestions
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (HUDI-507) Support \ t split hdfs source

2020-01-07 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-507:
---
Comment: was deleted

(was: [~vinoth]  _Please give me contributor permission. I sent an email 
before, but it was not processed. Thanks_)

> Support \ t split hdfs source
> -
>
> Key: HUDI-507
> URL: https://issues.apache.org/jira/browse/HUDI-507
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liujinhui
>Priority: Minor
> Fix For: 0.5.1
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> hi,hudi
>  
> Current Hudi data source does not support HDFS file data splitting with \ t 
> separator
>  I want to complete it and contribute to the community.
>  The main change is the addition of the TextDFSSource class to provide 
> support.
>  The specific new logic is: split the hdfs data according to the delimiter, 
> and then map it to the source.avsc pattern
>  
> Or do some other symbol format as an extension
> thanks,
> liujh
>  
> [~vinoth]   Please help with suggestions
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-450) Refactor the codes based on scala codestyle MagicNumberChecker rule

2020-01-07 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010268#comment-17010268
 ] 

lamber-ken commented on HUDI-450:
-

Hi, because the checkstyle rules are under discussion, please don't fix these 
checkstyle issues for now. 
I'll let you know later. :)

> Refactor the codes based on scala codestyle MagicNumberChecker rule
> ---
>
> Key: HUDI-450
> URL: https://issues.apache.org/jira/browse/HUDI-450
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: Zijie Lu
>Priority: Major
>
> Refactor the codes based on scala codestyle MagicNumberChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-447) Refactor the codes based on scala codestyle IfBraceChecker rule

2020-01-07 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010267#comment-17010267
 ] 

lamber-ken commented on HUDI-447:
-

Hi, because the checkstyle rules are under discussion, please don't fix these 
checkstyle issues for now; I'll let you know later. :)

> Refactor the codes based on scala codestyle IfBraceChecker rule
> ---
>
> Key: HUDI-447
> URL: https://issues.apache.org/jira/browse/HUDI-447
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: Zijie Lu
>Priority: Major
>
> Refactor the codes based on scala codestyle IfBraceChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-507) Support \ t split hdfs source

2020-01-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010281#comment-17010281
 ] 

Vinoth Chandar commented on HUDI-507:
-

I apologize, I probably missed some notification. You have perms now. This 
does seem like a good source to add! 
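
As a hedged sketch of the splitting piece such a TextDFSSource could build on (the class name, input path and plain Spark context here are illustrative assumptions, not the actual DeltaStreamer Source API; mapping the resulting fields onto source.avsc would follow this step):

{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DelimitedDfsSourceSketch {

  // Read delimiter-separated text from DFS and split each line into fields.
  // The -1 limit keeps trailing empty fields so column positions stay aligned with the schema.
  static JavaRDD<List<String>> readDelimited(JavaSparkContext jsc, String path, String delimiter) {
    return jsc.textFile(path).map(line -> Arrays.asList(line.split(delimiter, -1)));
  }

  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "delimited-source-sketch");
    readDelimited(jsc, "hdfs:///tmp/input/*.tsv", "\t").take(10).forEach(System.out::println);
    jsc.stop();
  }
}
{code}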

> Support \ t split hdfs source
> -
>
> Key: HUDI-507
> URL: https://issues.apache.org/jira/browse/HUDI-507
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liujinhui
>Priority: Minor
> Fix For: 0.5.1
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> hi,hudi
>  
> Current Hudi data source does not support HDFS file data splitting with \ t 
> separator
>  I want to complete it and contribute to the community.
>  The main change is the addition of the TextDFSSource class to provide 
> support.
>  The specific new logic is: split the hdfs data according to the delimiter, 
> and then map it to the source.avsc pattern
>  
> Or do some other symbol format as an extension
> thanks,
> liujh
>  
> [~vinoth]   Please help with suggestions
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1122: [HUDI-29]: Support hudi COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table current rows

2020-01-07 Thread GitBox
vinothchandar commented on issue #1122: [HUDI-29]: Support hudi COW table to 
use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table current rows
URL: https://github.com/apache/incubator-hudi/pull/1122#issuecomment-571864785
 
 
   cc @n3nash as well. Let's ensure we don't have a regression here.
   @bvaradar @n3nash could we port the patch we had at Uber for this, or adapt 
this similarly? I think this is a really good one to get done though.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #153

2020-01-07 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.17 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[GitHub] [incubator-hudi] Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
Guru107 commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571891243
 
 
   @bvaradar Yes, it is part of the code, I checked it. I think the problem 
came because I ran the job when the bug was introduced, so it created many 
empty files. After the bug was fixed, I assume it was only fixed by adding 
content back to `clean.requested` files, but there is no code to handle the 
empty files that were created by the bug.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-493) Add docs for delete support in Hudi client apis

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-493:

Status: Open  (was: New)

> Add docs for delete support in Hudi client apis
> ---
>
> Key: HUDI-493
> URL: https://issues.apache.org/jira/browse/HUDI-493
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.5.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-506) Broken/Wrong links in new website

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-506:
---

Assignee: lamber-ken

> Broken/Wrong links in new website
> -
>
> Key: HUDI-506
> URL: https://issues.apache.org/jira/browse/HUDI-506
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.5.1
>
>
> Few issues.
> 1. Under Quickstart -> set up spark shell
>  "Data-generator" links to QuickStartUtils? Is that intended?
> 2. Under Quickstart -> Insert data
> "Modelling data stored in hudi" should it not link to 
> "https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi;.
>  Why is it linking to the general FAQ page? 
> 3. Under Quickstart -> Update data
> "commit" links to "concepts" page. Is that intended? 
> 4. Link to "file a jira" is taking to summary page in hudi. Should we fix it 
> to launch "create new ticket" with some fields (like labels or tags as 
> needed) auto populated so that we can track them. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-506) Broken/Wrong links in new website

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-506:

Status: Open  (was: New)

> Broken/Wrong links in new website
> -
>
> Key: HUDI-506
> URL: https://issues.apache.org/jira/browse/HUDI-506
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.5.1
>
>
> Few issues.
> 1. Under Quickstart -> set up spark shell
>  "Data-generator" links to QuickStartUtils? Is that intended?
> 2. Under Quickstart -> Insert data
> "Modelling data stored in hudi" should it not link to 
> "https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#FAQ-HowdoImodelthedatastoredinHudi;.
>  Why is it linking to the general FAQ page? 
> 3. Under Quickstart -> Update data
> "commit" links to "concepts" page. Is that intended? 
> 4. Link to "file a jira" is taking to summary page in hudi. Should we fix it 
> to launch "create new ticket" with some fields (like labels or tags as 
> needed) auto populated so that we can track them. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-86) Add indexing support to the log file format

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-86:
---
Fix Version/s: (was: 0.5.1)
   0.6.0

> Add indexing support to the log file format
> ---
>
> Key: HUDI-86
> URL: https://issues.apache.org/jira/browse/HUDI-86
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Index, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Fix For: 0.6.0
>
>
> https://github.com/apache/incubator-hudi/pull/519



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2020-01-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010372#comment-17010372
 ] 

Vinoth Chandar commented on HUDI-238:
-

Hi are you still working on this? 

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release  Administrative, Usability
>Reporter: Balaji Varadarajan
>Assignee: Tadas Sugintas
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- 

[jira] [Created] (HUDI-512) Decouple logical partitioning from physical one.

2020-01-07 Thread Alexander Filipchik (Jira)
Alexander Filipchik created HUDI-512:


 Summary: Decouple logical partitioning from physical one. 
 Key: HUDI-512
 URL: https://issues.apache.org/jira/browse/HUDI-512
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Common Core
Reporter: Alexander Filipchik


This one is more inspirational, but, I believe, will be very useful. Currently 
hudi is following Hive table format, which means that data is logically and 
physically partitioned into folder structure like:

table_name

  2019

    01

    02

       bla.parquet

 

This has several issues:

 1) Modern object stores (AWS S3, GCP) are more performant when each file name 
starts with some kind of a random value. By definition the Hive layout is not 
perfect.

2) Hive Metastore stores partitions in the text field in the single table (2 
tables with very similar information) and doesn't support proper filtering. 
Data partitioned by day will be stored like:

2019/01/10

2019/01/11

so only regexp queries are supported (at least in Hive 2.X.X)

3) Having a single point of failure which relies on a non-distributed DB is 
dangerous and creates bottlenecks. 

 

The idea is to get rid of logical partitioning altogether. If a dataset has a 
time column, users should be able to query it without understanding the 
physical layout of the table (by specifying those partitions explicitly or 
ending up with a full table scan accidentally).

It will require some kind of mapping of time to file locations (similar to 
Iceberg). I'm also leaning towards the idea that storing table metadata with 
the table is a good thing, as it can be read by the engine in one shot and 
will be faster than taxing a standalone metastore. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
bvaradar commented on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571898125
 
 
   @Guru107 : The corresponding non-empty clean.requested is under .aux folder. 
 As a one time operation, can you copy the  files to .hoodie/


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-512) Decouple logical partitioning from physical one.

2020-01-07 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-512:
-
Description: 
This one is more inspirational, but, I believe, will be very useful. Currently 
hudi is following Hive table format, which means that data is logically and 
physically partitioned into folder structure like:

table_name

  2019

    01

    02

       bla.parquet

 

This has several issues:

 1) Modern object stores (AWS S3, GCP) are more performant when each file name 
starts with some kind of a random value. By definition the Hive layout is not 
perfect.

2) Hive Metastore stores partitions in the text field in the single table (2 
tables with very similar information) and doesn't support proper filtering. 
Data partitioned by day will be stored like:

2019/01/10

2019/01/11

so only regexp queries are supported (at least in Hive 2.X.X)

3) Having a single point of failure which relies on a non-distributed DB is 
dangerous and creates bottlenecks. 

 

The idea is to get rid of logical partitioning altogether (and the Hive 
metastore as well). If a dataset has a time column, users should be able to 
query it without understanding the physical layout of the table (by specifying 
those partitions explicitly or ending up with a full table scan accidentally).

It will require some kind of mapping of time to file locations (similar to 
Iceberg). I'm also leaning towards the idea that storing table metadata with 
the table is a good thing, as it can be read by the engine in one shot and 
will be faster than taxing a standalone metastore. 

  was:
This one is more inspirational, but, I believe, will be very useful. Currently 
hudi is following Hive table format, which means that data is logically and 
physically partitioned into folder structure like:

table_name

  2019

    01

    02

       bla.parquet

 

This has several issues:

 1) Modern object stores (AWS S3, GCP) are more performant when each file name 
starts with some kind of a random value. By definition the Hive layout is not 
perfect

2) Hive Metastore stores partitions in the text field in the single table (2 
tables with very similar information) and doesn't support proper filtering. 
Data partitioned by day will be stored like:

2019/01/10

2019/01/11

so only regexp queries are supported (at least in Hive 2.X.X)

3) Having a single point of failure which relies on a non-distributed DB is dangerous and 
creates bottlenecks. 

 

The idea is to get rid of logical partitioning altogether. If a dataset has a 
time column, the user should be able to query it without understanding what the 
physical layout of the table is (instead of having to specify those partitions 
explicitly or accidentally ending up with a full table scan).

It will require some kind of mapping of time to file locations (similar to 
Iceberg). I'm also leaning towards the idea that storing table metadata with 
the table is a good thing as it can be read by the engine in one shot and will 
be faster than taxing a standalone metastore. 


> Decouple logical partitioning from physical one. 
> -
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: features
>
> This one is more inspirational, but, I believe, will be very useful. 
> Currently hudi is following Hive table format, which means that data is 
> logically and physically partitioned into folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file name 
> starts with some kind of a random value. By definition the Hive layout is not 
> perfect
> 2) Hive Metastore stores partitions in the text field in the single table (2 
> tables with very similar information) and doesn't support proper filtering. 
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X)
> 3) Having a single point of failure which relies on a non-distributed DB is 
> dangerous and creates bottlenecks. 
>  
> The idea is to get rid of logical partitioning altogether (and the Hive 
> metastore as well). If a dataset has a time column, the user should be able to 
> query it without understanding what the physical layout of the table is (instead 
> of having to specify those partitions explicitly or accidentally ending up with 
> a full table scan).
> It will require some kind of mapping of time to file locations (similar to 
> Iceberg). I'm also leaning towards the idea that storing table metadata with 
> the table is a good thing as it can be read by the engine in one shot and 
> will be faster than taxing a standalone metastore.

[jira] [Resolved] (HUDI-440) Rework the hudi web site

2020-01-07 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-440.
-
Resolution: Fixed

Fixed at asf-site 312711dd220f5ffaeecfe711f9011b651ded72a2

> Rework the hudi web site
> 
>
> Key: HUDI-440
> URL: https://issues.apache.org/jira/browse/HUDI-440
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Rework the hudi web site: the old web site is based on the jekyll-doc[1] theme,
> which is no longer active; replace it with minimal-mistakes[2], which is very 
> popular and 100% free.
>  
>  
> [1] https://github.com/tomjoht/documentation-theme-jekyll
> [2] https://github.com/mmistakes/minimal-mistakes
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-507) Support \ t split hdfs source

2020-01-07 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-507:
---
Description: 
hi,hudi

 

The current Hudi data source does not support HDFS file data splitting with a \ t 
separator.
 I want to complete it and contribute it to the community.
 The main change is the addition of the TextDFSSource class to provide support.
 The specific new logic is: split the HDFS data according to the delimiter, and 
then map it to the source.avsc schema.

 

Other separator symbols could also be supported as an extension.

thanks,

liujh

 

[~vinoth]
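
For illustration, a minimal sketch of the mapping step described above: split one tab-delimited line and fill an Avro record. The schema and field names here are made up for the example; the real code would load source.avsc and the delimiter from configuration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Illustrative sketch of the proposed TextDFSSource mapping step: split a
// tab-delimited line and map the fields onto an Avro schema. The field names
// below are assumptions for the example, not the real source.avsc contents.
public class TabDelimitedToAvro {

  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"source\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"ts\",\"type\":\"long\"}]}";

  private static final Schema SCHEMA = new Schema.Parser().parse(SCHEMA_JSON);

  public static GenericRecord fromLine(String line) {
    // -1 keeps trailing empty fields instead of silently dropping them
    String[] parts = line.split("\t", -1);
    GenericRecord record = new GenericData.Record(SCHEMA);
    record.put("id", parts[0]);
    record.put("name", parts[1]);
    record.put("ts", Long.parseLong(parts[2]));
    return record;
  }

  public static void main(String[] args) {
    System.out.println(fromLine("1\tfoo\t1578355200000"));
  }
}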

 

  was:
hi,hudi

 

The current Hudi data source does not support HDFS file data splitting with a \ t 
separator.
 I want to complete it and contribute it to the community.
 The main change is the addition of the TextDFSSource class to provide support.
 The specific new logic is: split the HDFS data according to the delimiter, and 
then map it to the source.avsc schema.

 

Other separator symbols could also be supported as an extension.

thanks,

liujh

 

@ino

 


> Support \ t split hdfs source
> -
>
> Key: HUDI-507
> URL: https://issues.apache.org/jira/browse/HUDI-507
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liujinhui
>Priority: Minor
> Fix For: 0.5.1
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> hi,hudi
>  
> The current Hudi data source does not support HDFS file data splitting with a \ t 
> separator.
>  I want to complete it and contribute it to the community.
>  The main change is the addition of the TextDFSSource class to provide 
> support.
>  The specific new logic is: split the HDFS data according to the delimiter, 
> and then map it to the source.avsc schema.
>  
> Other separator symbols could also be supported as an extension.
> thanks,
> liujh
>  
> [~vinoth]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-450) Refactor the codes based on scala codestyle MagicNumberChecker rule

2020-01-07 Thread Zijie Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zijie Lu reassigned HUDI-450:
-

Assignee: Zijie Lu

> Refactor the codes based on scala codestyle MagicNumberChecker rule
> ---
>
> Key: HUDI-450
> URL: https://issues.apache.org/jira/browse/HUDI-450
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: Zijie Lu
>Priority: Major
>
> Refactor the codes based on scala codestyle MagicNumberChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-447) Refactor the codes based on scala codestyle IfBraceChecker rule

2020-01-07 Thread Zijie Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zijie Lu reassigned HUDI-447:
-

Assignee: Zijie Lu

> Refactor the codes based on scala codestyle IfBraceChecker rule
> ---
>
> Key: HUDI-447
> URL: https://issues.apache.org/jira/browse/HUDI-447
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: lamber-ken
>Assignee: Zijie Lu
>Priority: Major
>
> Refactor the codes based on scala codestyle IfBraceChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-507) Support \ t split hdfs source

2020-01-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010282#comment-17010282
 ] 

Vinoth Chandar commented on HUDI-507:
-

There is a PR open for a CSV source.. Maybe we can see if sharing code is possible, 
or borrow the approach from that? 

> Support \ t split hdfs source
> -
>
> Key: HUDI-507
> URL: https://issues.apache.org/jira/browse/HUDI-507
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liujinhui
>Priority: Minor
> Fix For: 0.5.1
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> hi,hudi
>  
> The current Hudi data source does not support HDFS file data splitting with a \ t 
> separator.
>  I want to complete it and contribute it to the community.
>  The main change is the addition of the TextDFSSource class to provide 
> support.
>  The specific new logic is: split the HDFS data according to the delimiter, 
> and then map it to the source.avsc schema.
>  
> Other separator symbols could also be supported as an extension.
> thanks,
> liujh
>  
> [~vinoth]   Please help with suggestions
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-242) Support Efficient bootstrap of large parquet datasets to Hudi

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-242:

Fix Version/s: (was: 0.5.1)
   0.6.0

> Support Efficient bootstrap of large parquet datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
>  Support Efficient bootstrap of large parquet tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-289:

Fix Version/s: (was: 0.5.1)
   0.5.2

> Implement a test suite to support long running test for Hudi writing and 
> querying end-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.5.2
>
>
> We would need an equivalent of an end-to-end test which runs some workload for 
> a few hours at least, triggers various actions like commit, deltacommit, 
> rollback, compaction and ensures correctness of the code before every release
> P.S: Learn from all the CSS issues managing compaction..
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-322) DeltaSteamer should pick checkpoints off only deltacommits for MOR tables

2020-01-07 Thread Shahida Khan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010379#comment-17010379
 ] 

Shahida Khan commented on HUDI-322:
---

Yes! [~vinoth] 
I will need some more time. I know this should have been frozen by the 15th of this 
month, but I won't be able to complete it.

I am still trying to understand the code base and overall working of 
DeltaStreamer.
You can assign this to another person so that it can get completed before 
the release.

 

> DeltaSteamer should pick checkpoints off only deltacommits for MOR tables
> -
>
> Key: HUDI-322
> URL: https://issues.apache.org/jira/browse/HUDI-322
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer, newbie
>Reporter: Vinoth Chandar
>Assignee: Shahida Khan
>Priority: Major
> Fix For: 0.5.1
>
>
> When using DeltaStreamer with MOR, the checkpoints would be written out to 
> .deltacommit files (and not .commit files). We need to confirm the behavior 
> and change code such that it reads from the correct metadata file..  
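
A rough sketch of the intended behavior: resume from the checkpoint stored in the latest completed deltacommit rather than the commit timeline. The class below is only an illustration; the API calls and the checkpoint key follow my understanding of Hudi around 0.5.x and should be treated as assumptions.

import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;
import org.apache.hudi.common.util.Option;

// Sketch: for MOR tables, read the DeltaStreamer checkpoint off the latest
// completed *deltacommit*, not the latest .commit. Names are assumptions.
public class CheckpointFromDeltaCommit {
  // Assumed key used by DeltaStreamer to stash its checkpoint in commit metadata.
  private static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

  public static Option<String> latestCheckpoint(HoodieTableMetaClient metaClient) throws Exception {
    HoodieTimeline deltaCommits =
        metaClient.getActiveTimeline().getDeltaCommitTimeline().filterCompletedInstants();
    Option<HoodieInstant> last = deltaCommits.lastInstant();
    if (!last.isPresent()) {
      return Option.empty();
    }
    HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
        deltaCommits.getInstantDetails(last.get()).get(), HoodieCommitMetadata.class);
    return Option.ofNullable(metadata.getMetadata(CHECKPOINT_KEY));
  }
}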



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-403:

Status: In Progress  (was: Open)

> Publish a deployment guide talking about deployment options, upgrading etc
> --
>
> Key: HUDI-403
> URL: https://issues.apache.org/jira/browse/HUDI-403
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.5.1
>
>
> Things to cover 
>  # Upgrade readers first, Upgrade writers next, Principles of compatibility 
> followed
>  # DeltaStreamer Deployment models
>  # Scheduling Compactions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-288:

Fix Version/s: (was: 0.5.1)
   0.6.0

> Add support for ingesting multiple kafka streams in a single DeltaStreamer 
> deployment
> -
>
> Key: HUDI-288
> URL: https://issues.apache.org/jira/browse/HUDI-288
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://lists.apache.org/thread.html/3a69934657c48b1c0d85cba223d69cb18e18cd8aaa4817c9fd72cef6@
>  has all the context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1122: [HUDI-29]: Support hudi COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table current rows

2020-01-07 Thread GitBox
bhasudha commented on a change in pull request #1122: [HUDI-29]: Support hudi 
COW table to use *ANALYZE TABLE table_name COMMPUTE STATISTICS* to get table 
current rows
URL: https://github.com/apache/incubator-hudi/pull/1122#discussion_r364017471
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -196,7 +199,58 @@ public Configuration getConf() {
 // ParquetInputFormat.setFilterPredicate(job, predicate);
 // clearOutExistingPredicate(job);
 // }
-return super.getRecordReader(split, job, reporter);
+
+final Path finalPath = ((FileSplit) split).getPath();
+FileSystem fileSystem = finalPath.getFileSystem(conf);
+FileStatus curFileStatus = fileSystem.getFileStatus(finalPath);
+
+HoodieTableMetaClient metadata;
+try {
+  metadata = getTableMetaClient(finalPath.getFileSystem(conf),
+  curFileStatus.getPath().getParent());
+} catch (DatasetNotFoundException | InvalidDatasetException e) {
+  LOG.info("Handling a non-hoodie path " + curFileStatus.getPath());
+  return super.getRecordReader(split, job, reporter);
+}
+
+if (LOG.isDebugEnabled()) {
+  LOG.debug("Hoodie Metadata initialized with completed commit Ts as :" + 
metadata);
+}
+String tableName = metadata.getTableConfig().getTableName();
+String mode = HoodieHiveUtil.readMode(Job.getInstance(job), tableName);
+
+if (HoodieHiveUtil.INCREMENTAL_SCAN_MODE.equals(mode)) {
+  return super.getRecordReader(split, job, reporter);
+} else {
+  List<String> partitions = 
FSUtils.getAllFoldersWithPartitionMetaFile(metadata.getFs(), 
metadata.getBasePath());
 
 Review comment:
   I agree with @bvaradar. Storing some stats either in the partition metadata file or in a 
consolidated metadata file is the better approach. 
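
One way such stats could be obtained without scanning the data is from the parquet footers themselves; the sketch below is only an illustration of that idea (not code from this PR), assuming parquet-hadoop is on the classpath and summing per-row-group counts.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Row counts are already recorded in the parquet footers, so a stats-based
// ANALYZE path could aggregate them per file without reading the data pages.
public class ParquetRowCount {
  public static long rowCount(Configuration conf, Path parquetFile) throws Exception {
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, parquetFile);
    long rows = 0;
    for (BlockMetaData block : footer.getBlocks()) {
      rows += block.getRowCount(); // each row group records its own row count
    }
    return rows;
  }
}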


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-507) Support \ t split hdfs source

2020-01-07 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-507:
---
Description: 
hi,hudi

 

The current Hudi data source does not support HDFS file data splitting with a \ t 
separator.
 I want to complete it and contribute it to the community.
 The main change is the addition of the TextDFSSource class to provide support.
 The specific new logic is: split the HDFS data according to the delimiter, and 
then map it to the source.avsc schema.

 

Other separator symbols could also be supported as an extension.

thanks,

liujh

 

[~vinoth]   Please help with suggestions

 

  was:
hi,hudi

 

The current Hudi data source does not support HDFS file data splitting with a \ t 
separator.
 I want to complete it and contribute it to the community.
 The main change is the addition of the TextDFSSource class to provide support.
 The specific new logic is: split the HDFS data according to the delimiter, and 
then map it to the source.avsc schema.

 

Other separator symbols could also be supported as an extension.

thanks,

liujh

 

[~vinoth]

 


> Support \ t split hdfs source
> -
>
> Key: HUDI-507
> URL: https://issues.apache.org/jira/browse/HUDI-507
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liujinhui
>Priority: Minor
> Fix For: 0.5.1
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> hi,hudi
>  
> The current Hudi data source does not support HDFS file data splitting with a \ t 
> separator.
>  I want to complete it and contribute it to the community.
>  The main change is the addition of the TextDFSSource class to provide 
> support.
>  The specific new logic is: split the HDFS data according to the delimiter, 
> and then map it to the source.avsc schema.
>  
> Other separator symbols could also be supported as an extension.
> thanks,
> liujh
>  
> [~vinoth]   Please help with suggestions
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] Guru107 edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive commits error when writing data to MOR/COW table

2020-01-07 Thread GitBox
Guru107 edited a comment on issue #1128: [HUDI-453] Fix throw failed to archive 
commits error when writing data to MOR/COW table
URL: https://github.com/apache/incubator-hudi/pull/1128#issuecomment-571891243
 
 
   @bvaradar Yes, it is part of the code, I checked it. I think the problem 
came about because I ran the job while the bug was present, so it created many 
empty files. After the bug was fixed, I assume the fix only added content 
back to `clean.requested` files (new `clean.requested` files are not empty), 
but there is no code to handle the empty files that were created due to the bug.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

