[GitHub] [incubator-hudi] yanghua commented on issue #991: Hudi Test Suite (Refactor)

2019-11-27 Thread GitBox
yanghua commented on issue #991: Hudi Test Suite (Refactor) 
URL: https://github.com/apache/incubator-hudi/pull/991#issuecomment-559362785
 
 
   There are so many conflicts. I have tried to rebase with the master branch, 
however, it's hard to do this for multiple branches. So I would like to firstly 
squash the commits and rebase to fix conflicts. FYI, cc @n3nash @vinothchandar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua edited a comment on issue #991: Hudi Test Suite (Refactor)

2019-11-27 Thread GitBox
yanghua edited a comment on issue #991: Hudi Test Suite (Refactor) 
URL: https://github.com/apache/incubator-hudi/pull/991#issuecomment-559362785
 
 
   There are so many conflicts. I have tried to rebase with the master branch, 
however, it's hard to do this for multiple commits. So I would like to firstly 
squash the commits and rebase to fix conflicts. FYI, cc @n3nash @vinothchandar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
lamber-ken edited a comment on issue #1054: [HUDI-372] Support the shortName 
for Hudi DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#issuecomment-559321835
 
 
   I am trying to figure out why ci build failed, waiting.
   
   
![image](https://user-images.githubusercontent.com/20113411/69774231-4e22e180-11d0-11ea-92d1-5dbf39dd510e.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
lamber-ken commented on issue #1054: [HUDI-372] Support the shortName for Hudi 
DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#issuecomment-559332880
 
 
   hi, @vinothchandar it seems that the ci may has some bug, because the 
building fail reasons are different, WTDY?
   
   
https://travis-ci.org/apache/incubator-hudi/jobs/617982783?utm_medium=notification_source=github_status
   
   
https://travis-ci.org/apache/incubator-hudi/jobs/617720826?utm_medium=notification_source=github_status
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
lamber-ken edited a comment on issue #1054: [HUDI-372] Support the shortName 
for Hudi DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#issuecomment-559321835
 
 
   I am trying to find out why ci build failed, waiting.
   
   
![image](https://user-images.githubusercontent.com/20113411/69774231-4e22e180-11d0-11ea-92d1-5dbf39dd510e.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
lamber-ken commented on issue #1054: [HUDI-372] Support the shortName for Hudi 
DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#issuecomment-559321835
 
 
   I am trying to find out why ci build failed, waiting.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-373] Refactor hudi-client based on new ImportOrder code style rule (#1056)

2019-11-27 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new da8d133  [HUDI-373] Refactor hudi-client based on new ImportOrder code 
style rule (#1056)
da8d133 is described below

commit da8d1334eecd65f2896613b1143d6ba31bae03b7
Author: lamber-ken 
AuthorDate: Thu Nov 28 09:25:56 2019 +0800

[HUDI-373] Refactor hudi-client based on new ImportOrder code style rule 
(#1056)
---
 .../java/org/apache/hudi/AbstractHoodieClient.java |  8 ++--
 .../org/apache/hudi/CompactionAdminClient.java | 28 +++--
 .../java/org/apache/hudi/HoodieCleanClient.java| 12 +++---
 .../java/org/apache/hudi/HoodieReadClient.java | 11 +++--
 .../java/org/apache/hudi/HoodieWriteClient.java| 33 ---
 .../src/main/java/org/apache/hudi/WriteStatus.java |  9 ++--
 .../client/embedded/EmbeddedTimelineService.java   |  6 ++-
 .../org/apache/hudi/client/utils/ClientUtils.java  |  1 +
 .../apache/hudi/config/HoodieCompactionConfig.java | 13 +++---
 .../apache/hudi/config/HoodieHBaseIndexConfig.java |  3 +-
 .../org/apache/hudi/config/HoodieIndexConfig.java  |  7 +++-
 .../org/apache/hudi/config/HoodieMemoryConfig.java |  8 ++--
 .../apache/hudi/config/HoodieMetricsConfig.java|  6 ++-
 .../org/apache/hudi/config/HoodieWriteConfig.java  | 19 +
 .../apache/hudi/func/BulkInsertMapFunction.java|  6 ++-
 .../hudi/func/CopyOnWriteLazyInsertIterable.java   | 14 ---
 .../hudi/func/MergeOnReadLazyInsertIterable.java   |  7 ++--
 .../java/org/apache/hudi/func/OperationResult.java |  3 +-
 .../apache/hudi/func/ParquetReaderIterator.java|  6 ++-
 .../hudi/func/SparkBoundedInMemoryExecutor.java|  6 ++-
 .../java/org/apache/hudi/index/HoodieIndex.java|  4 +-
 .../org/apache/hudi/index/InMemoryHashIndex.java   | 12 +++---
 .../hudi/index/bloom/BloomIndexFileInfo.java   |  1 +
 .../bloom/BucketizedBloomCheckPartitioner.java | 10 +++--
 .../apache/hudi/index/bloom/HoodieBloomIndex.java  | 23 ++-
 .../index/bloom/HoodieBloomIndexCheckFunction.java |  9 ++--
 .../hudi/index/bloom/HoodieGlobalBloomIndex.java   | 17 
 .../hbase/DefaultHBaseQPSResourceAllocator.java|  1 +
 .../org/apache/hudi/index/hbase/HBaseIndex.java| 41 +-
 .../org/apache/hudi/io/HoodieAppendHandle.java | 22 +-
 .../java/org/apache/hudi/io/HoodieCleanHelper.java | 16 
 .../org/apache/hudi/io/HoodieCommitArchiveLog.java | 30 +++---
 .../org/apache/hudi/io/HoodieCreateHandle.java | 12 +++---
 .../java/org/apache/hudi/io/HoodieIOHandle.java|  3 +-
 .../org/apache/hudi/io/HoodieKeyLookupHandle.java  | 14 ---
 .../java/org/apache/hudi/io/HoodieMergeHandle.java | 20 +
 .../org/apache/hudi/io/HoodieRangeInfoHandle.java  |  3 +-
 .../java/org/apache/hudi/io/HoodieReadHandle.java  |  3 +-
 .../java/org/apache/hudi/io/HoodieWriteHandle.java | 14 ---
 .../apache/hudi/io/compact/HoodieCompactor.java|  8 ++--
 .../io/compact/HoodieRealtimeTableCompactor.java   | 32 ---
 .../strategy/BoundedIOCompactionStrategy.java  |  6 ++-
 .../BoundedPartitionAwareCompactionStrategy.java   |  8 ++--
 .../io/compact/strategy/CompactionStrategy.java| 10 +++--
 .../strategy/DayBasedCompactionStrategy.java   | 10 +++--
 .../LogFileSizeBasedCompactionStrategy.java|  9 ++--
 .../strategy/UnBoundedCompactionStrategy.java  |  3 +-
 .../UnBoundedPartitionAwareCompactionStrategy.java |  7 ++--
 .../hudi/io/storage/HoodieParquetConfig.java   |  3 +-
 .../hudi/io/storage/HoodieParquetWriter.java   | 16 
 .../hudi/io/storage/HoodieStorageWriter.java   |  6 ++-
 .../io/storage/HoodieStorageWriterFactory.java | 16 
 .../org/apache/hudi/metrics/HoodieMetrics.java |  5 ++-
 .../main/java/org/apache/hudi/metrics/Metrics.java |  8 ++--
 .../hudi/metrics/MetricsGraphiteReporter.java  |  8 ++--
 .../hudi/metrics/MetricsReporterFactory.java   |  3 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  | 37 +
 .../apache/hudi/table/HoodieMergeOnReadTable.java  | 26 ++--
 .../java/org/apache/hudi/table/HoodieTable.java| 26 ++--
 .../org/apache/hudi/table/RollbackExecutor.java| 27 ++--
 .../table/UserDefinedBulkInsertPartitioner.java|  1 +
 .../org/apache/hudi/table/WorkloadProfile.java | 11 +++--
 .../java/org/apache/hudi/table/WorkloadStat.java   |  5 ++-
 hudi-client/src/test/java/HoodieClientExample.java | 14 ---
 .../org/apache/hudi/HoodieClientTestHarness.java   | 24 ++-
 .../java/org/apache/hudi/TestAsyncCompaction.java  | 28 +++--
 .../src/test/java/org/apache/hudi/TestCleaner.java | 48 +++---
 .../java/org/apache/hudi/TestClientRollback.java   | 18 
 

[GitHub] [incubator-hudi] leesf merged pull request #1056: [HUDI-373] Refactor hudi-client based on new ImportOrder code style rule

2019-11-27 Thread GitBox
leesf merged pull request #1056: [HUDI-373] Refactor hudi-client based on new 
ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1056
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-372) Support the shortName for Hudi DataSource

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-372:

Summary: Support the shortName for Hudi DataSource  (was: Support the 
shortName for Hudi DataSouce)

> Support the shortName for Hudi DataSource
> -
>
> Key: HUDI-372
> URL: https://issues.apache.org/jira/browse/HUDI-372
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fix the shortName of DataSouce, after this issue, we can use this command 
> like this
> {code:java}
> spark.read.format("hudi")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
lamber-ken commented on a change in pull request #1054: [HUDI-372] Support the 
shortName for Hudi DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#discussion_r351552235
 
 

 ##
 File path: 
hudi-spark/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
 ##
 @@ -0,0 +1,19 @@
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+org.apache.hudi.DefaultSource
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
lamber-ken commented on a change in pull request #1054: [HUDI-372] Support the 
shortName for Hudi DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#discussion_r351551514
 
 

 ##
 File path: hudi-spark/src/test/scala/TestDataSource.scala
 ##
 @@ -63,6 +63,19 @@ class TestDataSource extends AssertionsForJUnit {
 fs = FSUtils.getFs(basePath, spark.sparkContext.hadoopConfiguration)
   }
 
+  @Test def testShotNameStorage() {
 
 Review comment:
   good catch


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-11-27 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984047#comment-16984047
 ] 

Raymond Xu commented on HUDI-309:
-

[~nicholasjiang] I think the design in "Description" is meant for treating 
 pair as an event (like a message consumed from Kafka) 
and then append it to some metadata log files, which will then be compacted 
into HFile for future query purpose. I think for the same active timeline, 
commitTime is the natural identifier for commit metadata. ([~vinoth] [~vbalaji] 
please correct me if I misunderstand it)

So far my concern for this design is in the action type partition: as the 
action types are fixed to a small number, the files under each type partition 
will keep growing. Eventually if too many files accumulated under a particular 
partition (say "commit/"), would that cause issue on scanning?

How about, as also [~nicholasjiang] points out, partitioning by "commitTime" 
(converted to "/MM/dd" or "/MM" or configurable) that would set certain 
upper bound to the number of files?

> General Redesign of Archived Timeline for efficient scan and management
> ---
>
> Key: HUDI-309
> URL: https://issues.apache.org/jira/browse/HUDI-309
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived 
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived Metadata should be scannable in the same way as data
>  # Provides more safety by always serving committed data independent of 
> timeframe when the corresponding commit action was tried. Currently, we 
> implicitly assume a data file to be valid if its commit time is older than 
> the earliest time in the active timeline. While this works ok, any inherent 
> bugs in rollback could inadvertently expose a possibly duplicate file when 
> its commit timestamp becomes older than that of any commits in the timeline.
>  # We had to deal with lot of corner cases because of the way we treat a 
> "commit" as special after it gets archived. Examples also include Savepoint 
> handling logic by cleaner.
>  # Small Files : For Cloud stores, archiving simply moves fils from one 
> directory to another causing the archive folder to grow. We need a way to 
> efficiently compact these files and at the same time be friendly to scans
> Design:
>  The basic file-group abstraction for managing file versions for data files 
> can be extended to managing archived commit metadata. The idea is to use an 
> optimal format (like HFile) for storing compacted version of  Metadata> pairs. Every archiving run will read  pairs 
> from active timeline and append to indexable log files. We will run periodic 
> minor compactions to merge multiple log files to a compacted HFile storing 
> metadata for a time-range. It should be also noted that we will partition by 
> the action types (commit/clean).  This design would allow for the archived 
> timeline to be queryable for determining whether a timeline is valid or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] tooptoop4 commented on issue #750: Adding support for optional skipping single archiving failures

2019-11-27 Thread GitBox
tooptoop4 commented on issue #750: Adding support for optional skipping single 
archiving failures
URL: https://github.com/apache/incubator-hudi/pull/750#issuecomment-559253002
 
 
   @RonBarabash  does this fix issue with 0 byte .commit file? also how are u 
running hudi with spark 2.4? did u change spark version reference in hudi 
pom.xm!?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-11-27 Thread GitBox
pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load 
to HUDI DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-559230742
 
 
   Hi @vinothchandar and @taherk77 
   I would like to add 2 points to this feature to make this very generic
   
   - [ ] We might need support for combination of incrementing columns. 
Incrementing columns can be of below types 
   1. Timestamp columns
   2. Auto Incrementing column
   3. Timestamp + Auto Incrementing.
   Instead of code figuring out the incremental pull strategy, it'll be better 
if user provides it as config for each table.
   Considering Timestamp incrementing column, there can be more than once 
column contributing to this strategy. e.g. When a row is creation, only 
`created_at` column is set and `updated_at` is null by default. When the same 
row is updated, `updated_at` gets assigned to some timestamp. In such cases it 
is wise to consider both columns in the query formation. 
   
   - [ ] We need to sort rows according to above mentioned incrementing columns 
to fetch rows in chunks (you can make use of `defaultFetchSize` in MySQL). I'm 
aware that sorting adds load on Database, but it helps in tracking the last 
pulled timestamp or auto incrementing id and help retry/resume from the point 
last recorded. This will be a saviour during failures.
   
   A sample MySQL query for incrementing timestamp columns as (`created_at` and 
`updated_at`)  might look like 
   `SELECT * FROM inventory.customers WHERE 
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > 
$last_recorder_time AND 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) < 
$current_time ORDER BY 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) ASC`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-11-27 Thread GitBox
pushpavanthar edited a comment on issue #969: [HUDI-251] JDBC incremental load 
to HUDI DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-559230742
 
 
   I would like to add 2 points to this feature to make this very generic
   
   - [ ] We might need support for combination of incrementing columns. 
Incrementing columns can be of below types 
   1. Timestamp columns
   2. Auto Incrementing column
   3. Timestamp + Auto Incrementing.
   Instead of code figuring out the incremental pull strategy, it'll be better 
if user provides it as config for each table.
   Considering Timestamp incrementing column, there can be more than once 
column contributing to this strategy. e.g. When a row is creation, only 
`created_at` column is set and `updated_at` is null by default. When the same 
row is updated, `updated_at` gets assigned to some timestamp. In such cases it 
is wise to consider both columns in the query formation. 
   
   - [ ] We need to sort rows according to above mentioned incrementing columns 
to fetch rows in chunks (you can make use of `defaultFetchSize` in MySQL). I'm 
aware that sorting adds load on Database, but it helps in tracking the last 
pulled timestamp or auto incrementing id and help retry/resume from the point 
last recorded. This will be a saviour during failures.
   
   A sample MySQL query for incrementing timestamp columns as (`created_at` and 
`updated_at`)  might look like 
   `SELECT * FROM inventory.customers WHERE 
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > 
$last_recorder_time AND 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) < 
$current_time ORDER BY 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) ASC`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pushpavanthar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-11-27 Thread GitBox
pushpavanthar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-559230742
 
 
   I would like to add 2 points to this feature to make this very generic
   
   - [ ] We might need support for combination of more than one incrementing 
columns. Incrementing columns can be of below types 
   1. Timestamp column
   2. Auto Incrementing column
   3. Timestamp + Auto Incrementing.
   Instead of code figuring out the incremental pull strategy, it'll be better 
if user provide it from config for each table.
   When accepting Timestamp incrementing column, there can be more than once 
columns contributing to this strategy. e.g. During a row is creation only 
`created_at` column is set and let's say `updated_at` is null by default. When 
the same row is updated, `updated_at` gets assigned to some timestamp. In such 
scenarios its wise to consider both columns in your query formation. 
   
   - [ ] We need to sort rows according to above mentioned incrementing columns 
to fetch rows in chunks (you can make use of `defaultFetchSize` for MySQL). I 
understand this adds load on Database, but this tracks the last pulled 
timestamp or auto incrementing column and helps retry from that point for 
consecutive batches. This will be a saviour during failures. 
   
   A sample MySQL query for incrementing timestamp columns as (`created_at` and 
`updated_at`)  might look like 
   `SELECT * FROM inventory.customers WHERE 
COALESCE(inventory.customers.updated_at, inventory.customers.created_at) > 
$last_recorder_time AND 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) < 
$current_time ORDER BY 
COALESCE(inventory.customers.updated_at,inventory.customers.created_at) ASC`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1021: how can i deal this problem when partition's value changed with the same row_key?

2019-11-27 Thread GitBox
nsivabalan commented on issue #1021: how can i deal this problem when 
partition's value changed with the same row_key? 
URL: https://github.com/apache/incubator-hudi/issues/1021#issuecomment-559224596
 
 
   @vinothchandar : I found the root cause.  
   
   within HoodieGlobalIndex#explodeRecordRDDWithFileComparisons
   
   ```
   JavaRDD> explodeRecordRDDWithFileComparisons(
 final Map> partitionToFileIndexInfo,
 JavaPairRDD partitionRecordKeyPairRDD) {
   .
   .
   .
return  partitionRecordKeyPairRDD.map(partitionRecordKeyPair -> {
String recordKey = partitionRecordKeyPair._2();
String partitionPath = partitionRecordKeyPair._1();
   
return indexFileFilter.getMatchingFiles(partitionPath, 
recordKey).stream()
 .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, 
indexToPartitionMap.get(file
 .collect(Collectors.toList());
   }).flatMap(List::iterator);
   ```
   
   
   
   In this, indexFileFilter.getMatchingFiles(partitionPath, recordKey) returns 
fileId from Partition1, where as incoming record is tagged with Partition2. 
   
   So, this is what I am thinking as the fix. as of now, 
IndexFileFilter.getMatchingFiles(String partitionPath, String recordKey) is 
returning Sets. Instead, IndexFileFilter.getMatchingFiles(String 
partitionPath, String recordKey) should return Set> 
and we should attach that as below.
   ```
   return partitionRecordKeyPairRDD.map(partitionRecordKeyPair -> {
 String recordKey = partitionRecordKeyPair._2();
 String partitionPath = partitionRecordKeyPair._1();
   
 return indexFileFilter.getMatchingFiles(partitionPath, 
recordKey).stream()
 .map( (origPartitionPath, matchingFile) -> new 
Tuple2<>(matchingFile, new HoodieKey(recordKey, origPartitionPath)))
 .collect(Collectors.toList());
   }).flatMap(List::iterator);
   ```
   
   But how do we intimate the user that these records are updated with 
Partition1 and not Partition2 as per incoming records in upsert call? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bschell commented on issue #1052: [HUDI-326] Add new index to suppport global update/delete

2019-11-27 Thread GitBox
bschell commented on issue #1052: [HUDI-326] Add new index to suppport global 
update/delete
URL: https://github.com/apache/incubator-hudi/pull/1052#issuecomment-559222366
 
 
   I see, I think this feature could instead possibly be refactored into 
HoodieBloomIndex instead then as the logic is compatible. My thinking was that 
a user would use this Index only to scan extra partitions for global 
update/delete but could then use the normal HoodieBloomIndex by default for the 
improved performance. But that is more of a view than an index then.
   
   As it stands the current implementation of HoodieGlobalBloomIndex doesn't 
really guarantee globally unique keys either if an insert contains duplicates. 
This causes weird behaviors when records with duplicates are updated which is 
the reason why we created this Index.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1052: [HUDI-326] Add new index to suppport global update/delete

2019-11-27 Thread GitBox
vinothchandar commented on issue #1052: [HUDI-326] Add new index to suppport 
global update/delete
URL: https://github.com/apache/incubator-hudi/pull/1052#issuecomment-559208611
 
 
   > I have seen the JIRA that wants to set the HoodieIndex as part of the 
table properties. This conflicts with the usecase here. Can you help me 
understand the motivation of locking down the index type?
   
   The thinking was that if someone writes few commits using `HoodieBloomIndex` 
and then later switches to `HoodieGlobalBloomIndex` for e.g , then it won't 
guarantee globally unique keys as he/she might expect, since the first set of 
commits may have duplicates across partitions. If you take HBaseIndex,  that's 
another clear example, user cannot just switch from HoodieBloomIndex to HBase 
without rebuilding the index first?
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-118) Hudi CLI : Provide options for passing properties to Compactor, Cleaner and ParquetImporter

2019-11-27 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983739#comment-16983739
 ] 

Balaji Varadarajan commented on HUDI-118:
-

[~Pratyaksh] 

If you look at the referenced PR :  
[https://github.com/apache/incubator-hudi/pull/691], there is a new command 
line parameter "--hoodie-conf" that is added. The purpose of this option is to 
override the properties that we get by reading the DFS properties file.  Both 
DFS properties file and the overridden properties are used to construct objects 
like *HoodieWriteClient*. 

 

There are some specific CLI commands in the files (CleansCommand, 
CompactionCommand, HDFSParquetImporterCommand) which invoke the corresponding 
utilities scripts touched in [https://github.com/apache/incubator-hudi/pull/691]

Along with "--hoodie-conf" argument, you might also have to add "–props" 
argument as they both are missing in the hoodie-cli counterpart.

 

To your question on HDFSParquetImporterCommand, HDFSParquetImporter (in 
hudi-utilities) has the following command line arguments

```

@Parameter(names = \{"--props"}, description = "path to properties file on 
localfs or dfs, with configurations for "
 + "hoodie client for importing")
public String propsFilePath = null;
@Parameter(names = \{"--hoodie-conf"}, description = "Any configuration that 
can be set in the properties file "
 + "(using the CLI parameter \"--propsFilePath\") can also be passed command 
line using this parameter")
public List configs = new ArrayList<>();

``` 

These arguments are being used to construct HoodieWriteClient

 

```

HoodieWriteClient client =
 UtilHelpers.createHoodieClient(jsc, cfg.targetPath, schemaStr, 
cfg.parallelism, Option.empty(), props);

```

We would need both --props and --hoodie-conf to be passed from 
HDFSParquetImportCommand.convert. 

 

 

 

 

> Hudi CLI : Provide options for passing properties to Compactor, Cleaner and 
> ParquetImporter 
> 
>
> Key: HUDI-118
> URL: https://issues.apache.org/jira/browse/HUDI-118
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI, newbie
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Minor
>
> For non-trivial CLI operations, we have a standalone script in hudi-utilities 
> that users can call directly using spark-submit (usually). We also have 
> commands in hudi-cli to invoke the commands directly from hudi-cli shell.
> There was an earlier effort to allow users to pass properties directly to the 
> scripts in hudi-utilities but we still need to give the same functionality to 
> the corresponding commands in hudi-cli.
> In hudi-cli, Compaction (schedule/compact), Cleaner and HDFSParquetImporter 
> command does not have option to pass DFS properties file. This is a followup 
> to PR [https://github.com/apache/incubator-hudi/pull/691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
vinothchandar commented on a change in pull request #1054: [HUDI-372] Support 
the shortName for Hudi DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#discussion_r351405009
 
 

 ##
 File path: 
hudi-spark/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
 ##
 @@ -0,0 +1,19 @@
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+org.apache.hudi.DefaultSource
 
 Review comment:
   nit: new line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1054: [HUDI-372] Support the shortName for Hudi DataSource

2019-11-27 Thread GitBox
vinothchandar commented on a change in pull request #1054: [HUDI-372] Support 
the shortName for Hudi DataSource
URL: https://github.com/apache/incubator-hudi/pull/1054#discussion_r351404770
 
 

 ##
 File path: hudi-spark/src/test/scala/TestDataSource.scala
 ##
 @@ -63,6 +63,19 @@ class TestDataSource extends AssertionsForJUnit {
 fs = FSUtils.getFs(basePath, spark.sparkContext.hadoopConfiguration)
   }
 
+  @Test def testShotNameStorage() {
 
 Review comment:
   typo: short


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-373) Refactor hudi-client based on new ImportOrder code style rule

2019-11-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-373:

Labels: pull-request-available  (was: )

>  Refactor hudi-client based on new ImportOrder code style rule
> --
>
> Key: HUDI-373
> URL: https://issues.apache.org/jira/browse/HUDI-373
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1056: [HUDI-373] Refactor hudi-client based on new ImportOrder code style rule

2019-11-27 Thread GitBox
lamber-ken opened a new pull request #1056: [HUDI-373] Refactor hudi-client 
based on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1056
 
 
   ## What is the purpose of the pull request
   
   Refactor hudi-client based on new ImportOrder code style rule
   
   ## Brief change log
   
 - Refactor hudi-client based on new ImportOrder code style rule.
   
   ## Verify this pull request
   
   This pull request is a code cleanup without any test coverage.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [x] CI is green
   
- [x] Necessary doc changes done or have another open PR
  
- [x] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (HUDI-366) Refactor some module codes based on new ImportOrder codestyle rule

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken resolved HUDI-366.
-
Resolution: Resolved

> Refactor some module codes based on new ImportOrder codestyle rule
> --
>
> Key: HUDI-366
> URL: https://issues.apache.org/jira/browse/HUDI-366
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Refactor some module codes based on new ImportOrder codestyle rule
>  
> hudi-hadoop-mr
> hudi-timeline-service
> hudi-spark
> hudi-integ-test
> hudi- utilities
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [HUDI-366] Refactor some module codes based on new ImportOrder code style rule (#1055)

2019-11-27 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f9139c0  [HUDI-366] Refactor some module codes based on new 
ImportOrder code style rule (#1055)
f9139c0 is described below

commit f9139c0f616775f4b3d0df95772f2621f0e7c9f1
Author: 谢磊 
AuthorDate: Wed Nov 27 21:32:43 2019 +0800

[HUDI-366] Refactor some module codes based on new ImportOrder code style 
rule (#1055)

[HUDI-366] Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / 
hudi-integ-test / hudi- utilities based on new ImportOrder code style rule
---
 .../hudi/hadoop/HoodieParquetInputFormat.java  | 34 +++--
 .../hudi/hadoop/HoodieROTablePathFilter.java   | 20 
 .../hudi/hadoop/RecordReaderValueIterator.java | 10 ++--
 .../hadoop/SafeParquetRecordReaderWrapper.java |  3 +-
 .../hadoop/hive/HoodieCombineHiveInputFormat.java  | 38 +++---
 .../realtime/AbstractRealtimeRecordReader.java | 34 +++--
 .../realtime/HoodieParquetRealtimeInputFormat.java | 46 +
 .../hadoop/realtime/HoodieRealtimeFileSplit.java   |  3 +-
 .../realtime/HoodieRealtimeRecordReader.java   |  6 ++-
 .../realtime/RealtimeCompactedRecordReader.java| 18 ---
 .../realtime/RealtimeUnmergedRecordReader.java | 20 
 .../apache/hudi/hadoop/InputFormatTestUtil.java| 28 +-
 .../org/apache/hudi/hadoop/TestAnnotation.java |  5 +-
 .../apache/hudi/hadoop/TestHoodieInputFormat.java  | 10 ++--
 .../hudi/hadoop/TestHoodieROTablePathFilter.java   | 16 +++---
 .../hudi/hadoop/TestRecordReaderValueIterator.java | 12 +++--
 .../realtime/TestHoodieRealtimeRecordReader.java   | 59 --
 .../java/org/apache/hudi/integ/ITTestBase.java | 18 ---
 .../org/apache/hudi/integ/ITTestHoodieDemo.java|  6 ++-
 .../org/apache/hudi/integ/ITTestHoodieSanity.java  |  1 +
 .../main/java/org/apache/hudi/BaseAvroPayload.java |  8 +--
 .../java/org/apache/hudi/ComplexKeyGenerator.java  |  8 +--
 .../main/java/org/apache/hudi/DataSourceUtils.java | 18 ---
 .../org/apache/hudi/HoodieDataSourceHelpers.java   | 10 ++--
 .../main/java/org/apache/hudi/KeyGenerator.java|  6 ++-
 .../apache/hudi/NonpartitionedKeyGenerator.java|  3 +-
 .../hudi/OverwriteWithLatestAvroPayload.java   | 10 ++--
 .../main/java/org/apache/hudi/QuickstartUtils.java | 18 ---
 .../java/org/apache/hudi/SimpleKeyGenerator.java   |  3 +-
 hudi-spark/src/test/java/DataSourceTestUtils.java  |  7 +--
 hudi-spark/src/test/java/HoodieJavaApp.java| 12 +++--
 .../src/test/java/HoodieJavaStreamingApp.java  | 18 ---
 .../timeline/service/FileSystemViewHandler.java| 24 +
 .../hudi/timeline/service/TimelineService.java | 16 +++---
 .../timeline/service/handlers/DataFileHandler.java |  8 +--
 .../service/handlers/FileSliceHandler.java | 12 +++--
 .../hudi/timeline/service/handlers/Handler.java|  6 ++-
 .../timeline/service/handlers/TimelineHandler.java | 10 ++--
 .../view/TestRemoteHoodieTableFileSystemView.java  |  1 +
 .../apache/hudi/utilities/HDFSParquetImporter.java | 43 
 .../hudi/utilities/HiveIncrementalPuller.java  | 32 ++--
 .../org/apache/hudi/utilities/HoodieCleaner.java   | 18 ---
 .../hudi/utilities/HoodieCompactionAdminTool.java  | 18 ---
 .../org/apache/hudi/utilities/HoodieCompactor.java | 16 +++---
 .../hudi/utilities/HoodieSnapshotCopier.java   | 25 +
 .../hudi/utilities/HoodieWithTimelineServer.java   | 11 ++--
 .../org/apache/hudi/utilities/UtilHelpers.java | 26 +-
 .../adhoc/UpgradePayloadFromUberToApache.java  | 20 
 .../AbstractDeltaStreamerService.java  |  8 +--
 .../hudi/utilities/deltastreamer/Compactor.java|  6 ++-
 .../hudi/utilities/deltastreamer/DeltaSync.java| 42 ---
 .../deltastreamer/HoodieDeltaStreamer.java | 50 +-
 .../deltastreamer/HoodieDeltaStreamerMetrics.java  |  3 +-
 .../deltastreamer/SchedulerConfGenerator.java  | 12 +++--
 .../deltastreamer/SourceFormatAdapter.java | 11 ++--
 .../exception/HoodieIncrementalPullException.java  |  3 +-
 .../keygen/TimestampBasedKeyGenerator.java | 18 ---
 .../hudi/utilities/perf/TimelineServerPerf.java| 36 ++---
 .../utilities/schema/FilebasedSchemaProvider.java  | 12 +++--
 .../schema/NullTargetSchemaRegistryProvider.java   |  3 +-
 .../utilities/schema/RowBasedSchemaProvider.java   |  3 +-
 .../hudi/utilities/schema/SchemaProvider.java  |  6 ++-
 .../utilities/schema/SchemaRegistryProvider.java   | 12 +++--
 .../hudi/utilities/sources/AvroDFSSource.java  |  9 ++--
 .../hudi/utilities/sources/AvroKafkaSource.java|  7 +--
 .../apache/hudi/utilities/sources/AvroSource.java  |  3 +-
 

[GitHub] [incubator-hudi] yanghua merged pull request #1055: [HUDI-366] Refactor some module codes based on new ImportOrder code style rule

2019-11-27 Thread GitBox
yanghua merged pull request #1055: [HUDI-366] Refactor some module codes based 
on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1055
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-366) Refactor some module codes based on new ImportOrder codestyle rule

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-366:

Summary: Refactor some module codes based on new ImportOrder codestyle rule 
 (was: Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark  / 
hudi-integ-test based on new ImportOrder code style rule)

> Refactor some module codes based on new ImportOrder codestyle rule
> --
>
> Key: HUDI-366
> URL: https://issues.apache.org/jira/browse/HUDI-366
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Refactor some module codes based on new ImportOrder codestyle rule
>  
> hudi-hadoop-mr
> hudi-timeline-service
> hudi-spark
> hudi-integ-test
> hudi- utilities
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-366) Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / hudi-integ-test based on new ImportOrder code style rule

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-366:

Description: 
Refactor some module codes based on new ImportOrder codestyle rule

 

hudi-hadoop-mr

hudi-timeline-service

hudi-spark

hudi-integ-test

hudi- utilities

 

  was:Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / 
hudi-integ-test based on new ImportOrder code style rule


> Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark  / 
> hudi-integ-test based on new ImportOrder code style rule
> 
>
> Key: HUDI-366
> URL: https://issues.apache.org/jira/browse/HUDI-366
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Refactor some module codes based on new ImportOrder codestyle rule
>  
> hudi-hadoop-mr
> hudi-timeline-service
> hudi-spark
> hudi-integ-test
> hudi- utilities
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1055: [HUDI-366] Refactor some mudule codes based on new ImportOrder code style rule

2019-11-27 Thread GitBox
lamber-ken commented on issue #1055: [HUDI-366] Refactor some mudule codes 
based on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1055#issuecomment-559064720
 
 
   hi, @yanghua help to review, thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-118) Hudi CLI : Provide options for passing properties to Compactor, Cleaner and ParquetImporter

2019-11-27 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983461#comment-16983461
 ] 

Pratyaksh Sharma commented on HUDI-118:
---

[~vbalaji] Yes the description is now clear. I am trying to understand the flow 
of the concerned commands (CleansCommand, CompactionCommand, 
HDFSParquetImporterCommand). IIUC, DFS properties file is needed to initiate 
HoodieTableMetaClient in CleansCommand and CompactionCommand since in these 
two, it is never getting initialised. 

However I am not able to understand the use case of adding the same in 
HDFSParquetImporterCommand. Would be great if you could help me here with your 
valuable inputs. 

> Hudi CLI : Provide options for passing properties to Compactor, Cleaner and 
> ParquetImporter 
> 
>
> Key: HUDI-118
> URL: https://issues.apache.org/jira/browse/HUDI-118
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI, newbie
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Minor
>
> For non-trivial CLI operations, we have a standalone script in hudi-utilities 
> that users can call directly using spark-submit (usually). We also have 
> commands in hudi-cli to invoke the commands directly from hudi-cli shell.
> There was an earlier effort to allow users to pass properties directly to the 
> scripts in hudi-utilities but we still need to give the same functionality to 
> the corresponding commands in hudi-cli.
> In hudi-cli, Compaction (schedule/compact), Cleaner and HDFSParquetImporter 
> command does not have option to pass DFS properties file. This is a followup 
> to PR [https://github.com/apache/incubator-hudi/pull/691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-373) Refactor hudi-client based on new ImportOrder code style rule

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-373:

Summary:  Refactor hudi-client based on new ImportOrder code style rule  
(was:  1. Refactor hudi-client based on new ImportOrder code style rule)

>  Refactor hudi-client based on new ImportOrder code style rule
> --
>
> Key: HUDI-373
> URL: https://issues.apache.org/jira/browse/HUDI-373
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-373) 1. Refactor hudi-client based on new ImportOrder code style rule

2019-11-27 Thread lamber-ken (Jira)
lamber-ken created HUDI-373:
---

 Summary:  1. Refactor hudi-client based on new ImportOrder code 
style rule
 Key: HUDI-373
 URL: https://issues.apache.org/jira/browse/HUDI-373
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: lamber-ken






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-369) Refactor hudi-utilities based on new ImportOrder code style rule

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed HUDI-369.
---
Resolution: Done

> Refactor hudi-utilities based on new ImportOrder code style rule
> 
>
> Key: HUDI-369
> URL: https://issues.apache.org/jira/browse/HUDI-369
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-366) Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / hudi-integ-test based on new ImportOrder code style rule

2019-11-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-366:

Labels: pull-request-available  (was: )

> Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark  / 
> hudi-integ-test based on new ImportOrder code style rule
> 
>
> Key: HUDI-366
> URL: https://issues.apache.org/jira/browse/HUDI-366
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>
> Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / 
> hudi-integ-test based on new ImportOrder code style rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1055: [HUDI-366] Refactor some mudule codes based on new ImportOrder code style rule

2019-11-27 Thread GitBox
lamber-ken opened a new pull request #1055: [HUDI-366] Refactor some mudule 
codes based on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1055
 
 
   ## What is the purpose of the pull request
   
   Refactor some mudule codes based on new ImportOrder code style rule.
   
   - hudi-hadoop-mr
   - hudi-timeline-service
   - hudi-spark
   - hudi-integ-test
   - hudi- utilities
   
   ## Brief change log
   
 - Modify some module codes
   
   ## Verify this pull request
   
   This pull request is a trivial rework code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-372) Support the shortName for Hudi DataSouce

2019-11-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-372:

Labels: pull-request-available  (was: )

> Support the shortName for Hudi DataSouce
> 
>
> Key: HUDI-372
> URL: https://issues.apache.org/jira/browse/HUDI-372
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>
> Fix the shortName of DataSouce, after this issue, we can use this command 
> like this
> {code:java}
> spark.read.format("hudi")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1054: [HUDI-372] Support the shortName for Hudi DataSouce

2019-11-27 Thread GitBox
lamber-ken opened a new pull request #1054: [HUDI-372] Support the shortName 
for Hudi DataSouce
URL: https://github.com/apache/incubator-hudi/pull/1054
 
 
   ## What is the purpose of the pull request
   
   Support the shortName for Hudi DataSouce, we can use shortName like
   ```
   val updates = convertToStringList(dataGen.generateUpdates(10))
   val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
   df.write.format("hudi").
   options(getQuickstartWriteConfigs).
   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
   option(TABLE_NAME, tableName).
   mode(Append).
   save(basePath);
   ```
   
   ## Brief change log
   
 - Add org.apache.spark.sql.sources.DataSourceRegister to META-INF/services
 - Add test
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as 
`TestDataSource#testShotNameStorage`.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-372) Support the shortName for Hudi DataSouce

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-372:

Summary: Support the shortName for Hudi DataSouce  (was: Fix the shortName 
of DataSouce)

> Support the shortName for Hudi DataSouce
> 
>
> Key: HUDI-372
> URL: https://issues.apache.org/jira/browse/HUDI-372
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Fix the shortName of DataSouce, after this issue, we can use this command 
> like this
> {code:java}
> spark.read.format("hudi")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-184) Integrate Hudi with Apache Flink

2019-11-27 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983313#comment-16983313
 ] 

vinoyang commented on HUDI-184:
---

[~vinoth] I wrote more thought in a Google doc, please see here: 
[https://docs.google.com/document/d/1wDTpzuehfcDYMTD3c-lJqIiP1rt095XBzdZoPYMTvlY/edit?usp=sharing]

Please feel free to share your comments.

> Integrate Hudi with Apache Flink
> 
>
> Key: HUDI-184
> URL: https://issues.apache.org/jira/browse/HUDI-184
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Write Client
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache Flink is a popular streaming processing engine.
> Integrating Hudi with Flink is a valuable work.
> The discussion mailing thread is here: 
> [https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-27 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983307#comment-16983307
 ] 

Pratyaksh Sharma commented on HUDI-288:
---

[~xleesf] Here is what I have done - 

 

Basically we want to have single instance running for multiple tables in one 
go. But Hudi supports single topic and single target path with 1-1 mapping. To 
overcome this, we wrote a wrapper to incorporate the following topic/table 
overrides ->

1. Have separate topic for each table
2. Have separate record key and partition path for each table
3. Have separate schema-registry url for each table
4. Have separate hive_sync database and hive_sync table for every table
5. Support customised key generators for every table based on how the partition 
path field is formatted and also specify the same format as a config
6. Target base path has a one to one mapping with topic. To achieve this, 
target path was designed using the format -

// (this 
customization was done in wrapper after reading the args passed with 
spark-submit command)

7. Similar format is adopted for target_table_name as well, which is used at 
the time of registering the metrics.

Custom POJO TableConfig is used to maintain all table/topic specific properties 
and the corresponding JSON objects are written in the form of a list in a 
separate file. The same needs to be passed to spark-submit command using 
--files option. Wrapper reads this file and iterates over all the TableConfig 
objects. It creates one HoodieDeltaStreamer instance for each such object and 
hence does the ingestion for every topic running in a loop. This wrapper is 
scheduled via oozie. Here is a sample TableConfig object -

{
 "id": 
 "record_key_field": "",
 "partition_key_field": "",
 "kafka_topic": "",
 "partition_input_format": "-MM-dd HH:mm:ss.S"
}

We can keep on adding more topic specific overrides as per our need and use 
case. This design supports ingestion should run once for any table and not in 
continuous mode. We can discuss further to see how we can support continuous 
mode for all the tables using this wrapper. 

Please let me know if this makes sense. 

> Add support for ingesting multiple kafka streams in a single DeltaStreamer 
> deployment
> -
>
> Key: HUDI-288
> URL: https://issues.apache.org/jira/browse/HUDI-288
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> https://lists.apache.org/thread.html/3a69934657c48b1c0d85cba223d69cb18e18cd8aaa4817c9fd72cef6@
>  has all the context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-372) Fix the shortName of DataSouce

2019-11-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-372:
---

Assignee: lamber-ken

> Fix the shortName of DataSouce
> --
>
> Key: HUDI-372
> URL: https://issues.apache.org/jira/browse/HUDI-372
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Fix the shortName of DataSouce, after this issue, we can use this command 
> like this
> {code:java}
> spark.read.format("hudi")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-372) Fix the shortName of DataSouce

2019-11-27 Thread lamber-ken (Jira)
lamber-ken created HUDI-372:
---

 Summary: Fix the shortName of DataSouce
 Key: HUDI-372
 URL: https://issues.apache.org/jira/browse/HUDI-372
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: lamber-ken


Fix the shortName of DataSouce, after this issue, we can use this command like 
this
{code:java}
spark.read.format("hudi")
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-364) Refactor hudi-hive based on new ImportOrder code style rule

2019-11-27 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang resolved HUDI-364.
---
Resolution: Done

Done via b77fad39b53049279bba1ba367f43cce73f48bfa

> Refactor hudi-hive based on new ImportOrder code style rule
> ---
>
> Key: HUDI-364
> URL: https://issues.apache.org/jira/browse/HUDI-364
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Refactor hudi-hive based on new ImportOrder code style rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua merged pull request #1048: [HUDI-364] Refactor hudi-hive based on new ImportOrder code style rule

2019-11-27 Thread GitBox
yanghua merged pull request #1048: [HUDI-364] Refactor hudi-hive based on new 
ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1048
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (d6e83e8 -> b77fad3)

2019-11-27 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from d6e83e8  [HUDI-325] Fix Hive partition error for updated HDFS Hudi 
table (#1001)
 add b77fad3  [HUDI-364] Refactor hudi-hive based on new ImportOrder code 
style rule (#1048)

No new revisions were added by this update.

Summary of changes:
 .../hudi/common/util/collection/DiskBasedMap.java  |  3 +-
 .../java/org/apache/hudi/hive/HiveSyncConfig.java  |  1 +
 .../java/org/apache/hudi/hive/HiveSyncTool.java| 21 +-
 .../org/apache/hudi/hive/HoodieHiveClient.java |  4 +-
 .../hudi/hive/MultiPartKeysValueExtractor.java |  1 +
 .../org/apache/hudi/hive/SchemaDifference.java |  3 +-
 .../SlashEncodedDayPartitionValueExtractor.java|  3 +-
 .../apache/hudi/hive/util/ColumnNameXLator.java|  1 +
 .../java/org/apache/hudi/hive/util/SchemaUtil.java | 22 +-
 .../org/apache/hudi/hive/TestHiveSyncTool.java | 24 ++-
 .../test/java/org/apache/hudi/hive/TestUtil.java   | 48 +++---
 .../org/apache/hudi/hive/util/HiveTestService.java | 20 +
 .../adhoc/UpgradePayloadFromUberToApache.java  |  2 +-
 style/checkstyle.xml   |  2 +-
 14 files changed, 84 insertions(+), 71 deletions(-)



[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1048: [HUDI-364] Refactor hudi-hive based on new ImportOrder code style rule

2019-11-27 Thread GitBox
yanghua commented on a change in pull request #1048: [HUDI-364] Refactor 
hudi-hive based on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1048#discussion_r351141351
 
 

 ##
 File path: style/checkstyle.xml
 ##
 @@ -34,7 +34,7 @@
 
 
 
-
+
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1048: [HUDI-364] Refactor hudi-hive based on new ImportOrder code style rule

2019-11-27 Thread GitBox
lamber-ken commented on a change in pull request #1048: [HUDI-364] Refactor 
hudi-hive based on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1048#discussion_r351140686
 
 

 ##
 File path: style/checkstyle.xml
 ##
 @@ -34,7 +34,7 @@
 
 
 
-
+
 
 Review comment:
   No impact


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1048: [HUDI-364] Refactor hudi-hive based on new ImportOrder code style rule

2019-11-27 Thread GitBox
yanghua commented on a change in pull request #1048: [HUDI-364] Refactor 
hudi-hive based on new ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1048#discussion_r351139023
 
 

 ##
 File path: style/checkstyle.xml
 ##
 @@ -34,7 +34,7 @@
 
 
 
-
+
 
 Review comment:
   If we change this level, whether it would break the build progress for those 
modules which do not refactor the import orders currently?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1048: [HUDI-364] Refactor hudi-hive based on new ImportOrder code style rule

2019-11-27 Thread GitBox
lamber-ken commented on issue #1048: [HUDI-364] Refactor hudi-hive based on new 
ImportOrder code style rule
URL: https://github.com/apache/incubator-hudi/pull/1048#issuecomment-558974640
 
 
   @yanghua hi, build ok.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services