[jira] [Closed] (HUDI-982) Make flink engine support MOR table

2021-03-25 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang closed HUDI-982.
-
Resolution: Duplicate

> Make flink engine support MOR table
> ---
>
> Key: HUDI-982
> URL: https://issues.apache.org/jira/browse/HUDI-982
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu#1
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
>
> Make flink engine support MOR table



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2718: [HUDI-1495] Bump Flink version to 1.12.2

2021-03-25 Thread GitBox


codecov-io edited a comment on pull request #2718:
URL: https://github.com/apache/hudi/pull/2718#issuecomment-807923018


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=h1) Report
   > Merging 
[#2718](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=desc) (f6f0f75) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc)
 (6e803e0) will **increase** coverage by `0.02%`.
   > The diff coverage is `32.69%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2718/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2718      +/-   ##
   ============================================
   + Coverage     51.72%   51.74%   +0.02%
   - Complexity     3601     3606       +5
   ============================================
     Files           476      476
     Lines         22595    22611      +16
     Branches       2409     2410       +1
   ============================================
   + Hits          11687    11701      +14
   + Misses         9889     9888       -1
   - Partials       1019     1022       +3
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.94% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `54.18% <32.69%> (+0.09%)` | `0.00 <13.00> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `45.58% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh)
 | `68.94% <0.00%> (-0.44%)` | `32.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/streamer/HoodieFlinkStreamerV2.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zdHJlYW1lci9Ib29kaWVGbGlua1N0cmVhbWVyVjIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...in/java/org/apache/hudi/table/HoodieTableSink.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNpbmsuamF2YQ==)
 | `12.19% <2.77%> (-2.10%)` | `2.00 <1.00> (ø)` | |
   | 
[.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh)
 | `60.97% <14.70%> (-0.48%)` | `25.00 <2.00> (-3.00)` | |
   | 
[...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh)
 | `84.05% <76.92%> (-0.56%)` | `11.00 <4.00> (+4.00)` | :arrow_down: |
   | 
[...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==)
 | `88.09% <94.11%> (+15.36%)` | `14.00 <6.00> (+3.00)` | |
   | 
[...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh)
 | `49.53% <100.00%> (ø)` | `17.00 <0.00> (+1.00)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `79.68% <0.00%> (+1.56%)` | `26.00% <0.00%> (ø%)` | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2718: [HUDI-1495] Bump Flink version to 1.12.2

2021-03-25 Thread GitBox


codecov-io commented on pull request #2718:
URL: https://github.com/apache/hudi/pull/2718#issuecomment-807923018


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=h1) Report
   > Merging 
[#2718](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=desc) (f6f0f75) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc)
 (6e803e0) will **increase** coverage by `0.09%`.
   > The diff coverage is `32.69%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2718/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2718      +/-   ##
   ============================================
   + Coverage     51.72%   51.81%   +0.09%
   + Complexity     3601     3416     -185
   ============================================
     Files           476      454      -22
     Lines         22595    21006    -1589
     Branches       2409     2255     -154
   ============================================
   - Hits          11687    10885     -802
   + Misses         9889     9174     -715
   + Partials       1019      947      -72
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.94% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `54.18% <32.69%> (+0.09%)` | `0.00 <13.00> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh)
 | `68.94% <0.00%> (-0.44%)` | `32.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/streamer/HoodieFlinkStreamerV2.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zdHJlYW1lci9Ib29kaWVGbGlua1N0cmVhbWVyVjIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...in/java/org/apache/hudi/table/HoodieTableSink.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNpbmsuamF2YQ==)
 | `12.19% <2.77%> (-2.10%)` | `2.00 <1.00> (ø)` | |
   | 
[.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh)
 | `60.97% <14.70%> (-0.48%)` | `25.00 <2.00> (-3.00)` | |
   | 
[...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh)
 | `84.05% <76.92%> (-0.56%)` | `11.00 <4.00> (+4.00)` | :arrow_down: |
   | 
[...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==)
 | `88.09% <94.11%> (+15.36%)` | `14.00 <6.00> (+3.00)` | |
   | 
[...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh)
 | `49.53% <100.00%> (ø)` | `17.00 <0.00> (+1.00)` | |
   | 
[.../org/apache/hudi/hive/NonPartitionedExtractor.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTm9uUGFydGl0aW9uZWRFeHRyYWN0b3IuamF2YQ==)
 | | | |
   | 
[...apache/hudi/timeline/service/handlers/Handler.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvSGFuZGxlci5qYXZh)
 | | | |
   | ... and [21 
more](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree-more) | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] pengzhiwei2018 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807917549


   > LGTM now. I mostly have just minor comments now which you can address.
   > 
   > Can you run these unit tests you added once with `-Pspark3` to make sure 
this is running seamlessly for Spark 3 ? The travis tests right now don't run 
the tests with Spark 3.
   
   Thanks for your review, @umehrot2!
   I will address these comments soon and test the build with `-Pspark3`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


danny0405 commented on a change in pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#discussion_r601994200



##
File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh
##
@@ -49,6 +49,15 @@ fi
 HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'`
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON
 
+HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_CALCITE" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
+fi
+HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_LIBFB303" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303

Review comment:
   Is it possible to distinguish between Hive2 and Hive3 and use different library dependencies?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Xoln commented on a change in pull request #2520: [HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init

2021-03-25 Thread GitBox


Xoln commented on a change in pull request #2520:
URL: https://github.com/apache/hudi/pull/2520#discussion_r601986858



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.List;
+
+/**
+ * No Op Bootstrap Index , which is a emtpy implement and not do anything.

Review comment:
   done

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java
##
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.List;
+
+/**
+ * No Op Bootstrap Index , which is a emtpy implement and not do anything.
+ */
+public class NoOpBootstrapIndex extends BootstrapIndex {
+
+  public NoOpBootstrapIndex(HoodieTableMetaClient metaClient) {
+super(metaClient);
+  }
+
+  @Override
+  public IndexReader createReader() {
+return null;

Review comment:
   done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


xiarixiaoyao commented on a change in pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#discussion_r601949535



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
##
@@ -95,15 +103,24 @@ public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOEx
         // TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the
         // deltaRecord may not be a full record and needs values of columns from the parquet
         Option rec;
-        if (usesCustomPayload) {
-          rec = deltaRecordMap.get(key).getData().getInsertValue(getWriterSchema());
-        } else {
-          rec = deltaRecordMap.get(key).getData().getInsertValue(getReaderSchema());
+        rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
+        // If the record is not present, this is a delete record using an empty payload so skip this base record
+        // and move to the next record
+        while (!rec.isPresent()) {
+          // if current parquet reader has no record, return false
+          if (!this.parquetReader.next(aVoid, arrayWritable)) {

Review comment:
   @garyli1019 thanks for your reply. No, we will not miss a record. When we call this.parquetReader.next(aVoid, arrayWritable) and that function returns true, the parquet reader will fill the new record into arrayWritable automatically (see the parquet reader source code below).
   
   source code of parquet reader:

     public boolean next(final NullWritable key, final ArrayWritable value) throws IOException {

       if (value != null && arrValue.length == arrCurrent.length) {
         System.arraycopy(arrCurrent, 0, arrValue, 0, arrCurrent.length);
       } else {
         if (arrValue.length != arrCurrent.length) {
           throw new IOException("DeprecatedParquetHiveInput : size of object differs. Value" +
               " size :  " + arrValue.length + ", Current Object size : " + arrCurrent.length);
         } else {
           throw new IOException("DeprecatedParquetHiveInput can not support RecordReaders that" +
               " don't return same key & value & value is null");
         }
       }
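
   For readers following the archive, here is a minimal, self-contained Java sketch of the skip-delete pattern being discussed. This is not Hudi code: `FakeReader`, the string-array holder, and the key/value layout are illustrative stand-ins; the only behaviour carried over from the snippet above is that `next()` fills the output holder as a side effect of advancing the reader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch (not Hudi code): because next(holder) copies the next base row
// into the holder before returning, a "skip deleted keys" loop can never drop a row --
// whatever key the loop stops on is already sitting in the holder, ready to be merged.
public class SkipDeleteLoopSketch {

  /** Stand-in for the parquet reader: next() fills the holder as a side effect. */
  static class FakeReader {
    private final String[][] rows = {{"k1", "a"}, {"k2", "b"}, {"k3", "c"}};
    private int pos = -1;

    boolean next(String[] holder) {
      if (++pos >= rows.length) {
        return false;                                  // base file exhausted
      }
      System.arraycopy(rows[pos], 0, holder, 0, 2);    // like ArrayWritable being refilled
      return true;
    }
  }

  public static void main(String[] args) {
    // Stand-in for deltaRecordMap: k1 was deleted (empty payload), k2 was updated.
    Map<String, Optional<String>> deltaRecordMap = new HashMap<>();
    deltaRecordMap.put("k1", Optional.empty());
    deltaRecordMap.put("k2", Optional.of("b-updated"));

    FakeReader parquetReader = new FakeReader();
    String[] holder = new String[2];

    while (parquetReader.next(holder)) {
      String key = holder[0];
      Optional<String> rec = deltaRecordMap.getOrDefault(key, Optional.of(holder[1]));
      while (!rec.isPresent()) {                       // delete record: skip this base row
        if (!parquetReader.next(holder)) {
          return;                                      // nothing left to return
        }
        key = holder[0];
        rec = deltaRecordMap.getOrDefault(key, Optional.of(holder[1]));
      }
      System.out.println(key + " -> " + rec.get());
    }
  }
}
```

   Running it prints `k2 -> b-updated` and `k3 -> c`: the deleted `k1` is skipped without losing any later row, which is the point being made above.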




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] root18039532923 commented on issue #2623: org.apache.hudi.exception.HoodieDependentSystemUnavailableException:System HBASE unavailable.

2021-03-25 Thread GitBox


root18039532923 commented on issue #2623:
URL: https://github.com/apache/hudi/issues/2623#issuecomment-807857780


   no reply


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


umehrot2 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601930831



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, 
SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, 
PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+  * A File Index which support partition prune for hoodie snapshot and 
read-optimized
+  * query.
+  * Main steps to get the file list for query:
+  * 1、Load all files and partition values from the table path.
+  * 2、Do the partition prune by the partition filter condition.
+  *
+  * There are 3 cases for this:
+  * 1、If the partition columns size is equal to the actually partition path 
level, we
+  * read it as partitioned table.(e.g partition column is "dt", the partition 
path is "2021-03-10")
+  *
+  * 2、If the partition columns size is not equal to the partition path level, 
but the partition
+  * column size is "1" (e.g. partition column is "dt", but the partition path 
is "2021/03/10"
+  * who'es directory level is 3).We can still read it as a partitioned table. 
We will mapping the
+  * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt").
+  *
+  * 3、Else the the partition columns size is not equal to the partition 
directory level and the
+  * size is great than "1" (e.g. partition column is "dt,hh", the partition 
path is "2021/03/10/12")
+  * , we read it as a None Partitioned table because we cannot know how to 
mapping the partition
+  * path with the partition columns in this case.
+  */
+case class HoodieFileIndex(
+ spark: SparkSession,
+ basePath: String,
+ schemaSpec: Option[StructType],
+ options: Map[String, String])
+  extends FileIndex with Logging {
+
+  @transient private val hadoopConf = spark.sessionState.newHadoopConf()
+  private lazy val metaClient = HoodieTableMetaClient
+.builder().setConf(hadoopConf).setBasePath(basePath).build()
+
+  @transient private val queryPath = new Path(options.getOrElse("path", 
"'path' option required"))
+  /**
+* Get the schema of the table.
+*/
+  lazy val schema: StructType = schemaSpec.getOrElse({
+val schemaUtil = new TableSchemaResolver(metaClient)
+SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema)
+  .dataType.asInstanceOf[StructType]
+  })
+
+  /**
+* Get the partition schema from the hoodie.properties.
+*/
+  private lazy val _partitionSchemaFromProperties: StructType = {
+val tableConfig = metaClient.getTableConfig
+val partitionColumns = tableConfig.getPartitionColumns
+val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap
+
+if (partitionColumns.isPresent) {
+  val partitionFields = partitionColumns.get().map(column =>
+nameFieldMap.getOrElse(column, throw new 
IllegalArgumentException(s"Cannot find column " +
+

[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


umehrot2 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601930658



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, 
SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, 
PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+  * A File Index which support partition prune for hoodie snapshot and 
read-optimized
+  * query.
+  * Main steps to get the file list for query:
+  * 1、Load all files and partition values from the table path.
+  * 2、Do the partition prune by the partition filter condition.
+  *
+  * There are 3 cases for this:
+  * 1、If the partition columns size is equal to the actually partition path 
level, we
+  * read it as partitioned table.(e.g partition column is "dt", the partition 
path is "2021-03-10")
+  *
+  * 2、If the partition columns size is not equal to the partition path level, 
but the partition
+  * column size is "1" (e.g. partition column is "dt", but the partition path 
is "2021/03/10"
+  * who'es directory level is 3).We can still read it as a partitioned table. 
We will mapping the
+  * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt").
+  *
+  * 3、Else the the partition columns size is not equal to the partition 
directory level and the
+  * size is great than "1" (e.g. partition column is "dt,hh", the partition 
path is "2021/03/10/12")
+  * , we read it as a None Partitioned table because we cannot know how to 
mapping the partition
+  * path with the partition columns in this case.
+  */
+case class HoodieFileIndex(
+ spark: SparkSession,
+ basePath: String,
+ schemaSpec: Option[StructType],
+ options: Map[String, String])
+  extends FileIndex with Logging {
+
+  @transient private val hadoopConf = spark.sessionState.newHadoopConf()
+  private lazy val metaClient = HoodieTableMetaClient
+.builder().setConf(hadoopConf).setBasePath(basePath).build()
+
+  @transient private val queryPath = new Path(options.getOrElse("path", 
"'path' option required"))
+  /**
+* Get the schema of the table.
+*/
+  lazy val schema: StructType = schemaSpec.getOrElse({
+val schemaUtil = new TableSchemaResolver(metaClient)
+SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema)
+  .dataType.asInstanceOf[StructType]
+  })
+
+  /**
+* Get the partition schema from the hoodie.properties.
+*/
+  private lazy val _partitionSchemaFromProperties: StructType = {
+val tableConfig = metaClient.getTableConfig
+val partitionColumns = tableConfig.getPartitionColumns
+val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap
+
+if (partitionColumns.isPresent) {
+  val partitionFields = partitionColumns.get().map(column =>
+nameFieldMap.getOrElse(column, throw new 
IllegalArgumentException(s"Cannot find column " +
+

[GitHub] [hudi] umehrot2 edited a comment on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


umehrot2 edited a comment on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807830171


   LGTM now. I mostly have just minor comments now which you can address.
   
   Can you run these unit tests you added once with `-Pspark3` to make sure 
this is running seamlessly for Spark 3 ? The travis tests right now don't run 
the tests with Spark 3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807830171


   Can you run these unit tests you added once with `-Pspark3` to make sure 
this is running seamlessly for Spark 3 ? The travis tests right now don't run 
the tests with Spark 3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


umehrot2 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601854740



##
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
##
@@ -276,6 +276,16 @@ public static void processFiles(FileSystem fs, String basePathStr, Function...

##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, 
SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, FileStatusCache, 
NoopCache, PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.mutable
+
+/**
+  * A File Index which support partition prune for hoodie snapshot and 
read-optimized
+  * query.
+  * Main steps to get the file list for query:
+  * 1、Load all files and partition values from the table path.
+  * 2、Do the partition prune by the partition filter condition.
+  *
+  * There are 3 cases for this:
+  * 1、If the partition columns size is equal to the actually partition path 
level, we
+  * read it as partitioned table.(e.g partition column is "dt", the partition 
path is "2021-03-10")
+  *
+  * 2、If the partition columns size is not equal to the partition path level, 
but the partition
+  * column size is "1" (e.g. partition column is "dt", but the partition path 
is "2021/03/10"
+  * who'es directory level is 3).We can still read it as a partitioned table. 
We will mapping the
+  * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt").
+  *
+  * 3、Else the the partition columns size is not equal to the partition 
directory level and the
+  * size is great than "1" (e.g. partition column is "dt,hh", the partition 
path is "2021/03/10/12")
+  * , we read it as a None Partitioned table because we cannot know how to 
mapping the partition
+  * path with the partition columns in this case.
+  */
+case class HoodieFileIndex(
+ spark: SparkSession,
+ metaClient: HoodieTableMetaClient,
+ schemaSpec: Option[StructType],
+ options: Map[String, String],
+ @transient fileStatusCache: FileStatusCache = NoopCache)
+  extends FileIndex with Logging {
+
+  private val basePath = metaClient.getBasePath
+
+  @transient private val queryPath = new Path(options.getOrElse("path", 
"'path' option required"))
+  /**
+* Get the schema of the table.
+*/
+  lazy val schema: StructType = schemaSpec.getOrElse({
+val schemaUtil = new TableSchemaResolver(metaClient)
+SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema)
+  .dataType.asInstanceOf[StructType]
+  })
+
+  /**
+* Get the partition schema from the hoodie.properties.
+*/
+  private lazy val _partitionSchemaFromProperties: StructType = {
+val tableConfig = metaClient.getTableConfig
+val partitionColumns = tableConfig.getPartitionColumns
+val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap
+
+if (partitionColumns.isPresent) {
+  val partitionFields = partitionColumns.get().map(column =>
+nameFieldMap.getOrElse(column, throw new 
IllegalArgumentException(s"Cannot find column: '" +
+  s"$column' in the schema[${schema.fields.mkString(",")}]")))
+  new StructType(partitionFields)
+} else { // If the partition columns have not stored in 
hoodie.properites(the table that was
+  // created earlier), we trait it as a none-partitioned table.
+  new StructType()
+}
+  }
+
+  @transient @volatile private var fileSystemView: 

[GitHub] [hudi] n3nash commented on a change in pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…

2021-03-25 Thread GitBox


n3nash commented on a change in pull request #2607:
URL: https://github.com/apache/hudi/pull/2607#discussion_r601926515



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##
@@ -336,8 +340,12 @@ public void write(GenericRecord oldRecord) {
   }
 
   long fileSizeInBytes = FSUtils.getFileSize(fs, newFilePath);
-  HoodieWriteStat stat = writeStatus.getStat();
 
+  // record write metrics
+  executorMetrics.recordWriteMetrics(getIOType(), 
InetAddress.getLocalHost().getHostName(),

Review comment:
   private Map<...> {
     list.add("io_type", getIOType());
     ..
     ..
     return new Map(CONSTANTS.WRITE_METRICS, list)
   }
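
   For readers of the archive, a minimal, self-contained sketch of the helper-method shape being suggested above. All names here (`buildWriteMetrics`, `WRITE_METRICS`, the `io_type` entry, the stand-in `getIOType()`) are illustrative assumptions, not Hudi's actual API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch: gather the per-write stats into a map keyed by a single constant
// inside one private helper, so the write path merges the returned map into its metrics
// report instead of passing a long positional argument list to the metrics recorder.
public class WriteMetricsSketch {

  static final String WRITE_METRICS = "write_metrics";  // stand-in for CONSTANTS.WRITE_METRICS

  private String getIOType() {
    return "MERGE";                                     // stand-in for the handle's IO type
  }

  private Map<String, List<String>> buildWriteMetrics(String hostName, long fileSizeInBytes) {
    List<String> entries = new ArrayList<>();
    entries.add("io_type=" + getIOType());
    entries.add("host=" + hostName);
    entries.add("file_size_bytes=" + fileSizeInBytes);
    return Collections.singletonMap(WRITE_METRICS, entries);
  }

  public static void main(String[] args) {
    Map<String, List<String>> metrics =
        new WriteMetricsSketch().buildWriteMetrics("worker-1", 1_048_576L);
    // prints {write_metrics=[io_type=MERGE, host=worker-1, file_size_bytes=1048576]}
    System.out.println(metrics);
  }
}
```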




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] MyLanPangzi commented on pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


MyLanPangzi commented on pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#issuecomment-807731849


   Sorry, I forgot to mention the issue:
   https://github.com/apache/hudi/issues/2717


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on pull request #2699: [HUDI-1709] Improving lock config names and adding hive metastore uri config

2021-03-25 Thread GitBox


rubenssoto commented on pull request #2699:
URL: https://github.com/apache/hudi/pull/2699#issuecomment-807731807


   Hello,
   
   How does this change for the hive metastore uri work?
   I want to try syncing metadata through the hive metastore instead of hive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2678: Added support for replace commits in commit showpartitions, commit sh…

2021-03-25 Thread GitBox


satishkotha commented on a change in pull request #2678:
URL: https://github.com/apache/hudi/pull/2678#discussion_r594753875



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
##
@@ -431,4 +442,20 @@ public String syncCommits(@CliOption(key = {"path"}, help 
= "Path of the table t
 return "Load sync state between " + 
HoodieCLI.getTableMetaClient().getTableConfig().getTableName() + " and "
 + HoodieCLI.syncTableMetadata.getTableConfig().getTableName();
   }
+
+  /*
+  Checks whether a commit or replacecommit action exists in the timeline.
+  * */
+  private Option getCommitOrReplaceCommitInstant(HoodieTimeline 
timeline, String instantTime) {

Review comment:
   consider changing signature to return Option of HoodieCommitMetadata and 
deserialize instant details inside this method. This would avoid repetition to 
get instant details in multiple places. You can also do additional validation. 
for example: for replace commit, deserialize  using HoodieReplaceCommitMetadata 
class 
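
   For readers skimming the archive, a rough, self-contained sketch of the shape being suggested. The types and names are stand-ins (this is not the actual CommitsCommand patch, and deserialization is faked with a string copy); only the idea of returning already-deserialized, already-validated commit metadata from one helper comes from the comment above:

```java
import java.util.Optional;

// Illustrative sketch: one helper both finds the instant and deserializes its details,
// choosing the richer metadata class for replace commits, so callers stop repeating the
// "look up instant, read details, pick the metadata class" sequence.
public class CommitLookupSketch {

  // Stand-ins for HoodieInstant / HoodieCommitMetadata / HoodieReplaceCommitMetadata.
  static class Instant {
    final String timestamp; final String action; final byte[] details;
    Instant(String timestamp, String action, byte[] details) {
      this.timestamp = timestamp; this.action = action; this.details = details;
    }
  }
  static class CommitMetadata {
    final String json;
    CommitMetadata(String json) { this.json = json; }
  }
  static class ReplaceCommitMetadata extends CommitMetadata {
    ReplaceCommitMetadata(String json) { super(json); }
  }

  private final Instant[] timeline;

  CommitLookupSketch(Instant... timeline) { this.timeline = timeline; }

  /** Finds the instant and returns its deserialized metadata, or empty if absent/invalid. */
  Optional<CommitMetadata> getCommitOrReplaceCommitMetadata(String instantTime) {
    for (Instant instant : timeline) {
      if (!instant.timestamp.equals(instantTime)) {
        continue;
      }
      if ("replacecommit".equals(instant.action)) {
        return Optional.of(new ReplaceCommitMetadata(new String(instant.details)));
      }
      if ("commit".equals(instant.action) || "deltacommit".equals(instant.action)) {
        return Optional.of(new CommitMetadata(new String(instant.details)));
      }
      return Optional.empty();     // instant found, but not a commit-like action
    }
    return Optional.empty();       // no such instant in the timeline
  }

  public static void main(String[] args) {
    CommitLookupSketch cli = new CommitLookupSketch(
        new Instant("001", "commit", "{\"writes\":10}".getBytes()),
        new Instant("002", "replacecommit", "{\"replaced\":[\"fg1\"]}".getBytes()));
    System.out.println(cli.getCommitOrReplaceCommitMetadata("002").get().json);
  }
}
```

   The point is that callers get back already-deserialized metadata, so the lookup-and-deserialize logic lives in exactly one place.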




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #2678: Added support for replace commits in commit showpartitions, commit sh…

2021-03-25 Thread GitBox


n3nash commented on pull request #2678:
URL: https://github.com/apache/hudi/pull/2678#issuecomment-807130229


   @jsbali Please file a jira ticket and add it to the heading of this PR 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


codecov-io edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-792430670


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=h1) Report
   > Merging 
[#2645](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=desc) (05ef658) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/900de34e45b4c1d19c01ea84adc38413f2bd52ff?el=desc)
 (900de34) will **increase** coverage by `17.96%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2645/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2645       +/-   ##
   =============================================
   + Coverage     51.76%   69.73%   +17.96%
   + Complexity     3601      371     -3230
   =============================================
     Files           476       54      -422
     Lines         22579     1989    -20590
     Branches       2407      236     -2171
   =============================================
   - Hits          11688     1387    -10301
   + Misses         9874      471     -9403
   + Partials       1017      131      -886
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `71.37% <0.00%> (ø)` | `55.00% <0.00%> (ø%)` | |
   | 
[...ache/hudi/common/fs/SizeAwareDataOutputStream.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1NpemVBd2FyZURhdGFPdXRwdXRTdHJlYW0uamF2YQ==)
 | | | |
   | 
[...rg/apache/hudi/common/model/HoodieRollingStat.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJvbGxpbmdTdGF0LmphdmE=)
 | | | |
   | 
[...rc/main/java/org/apache/hudi/ApiMaturityLevel.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvQXBpTWF0dXJpdHlMZXZlbC5qYXZh)
 | | | |
   | 
[...ache/hudi/source/StreamReadMonitoringFunction.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvU3RyZWFtUmVhZE1vbml0b3JpbmdGdW5jdGlvbi5qYXZh)
 | | | |
   | 
[.../hudi/common/bloom/InternalDynamicBloomFilter.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0ludGVybmFsRHluYW1pY0Jsb29tRmlsdGVyLmphdmE=)
 | | | |
   | 
[...e/hudi/common/util/queue/BoundedInMemoryQueue.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvQm91bmRlZEluTWVtb3J5UXVldWUuamF2YQ==)
 | | | |
   | 
[...a/org/apache/hudi/cli/HoodieTableHeaderFields.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVRhYmxlSGVhZGVyRmllbGRzLmphdmE=)
 | | | |
   | 
[...i/table/format/cow/Int64TimestampColumnReader.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvY293L0ludDY0VGltZXN0YW1wQ29sdW1uUmVhZGVyLmphdmE=)
 | | | |
   | 
[...udi/spark3/internal/HoodieWriterCommitMessage.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3NwYXJrMy9pbnRlcm5hbC9Ib29kaWVXcml0ZXJDb21taXRNZXNzYWdlLmphdmE=)
 | | | |
   | ... and [405 
more](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree-more) | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1716) rt view w/ MOR tables fails after schema evolution

2021-03-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1716:
--
Description: 
Looks like realtime view w/ MOR table fails if schema present in existing log 
file is evolved to add a new field. no issues w/ writing. but reading fails

More info: [https://github.com/apache/hudi/issues/2675]

 

gist of the stack trace:

Caused by: org.apache.avro.AvroTypeException: Found 
hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting 
hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field 
evolvedFieldCaused by: org.apache.avro.AvroTypeException: Found 
hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting 
hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field 
evolvedField at 
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) at 
org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at 
org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) 
at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
 at 
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
 at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) at 
org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.deserializeRecords(HoodieAvroDataBlock.java:165)
 at 
org.apache.hudi.common.table.log.block.HoodieDataBlock.createRecordsFromContentBytes(HoodieDataBlock.java:128)
 at 
org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecords(HoodieDataBlock.java:106)
 at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processDataBlock(AbstractHoodieLogRecordScanner.java:289)
 at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:324)
 at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:252)
 ... 24 more21/03/25 11:27:03 WARN TaskSetManager: Lost task 0.0 in stage 83.0 
(TID 667, sivabala-c02xg219jgh6.attlocal.net, executor driver): 
org.apache.hudi.exception.HoodieException: Exception when reading log file  at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:261)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:100)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:93)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:75)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:230)
 at 
org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:328) 
at 
org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.(HoodieMergeOnReadRDD.scala:210)
 at 
org.apache.hudi.HoodieMergeOnReadRDD.payloadCombineFileIterator(HoodieMergeOnReadRDD.scala:200)
 at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:77)

 

Logs from local run: 

[https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198]

diff with which above logs were generated: 
[https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec]

 

Steps to reproduce in spark shell:
 # create MOR table w/ schema1. 
 # Ingest (with schema1) until log files are created. // verify via hudi-cli. I 
didn't see log files w/ just 1 batch of updates. If not, do multiple rounds 
until you see log files.
 # create a new schema2 with one new additional field. ingest a batch with 
schema2 that updates existing records. 
 # read entire dataset. 

 

 

 

  was:
Looks like realtime view w/ MOR table fails if schema present in existing log 
file is evolved to add a new field. no issues w/ writing. but reading fails

More info: [https://github.com/apache/hudi/issues/2675]

 

Logs from local run: 

[https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198]

diff with which above logs were generated: 
[https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec]

 

Steps to reproduce in spark shell:
 # create MOR table w/ schema1. 
 # Ingest (with schema1) until log files are created. // verify via hudi-cli. I 
didn't see log files w/ just 1 batch of updates. If not, do multiple rounds 
until you see log files.
 # create a new schema2 with one new additional field. ingest a batch with 
schema2 that updates existing records. 
 # read entire dataset. 

 

 

 


> rt view w/ MOR tables fails after schema evolution
> --
>
> Key: HUDI-1716
> URL: https://issues.apache.org/jira/browse/HUDI-1716
> Project: Apache Hudi
>  Issue Type: Bug
>  

[jira] [Updated] (HUDI-1716) rt view w/ MOR tables fails after schema evolution

2021-03-25 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1716:
--
Description: 
Looks like realtime view w/ MOR table fails if schema present in existing log 
file is evolved to add a new field. no issues w/ writing. but reading fails

More info: [https://github.com/apache/hudi/issues/2675]

 

gist of the stack trace:

Caused by: org.apache.avro.AvroTypeException: Found 
hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting 
hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field 
evolvedFieldCaused by: org.apache.avro.AvroTypeException: Found 
hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting 
hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field 
evolvedField at 
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) at 
org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at 
org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) 
at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
 at 
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
 at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) at 
org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.deserializeRecords(HoodieAvroDataBlock.java:165)
 at 
org.apache.hudi.common.table.log.block.HoodieDataBlock.createRecordsFromContentBytes(HoodieDataBlock.java:128)
 at 
org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecords(HoodieDataBlock.java:106)
 at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processDataBlock(AbstractHoodieLogRecordScanner.java:289)
 at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:324)
 at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:252)
 ... 24 more21/03/25 11:27:03 WARN TaskSetManager: Lost task 0.0 in stage 83.0 
(TID 667, sivabala-c02xg219jgh6.attlocal.net, executor driver): 
org.apache.hudi.exception.HoodieException: Exception when reading log file  at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:261)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:100)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:93)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.(HoodieMergedLogRecordScanner.java:75)
 at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:230)
 at 
org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:328) 
at 
org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.(HoodieMergeOnReadRDD.scala:210)
 at 
org.apache.hudi.HoodieMergeOnReadRDD.payloadCombineFileIterator(HoodieMergeOnReadRDD.scala:200)
 at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:77)

 

Logs from local run: 

[https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198]

diff with which above logs were generated: 
[https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec]

 

Steps to reproduce in spark shell:
 # create MOR table w/ schema1. 
 # Ingest (with schema1) until log files are created. // verify via hudi-cli. 
It took me 2 batch of updates to see a log file.
 # create a new schema2 with one new additional field. ingest a batch with 
schema2 that updates existing records. 
 # read entire dataset. 

 

 

 

  was:
Looks like realtime view w/ MOR table fails if schema present in existing log 
file is evolved to add a new field. no issues w/ writing. but reading fails

More info: [https://github.com/apache/hudi/issues/2675]

 

gist of the stack trace:

Caused by: org.apache.avro.AvroTypeException: Found 
hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting 
hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field 
evolvedFieldCaused by: org.apache.avro.AvroTypeException: Found 
hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting 
hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field 
evolvedField at 
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) at 
org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at 
org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) 
at 
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
 at 
org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
 at 
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) at 

[GitHub] [hudi] garyli1019 commented on a change in pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


garyli1019 commented on a change in pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#discussion_r601540575



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
##
@@ -95,15 +103,24 @@ public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOEx
         // TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the
         // deltaRecord may not be a full record and needs values of columns from the parquet
         Option rec;
-        if (usesCustomPayload) {
-          rec = deltaRecordMap.get(key).getData().getInsertValue(getWriterSchema());
-        } else {
-          rec = deltaRecordMap.get(key).getData().getInsertValue(getReaderSchema());
+        rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
+        // If the record is not present, this is a delete record using an empty payload so skip this base record
+        // and move to the next record
+        while (!rec.isPresent()) {
+          // if current parquet reader has no record, return false
+          if (!this.parquetReader.next(aVoid, arrayWritable)) {

Review comment:
   if parquet has records, this will get the record but we didn't read it, 
so I guess we will miss a record here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


codecov-io edited a comment on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-794945140


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=h1) Report
   > Merging 
[#2651](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=desc) (baa4876) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/ce3e8ec87083ef4cd4f33de39b6697f66ff3f277?el=desc)
 (ce3e8ec) will **increase** coverage by `0.16%`.
   > The diff coverage is `74.49%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2651/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2651  +/-   ##
   
   + Coverage 51.76%   51.93%   +0.16% 
   - Complexity 3602 3645  +43 
   
 Files   476  478   +2 
 Lines 2257922797 +218 
 Branches   2408 2447  +39 
   
   + Hits  1168811839 +151 
   - Misses 9874 9906  +32 
   - Partials   1017 1052  +35 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.88% <0.00%> (-0.05%)` | `0.00 <0.00> (ø)` | |
   | hudiflink | `54.08% <ø> (-0.20%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `71.78% <79.73%> (+0.91%)` | `0.00 <26.00> (ø)` | |
   | hudisync | `45.58% <ø> (-0.12%)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.72% <50.00%> (-0.06%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==)
 | `47.34% <0.00%> (-0.94%)` | `57.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/common/table/HoodieTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlQ29uZmlnLmphdmE=)
 | `43.20% <0.00%> (-2.25%)` | `17.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `66.66% <0.00%> (-1.65%)` | `43.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `71.28% <50.00%> (-0.45%)` | `56.00 <0.00> (ø)` | |
   | 
[...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=)
 | `79.38% <68.42%> (-4.77%)` | `31.00 <0.00> (+14.00)` | :arrow_down: |
   | 
[...c/main/scala/org/apache/hudi/HoodieFileIndex.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUZpbGVJbmRleC5zY2FsYQ==)
 | `79.33% <79.33%> (ø)` | `24.00 <24.00> (?)` | |
   | 
[.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkU25hcHNob3RSZWxhdGlvbi5zY2FsYQ==)
 | `89.79% <87.50%> (+0.66%)` | `18.00 <1.00> (+1.00)` | |
   | 
[...cala/org/apache/hudi/HoodieBootstrapRelation.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUJvb3RzdHJhcFJlbGF0aW9uLnNjYWxh)
 | `89.01% <100.00%> (+1.51%)` | `18.00 <1.00> (+3.00)` | |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh)
 | `58.33% <100.00%> (+0.54%)` | `0.00 <0.00> (ø)` | |
   | 

[GitHub] [hudi] codecov-io commented on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


codecov-io commented on pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#issuecomment-806823048


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=h1) Report
   > Merging 
[#2721](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=desc) (fb529db) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc)
 (6e803e0) will **decrease** coverage by `42.32%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2721/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2721   +/-   ##
   
   - Coverage 51.72%   9.40%   -42.33% 
   + Complexity 3601  48 -3553 
   
 Files   476  54  -422 
 Lines 225951989-20606 
 Branches   2409 236 -2173 
   
   - Hits  11687 187-11500 
   + Misses 98891789 -8100 
   + Partials   1019  13 -1006 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.40% <ø> (-60.34%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[GitHub] [hudi] cdmikechen commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


cdmikechen commented on a change in pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#discussion_r601483903



##
File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh
##
@@ -49,6 +49,15 @@ fi
 HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'`
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON
 
+HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_CALCITE" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
+fi
+HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_LIBFB303" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303

Review comment:
   I had the same problem: Hive 3 needs these jars to establish the connection. 
The dependency packages required for the connection changed between Hive 2 and 
Hive 3.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cdmikechen commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


cdmikechen commented on a change in pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#discussion_r601483289



##
File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh
##
@@ -49,6 +49,15 @@ fi
 HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'`
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON
 
+HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_CALCITE" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
+fi
+HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_LIBFB303" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303

Review comment:
   I can explain. I had the same problem: Hudi lacks the dependency libs needed 
for the Hive 3 connection. The dependency packages required for the connection 
changed between Hive 2 and Hive 3.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cdmikechen commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


cdmikechen commented on a change in pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#discussion_r601476629



##
File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh
##
@@ -49,6 +49,15 @@ fi
 HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'`
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON
 
+HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_CALCITE" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
+fi
+HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_LIBFB303" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303

Review comment:
   I had the same problem: Hive 3 needs these jars to establish the connection. 
The dependency packages required for the connection changed between Hive 2 and 
Hive 3.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread GitBox


xiarixiaoyao edited a comment on pull request #2722:
URL: https://github.com/apache/hudi/pull/2722#issuecomment-806721657


   test step:
   before patch:
   step1:
   
   val df = spark.range(0, 10).toDF("keyid")
   .withColumn("col3", expr("keyid"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // create hoodie table hive_14b
   
merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   
   notice:  bulk_insert will produce 4 files in hoodie table
   
   step2:
   
val df = spark.range(9, 12).toDF("keyid")
   .withColumn("col3", expr("keyid"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // upsert table 
   
   merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
   
now :  we have four base files and one log file in hoodie table
   
   step3: 
   
   spark-sql/beeline: 
   
select count(col3) from hive_14b_rt;
   
   then the query failed.
   2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at org.apache.hudi.
 
hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at javax
 .security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
   
   
   after patch:
   spark-sql/hive-beeline 
select count(col3) from hive_14b_rt;
   +-+
   |   _c0   |
   +-+
   | 12  |
   +-+
   
   merge function:
   def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, 
tableName: String,
   tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
   hivePartitionExtract: String = 
"org.apache.hudi.hive.MultiPartKeysValueExtractor", op: String = "upsert"): 
Unit = {
   val mode = if (op.equals("bulk_insert")) {
   Overwrite
   } else {
   Append
   }
   df.write.format("hudi").
   option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
   option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
   option(PRECOMBINE_FIELD_OPT_KEY, "col3").
   option(RECORDKEY_FIELD_OPT_KEY, "keyid").
   option(PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2").
   option(DataSourceWriteOptions.OPERATION_OPT_KEY, op).
   option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
classOf[ComplexKeyGenerator].getName).
   option("hoodie.bulkinsert.shuffle.parallelism", par.toString).
   option("hoodie.metadata.enable", "false").
   option("hoodie.insert.shuffle.parallelism", par.toString).
   option("hoodie.upsert.shuffle.parallelism", par.toString).
   option("hoodie.delete.shuffle.parallelism", par.toString).
   option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
   option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "p,p1,p2").
   

[GitHub] [hudi] xiarixiaoyao commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2722:
URL: https://github.com/apache/hudi/pull/2722#issuecomment-806722454


   @garyli1019 could you pls help me review this pr, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2722:
URL: https://github.com/apache/hudi/pull/2722#issuecomment-806721657


   test step:
   before patch:
   step1:
   
   val df = spark.range(0, 10).toDF("keyid")
   .withColumn("col3", expr("keyid"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // create hoodie table hive_14b
   
merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   
   notice:  bulk_insert will produce 4 files in hoodie table
   
   step2:
   
val df = spark.range(9, 12).toDF("keyid")
   .withColumn("col3", expr("keyid"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // upsert table 
   
   merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
   
now :  we have four base files and one log file in hoodie table
   
   step3: 
   
   spark-sql/beeline: 
   
select count(col3) from hive_14b_rt;
   
   then the query failed.
   2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at org.apache.hudi.
 
hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at javax
 .security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
   
   
   after patch:
   spark-sql/hive-beeline 
select count(col3) from hive_14b_rt;
   +-+
   |   _c0   |
   +-+
   | 12  |
   +-+
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread GitBox


codecov-io commented on pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#issuecomment-806719923


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=h1) Report
   > Merging 
[#2716](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=desc) (9ffe931) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc)
 (6e803e0) will **decrease** coverage by `42.32%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2716/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2716   +/-   ##
   
   - Coverage 51.72%   9.40%   -42.33% 
   + Complexity 3601  48 -3553 
   
 Files   476  54  -422 
 Lines 225951989-20606 
 Branches   2409 236 -2173 
   
   - Hits  11687 187-11500 
   + Misses 98891789 -8100 
   + Partials   1019  13 -1006 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.40% <ø> (-60.34%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1722:
-
Labels: pull-request-available  (was: )

> hive beeline/spark-sql  query specified field on mor table occur NPE
> 
>
> Key: HUDI-1722
> URL: https://issues.apache.org/jira/browse/HUDI-1722
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration, Spark Integration
>Affects Versions: 0.7.0
> Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> HUDI-892 introduced this problem.
>  That PR skips adding projection columns when the hoodieRealtimeSplit contains 
> no log files, but it does not account for multiple getRecordReaders sharing 
> the same jobConf.
>  Consider the following scenario with four getRecordReaders:
>  reader1 (its hoodieRealtimeSplit contains no log files)
>  reader2 (its hoodieRealtimeSplit contains log files)
>  reader3 (its hoodieRealtimeSplit contains log files)
>  reader4 (its hoodieRealtimeSplit contains no log files)
> If reader1 runs first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the 
> jobConf is set to true, and no additional Hudi projection columns are added to 
> the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
> When reader2 runs later, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP is 
> already true, so no additional Hudi projection columns are added for it either 
> (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
>  As a result, _hoodie_record_key is missing and the merge step throws an 
> exception.
> 2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics 
> report from attempt_1615883368881_0038_m_00_0: Error: 
> java.lang.NullPointerException at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
>  at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
>  at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) 
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at 
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
> org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
>  
> Obviously, this is an intermittent problem: if reader2 runs first, the 
> additional Hudi projection columns are added to the jobConf and the query 
> succeeds.
> Spark SQL can avoid this problem by setting spark.hadoop.cloneConf=true, which 
> is not recommended in Spark; Hive has no way to avoid it.
>  test step:
> step1:
> val df = spark.range(0, 10).toDF("keyid")
>  .withColumn("col3", expr("keyid"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String] ("sb1", "rz")))
>  .withColumn("a2", lit(Array[String] ("sb1", "rz")))
> // create hoodie table hive_14b
>  merge(df, 4, "default", "hive_14b", 
> 

[GitHub] [hudi] xiarixiaoyao opened a new pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread GitBox


xiarixiaoyao opened a new pull request #2722:
URL: https://github.com/apache/hudi/pull/2722


   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Fix the bug introduced by HUDI-892
   HUDI-892 introduced this problem.
   That PR skips adding projection columns when the hoodieRealtimeSplit contains 
no log files, but it does not account for multiple getRecordReaders sharing the 
same jobConf.
   Consider the following scenario with four getRecordReaders:
   reader1 (its hoodieRealtimeSplit contains no log files)
   reader2 (its hoodieRealtimeSplit contains log files)
   reader3 (its hoodieRealtimeSplit contains log files)
   reader4 (its hoodieRealtimeSplit contains no log files)
   
   If reader1 runs first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the 
jobConf is set to true, and no additional Hudi projection columns are added to 
the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
   
   When reader2 runs later, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP is 
already true, so no additional Hudi projection columns are added for it either 
(see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
   As a result, _hoodie_record_key is missing and the merge step throws an 
exception.
   
   2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at org.apache.hudi.
 
hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at javax
 .security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
   

   Obviously, this is an intermittent problem: if reader2 runs first, the 
additional Hudi projection columns are added to the jobConf and the query 
succeeds.
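
   For illustration only, here is a tiny self-contained toy of that ordering hazard. Plain java.util.Properties stands in for the shared JobConf and the property names are made up for the sketch; it is not Hudi code, it just shows how a shared "already handled" flag set by a split without log files starves a later split that does have log files of the _hoodie_record_key projection.

```java
import java.util.Properties;

// Toy illustration of the shared-configuration issue described above.
// All names are illustrative, not Hudi's actual classes or property keys.
public class SharedConfSketch {
  static final String READ_COLUMNS_SET = "hoodie.read.columns.set";
  static final String READ_COLUMN_NAMES = "read.column.names";

  static void addProjection(Properties conf, boolean splitHasLogFiles) {
    if ("true".equals(conf.getProperty(READ_COLUMNS_SET))) {
      return; // a previous reader already "handled" projections on this shared conf
    }
    if (splitHasLogFiles) {
      // Only splits with log files need the meta column for the merge step.
      conf.setProperty(READ_COLUMN_NAMES,
          conf.getProperty(READ_COLUMN_NAMES, "") + ",_hoodie_record_key");
    }
    conf.setProperty(READ_COLUMNS_SET, "true");
  }

  public static void main(String[] args) {
    Properties sharedConf = new Properties();
    sharedConf.setProperty(READ_COLUMN_NAMES, "col3");

    addProjection(sharedConf, false); // reader1: split with no log files runs first
    addProjection(sharedConf, true);  // reader2: split with log files runs later

    // _hoodie_record_key was never added, so the merge step would NPE as in the trace.
    System.out.println(sharedConf.getProperty(READ_COLUMN_NAMES)); // prints: col3
  }
}
```

   Running it prints only "col3": the meta column is never projected for the split that actually needs it, which is the intermittent NPE described above.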
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   
   This pull request is already covered by existing tests
   
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


danny0405 commented on a change in pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#discussion_r601445623



##
File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh
##
@@ -49,6 +49,15 @@ fi
 HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'`
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON
 
+HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_CALCITE" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
+fi
+HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_LIBFB303" ]; then
+HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303

Review comment:
   Can you explain why we need these 2 kinds of jars ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1722:
---
Description: 
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

 

Obviously, this is an occasional problem。 if reader2 run first, hoodie 
additional projection columns will be added to jobConf and in this case the 
query will be ok

sparksql can avoid this problem by set  spark.hadoop.cloneConf=true which is 
not recommended in spark, however hive has no way to avoid this problem。

 test step:

step1:

val df = spark.range(0, 10).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String] ("sb1", "rz")))
 .withColumn("a2", lit(Array[String] ("sb1", "rz")))

// create hoodie table hive_14b

 merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

notice:  bulk_insert will produce 4 files in hoodie table

step2:

 val df = spark.range(9, 12).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String] ("sb1", "rz")))
 .withColumn("a2", lit(Array[String] ("sb1", "rz")))

// upsert table 

merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

 now :  we have four base files and one log file in hoodie table

step3: 

spark-sql/beeline: 

 select count(col3) from hive_14b_rt;

then the query failed.

  was:
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 

[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1722:
---
Description: 
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
 Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) 
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) 
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) 
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)

Obviously, this is an occasional problem。 if reader2 run first, hoodie 
additional projection columns will be added to jobConf and in this case the 
query will be ok

sparksql can avoid this problem by set  spark.hadoop.cloneConf=true which is 
not recommended in spark, however hive has no way to avoid this problem。

 

 

 

  was:
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
 Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 

[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1722:
---
Description: 
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
 Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) 
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) 
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) 
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)

Obviously, this is an occasional problem。 if reader2 run first, hoodie 
additional projection columns will be added to jobConf and in this case the 
query will be ok

sparksql can avoid this problem by set  spark.hadoop.cloneConf=true which is 
not recommended in spark, however hive has no way to avoid this problem。

 

  was:
HUDI-892 introduce this problem。
this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
Consider the following questions:
we have four getRecordReaders: 
reader1(its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3(its hoodieRealtimeSplit contains log files)
reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 

[jira] [Created] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1722:
--

 Summary: hive beeline/spark-sql  query specified field on mor 
table occur NPE
 Key: HUDI-1722
 URL: https://issues.apache.org/jira/browse/HUDI-1722
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration, Spark Integration
Affects Versions: 0.7.0
 Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


HUDI-892 introduced this problem.
That PR skips adding projection columns when the hoodieRealtimeSplit contains 
no log files, but it does not account for multiple getRecordReaders sharing 
the same jobConf.
Consider the following scenario with four getRecordReaders:
reader1 (its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3 (its hoodieRealtimeSplit contains log files)
reader4 (its hoodieRealtimeSplit contains no log files)

If reader1 runs first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the 
jobConf is set to true, and no additional Hudi projection columns are added to 
the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).

When reader2 runs later, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP is 
already true, so no additional Hudi projection columns are added for it either 
(see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).
As a result, _hoodie_record_key is missing and the merge step throws an 
exception.

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) 
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) 
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) 
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)


Obviously, this is an intermittent problem: if reader2 runs first, the 
additional Hudi projection columns are added to the jobConf and the query 
succeeds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xiarixiaoyao commented on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#issuecomment-806584838


   @garyli1019  could you help me review this pr, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2720:
URL: https://github.com/apache/hudi/pull/2720#issuecomment-806584615


   @garyli1019  could you help me review this pr, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#issuecomment-806584382


   @garyli1019  could you help me review this pr, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


xiarixiaoyao edited a comment on pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#issuecomment-806582989


   test step:
   
   before patch:
   
   step1:
   
   val df = spark.range(0, 100).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // bulk_insert 100w row (keyid from 0 to 100)
   
   merge(df, 4, "default", "hive_9b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   
   step2:
   
   val df = spark.range(0, 90).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // delete 90w row (keyid from 0 to 90)
   
   delete(df, 4, "default", "hive_9b")
   
   step3:
   
   query on beeline/spark-sql :  select count(col3)  from hive_9b_rt
   2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, 
RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:10382021-03-25 15:33:29,029 | INFO 
 | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | 
Operator.java:10382021-03-25 15:33:29,029 | ERROR | main | Error running child 
: java.lang.StackOverflowError at 
org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83) at 
org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
 at 
org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
 at
  
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
   
   After patch:
select count(col3) from hive_9b_rt

[GitHub] [hudi] xiarixiaoyao commented on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#issuecomment-806582989


   test step:
   
   before patch:
   
   step1:
   
   val df = spark.range(0, 100).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // bulk_insert 100w row (keyid from 0 to 100)
   
   merge(df, 4, "default", "hive_9b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   
   step2:
   
   val df = spark.range(0, 90).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // delete 90w row (keyid from 0 to 90)
   
   delete(df, 4, "default", "hive_9b")
   
   step3:
   
   query on beeline/spark-sql :  select count(col3)  from hive_9b_rt
   2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, 
RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:10382021-03-25 15:33:29,029 | INFO 
 | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | 
Operator.java:10382021-03-25 15:33:29,029 | ERROR | main | Error running child 
: java.lang.StackOverflowError at 
org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83) at 
org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
 at 
org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
 at
  
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
   
   After patch:
select count(col3) from hive_9b_rt
+-+
  

[jira] [Commented] (HUDI-1721) run_sync_tool support hive3.1.2 on hadoop3.1.4

2021-03-25 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308606#comment-17308606
 ] 

Danny Chen commented on HUDI-1721:
--

Thanks for the contribution, would review soon ~

> run_sync_tool support hive3.1.2 on hadoop3.1.4 
> ---
>
> Key: HUDI-1721
> URL: https://issues.apache.org/jira/browse/HUDI-1721
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.8.0
>Reporter: 谢波
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/hudi/issues/2717]
> run_sync_tool support hive3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1720:
-
Labels: pull-request-available  (was: )

> when query incr view of  mor table which has many delete records use 
> sparksql/hive-beeline,  StackOverflowError
> ---
>
> Key: HUDI-1720
> URL: https://issues.apache.org/jira/browse/HUDI-1720
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration, Spark Integration
>Affects Versions: 0.7.0, 0.8.0
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> now RealtimeCompactedRecordReader.next deals with delete records by
> recursion, see:
> [https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]
> however, when a log file contains many delete records, the recursive logic of
> RealtimeCompactedRecordReader.next leads to a StackOverflowError
> test step:
> step1:
> val df = spark.range(0, 100).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // bulk_insert 100w row (keyid from 0 to 100)
> merge(df, 4, "default", "hive_9b", 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> step2:
> val df = spark.range(0, 90).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // delete 90w row (keyid from 0 to 90)
> delete(df, 4, "default", "hive_9b")
> step3:
> query on beeline/spark-sql :  select count(col3)  from hive_9b_rt
> 2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, 
> RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:10382021-03-25 15:33:29,029 | 
> INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | 
> Operator.java:10382021-03-25 15:33:29,029 | ERROR | main | Error running 
> child : java.lang.StackOverflowError at 
> org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83) 
> at 
> org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
>  at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
>  at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>  at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
>  at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at 
> 

[GitHub] [hudi] xiarixiaoyao opened a new pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread GitBox


xiarixiaoyao opened a new pull request #2721:
URL: https://github.com/apache/hudi/pull/2721


   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Fix the StackOverflowError on [HUDI-1720] 
   
   Currently RealtimeCompactedRecordReader.next deals with delete records by
   recursion, see:
   https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106
   However, when a log file contains many delete records, the recursive logic of
   RealtimeCompactedRecordReader.next leads to a StackOverflowError.
   
   We can use a loop instead of recursion; a rough sketch follows.
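   
   A minimal sketch of the loop-based rewrite (simplified, with an assumed helper for the delete check; not the exact patch):
   
   // Instead of recursing into next() for every deleted record, keep pulling rows
   // from the base parquet reader until a non-deleted, merged record is produced.
   @Override
   public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOException {
     while (this.parquetReader.next(aVoid, arrayWritable)) {
       // ... merge the row with the matching log records for its key ...
       if (mergedRecordIsDeleted(arrayWritable)) { // assumed helper for the delete check
         continue; // skip the deleted row and read the next one -- no recursion, no deep stack
       }
       return true; // hand one merged row back to the caller
     }
     return false; // base file exhausted
   }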
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   Manually verified the change by running a job locally
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2720:
URL: https://github.com/apache/hudi/pull/2720#issuecomment-806562011


   test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
   before patch:
   test step:
   
   step1:
   
   val df = spark.range(0, 1).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(6))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // create hudi table which has three  level partitions p,p1,p2
   
   merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   

   
   step2:
   
   val df = spark.range(0, 1).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String] ("sb1", "rz")))
   .withColumn("a2", lit(Array[String] ("sb1", "rz")))
   
   // upsert current table
   
   merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
   
   hive beeline:
   
   set 
hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
   
   set hoodie.hive_8b.consume.mode=INCREMENTAL;
   
   set hoodie.hive_8b.consume.max.commits=3;
   
   set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp
   is earlier than the earliest commit, so we can query all commits
   
   select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;
   query result:
   ++-+-++ 
   | p  | p1  | p2  | keyid  | 
   ++-+-++ 
   | 0  | 0   | 6   | 0  | 
   | 0  | 0   | 6   | 1  | 
   | 0  | 0   | 6   | 2  | 
   | 0  | 0   | 6   | 3  | 
   | 0  | 0   | 6   | 4  | 
   | 0  | 0   | 6   | 4  | 
   | 0  | 0   | 6   | 0  | 
   | 0  | 0   | 6   | 3  | 
   | 0  | 0   | 6   | 2  | 
   | 0  | 0   | 6   | 1  | 
   ++-+-++
   this result is wrong: in the second step we inserted new data into table
   partition p2=7, yet the query result contains no rows with p2=7; every row has p2=6
   
   After patch:
++-+-++
| p  | p1  | p2  | keyid  |
++-+-++
| 0  | 0   | 6   | 0  |
| 0  | 0   | 6   | 1  |
| 0  | 0   | 6   | 2  |
| 0  | 0   | 6   | 3  |
| 0  | 0   | 6   | 4  |
| 0  | 0   | 7   | 4  |
| 0  | 0   | 7   | 0  |
| 0  | 0   | 7   | 3  |
| 0  | 0   | 7   | 2  |
| 0  | 0   | 7   | 1  |
++-+-++
   this result is correct.  
   
   merge function:
   def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, tableName: String,
             tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
             hivePartitionExtract: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
             op: String = "upsert"): Unit = {
     val mode = if (op.equals("bulk_insert")) {
       Overwrite
     } else {
       Append
     }
     df.write.format("hudi").
       option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
       option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
       option(PRECOMBINE_FIELD_OPT_KEY, "col3").
       option(RECORDKEY_FIELD_OPT_KEY, "keyid").
       option(PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2").
       option(DataSourceWriteOptions.OPERATION_OPT_KEY, op).
       option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[ComplexKeyGenerator].getName).
       option("hoodie.bulkinsert.shuffle.parallelism", par.toString).
       option("hoodie.metadata.enable", "false").
       option("hoodie.insert.shuffle.parallelism", par.toString).
       option("hoodie.upsert.shuffle.parallelism", par.toString).
       option("hoodie.delete.shuffle.parallelism", par.toString).
       option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
       option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "p,p1,p2").
       option("hoodie.datasource.hive_sync.support_timestamp", "true").
       option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
       option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
       option(HIVE_USE_JDBC_OPT_KEY, "false").
       option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, db).
       option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName).
       option(TABLE_NAME, tableName).
       mode(mode).
       save(s"/tmp/${db}/${tableName}")
   }
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1719:
-
Labels: pull-request-available  (was: )

> hive on spark/mr,Incremental query of the mor table, the partition field is 
> incorrect
> -
>
> Key: HUDI-1719
> URL: https://issues.apache.org/jira/browse/HUDI-1719
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.7.0, 0.8.0
> Environment: spark2.4.5, hadoop 3.1.1, hive 3.1.1
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Hudi now uses HoodieCombineHiveInputFormat for incremental queries of the
> mor table.
> When there are small files in different partitions,
> HoodieCombineHiveInputFormat combines those small file readers.
> HoodieCombineHiveInputFormat builds the partition field based on the first file
> reader it holds, yet it also holds other file readers that come from different
> partitions.
> When switching readers, we should update the ioctx.
> test env:
> spark2.4.5, hadoop 3.1.1, hive 3.1.1
> test step:
> step1:
> val df = spark.range(0, 1).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(6))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // create hudi table which has three  level partitions p,p1,p2
> merge(df, 4, "default", "hive_8b", 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
>  
> step2:
> val df = spark.range(0, 1).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // upsert current table
> merge(df, 4, "default", "hive_8b", 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
> hive beeline:
> set 
> hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
> set hoodie.hive_8b.consume.mode=INCREMENTAL;
> set hoodie.hive_8b.consume.max.commits=3;
> set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp
> is earlier than the earliest commit, so we can query all commits
> select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
> `_hoodie_commit_time`>'20210325141300' and `keyid` < 5;
> query result:
> +-+++-+
> |p|p1|p2|keyid|
> +-+++-+
> |0|0|6|0|
> |0|0|6|1|
> |0|0|6|2|
> |0|0|6|3|
> |0|0|6|4|
> |0|0|6|4|
> |0|0|6|0|
> |0|0|6|3|
> |0|0|6|2|
> |0|0|6|1|
> +-+++-+
> this result is wrong: in the second step we inserted new data into partition
> p2=7, yet the query result contains no rows with p2=7; every row has p2=6
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xiarixiaoyao opened a new pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread GitBox


xiarixiaoyao opened a new pull request #2720:
URL: https://github.com/apache/hudi/pull/2720


   
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Fix the bug that HoodieCombineHiveInputFormat cannot build the correct
   partition field.
   
   Hudi now uses HoodieCombineHiveInputFormat for incremental queries of
   the mor table.
   When there are small files in different partitions (file1 from partition
   p=6, file2 from partition p=7), HoodieCombineHiveInputFormat combines
   those small file readers. HoodieCombineHiveInputFormat builds the partition field
   based on the first file reader (say the file1 reader, from partition p=6),
   yet it also holds other file readers (the file2 reader, from partition p=7)
   that come from different partitions.
   When switching readers, we should update the ioctx:
   
https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieCombineRealtimeRecordReader.java#L73
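   
   A rough sketch of the idea (the method and field names here are assumptions for illustration, not the actual patch): whenever the combine reader advances to its next child reader, the partition information carried in the ioctx/jobConf has to be refreshed from that child's split before any rows are returned.
   
   // Hypothetical sketch inside HoodieCombineRealtimeRecordReader -- illustrative only.
   private boolean switchToNextReader() throws IOException {
     if (recordReaders.isEmpty()) {
       return false; // no more child readers to combine
     }
     currentRecordReader = recordReaders.remove(0);
     // Re-derive the partition values (e.g. p=0/p1=0/p2=7) from the split backing the
     // new reader, so the ioctx no longer carries the first split's p2=6.
     refreshIOContextFromSplit(currentRecordReader); // assumed helper
     return true;
   }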
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1721) run_sync_tool support hive3.1.2 on hadoop3.1.4

2021-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1721:
-
Labels: pull-request-available  (was: )

> run_sync_tool support hive3.1.2 on hadoop3.1.4 
> ---
>
> Key: HUDI-1721
> URL: https://issues.apache.org/jira/browse/HUDI-1721
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.8.0
>Reporter: 谢波
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/hudi/issues/2717]
> run_sync_tool support hive3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] MyLanPangzi opened a new pull request #2719: [HUDI-1721] run_sync_tool support hive3

2021-03-25 Thread GitBox


MyLanPangzi opened a new pull request #2719:
URL: https://github.com/apache/hudi/pull/2719


   run_sync_tool support hive3
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
run_sync_tool support hive3
   
   ## Brief change log
   
 - *Modify run_sync_tool.sh add HIVE_CALCITE and HIVE_LIBFB303 variable*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1721) run_sync_tool support hive3.1.2 on hadoop3.1.4

2021-03-25 Thread Jira
谢波 created HUDI-1721:


 Summary: run_sync_tool support hive3.1.2 on hadoop3.1.4 
 Key: HUDI-1721
 URL: https://issues.apache.org/jira/browse/HUDI-1721
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.8.0
Reporter: 谢波


[https://github.com/apache/hudi/issues/2717]

run_sync_tool support hive3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1719:
---
Description: 
Hudi now uses HoodieCombineHiveInputFormat for incremental queries of the
mor table.

When there are small files in different partitions,
HoodieCombineHiveInputFormat combines those small file readers.
HoodieCombineHiveInputFormat builds the partition field based on the first file
reader it holds, yet it also holds other file readers that come from different
partitions.

When switching readers, we should update the ioctx.

test env:

spark2.4.5, hadoop 3.1.1, hive 3.1.1

test step:

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// create hudi table which has three  level partitions p,p1,p2

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

 

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert current table

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is
earlier than the earliest commit, so we can query all commits

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;

query result:

+-+++-+
|p|p1|p2|keyid|

+-+++-+
|0|0|6|0|
|0|0|6|1|
|0|0|6|2|
|0|0|6|3|
|0|0|6|4|
|0|0|6|4|
|0|0|6|0|
|0|0|6|3|
|0|0|6|2|
|0|0|6|1|

+-+++-+

this result is wrong: in the second step we inserted new data into partition
p2=7, yet the query result contains no rows with p2=7; every row has p2=6

 

 

  was:
now hudi use HoodieCombineHiveInputFormat to achieve Incremental query of the 
mor table.

when we have some small files in different partitions, 
HoodieCombineHiveInputFormat  will combine those small file readers.   
HoodieCombineHiveInputFormat  build partition field base on  the first file 
reader in it, however now HoodieCombineHiveInputFormat  holds other file 
readers which come from different partitions.

When switching readers, we should  update ioctx

test env:

spark2.4.5, hadoop 3.1.1, hive 3.1.1

test step:

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// create hudi table which has three  level partitions p,p1,p2

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

 

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert current table

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;

query result:

++-+-++ 
| p | p1 | p2 | keyid | 
++-+-++ 
| 0 | 0 | 6 | 0 | 
| 0 | 0 | 6 | 1 | 
| 0 | 0 | 6 | 2 | 
| 0 | 0 | 6 | 3 | 
| 0 | 0 | 6 | 4 | 
| 0 | 0 | 6 | 4 | 
| 0 | 0 | 6 | 0 | 
| 0 | 0 | 6 | 3 | 
| 0 | 0 | 6 | 2 | 
| 0 | 0 | 6 | 1 | 
++-+-++

this result is wrong, since the second step we insert new data in table which 
p2=7, however in the query result we cannot find p2=7, all p2= 6

 

 


> hive on spark/mr,Incremental query of the mor table, the partition field is 
> incorrect
> -
>
> Key: HUDI-1719
> URL: https://issues.apache.org/jira/browse/HUDI-1719
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive 

[jira] [Updated] (HUDI-1495) Bump Flink version to 1.12.0

2021-03-25 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-1495:
-
Summary: Bump Flink version to 1.12.0  (was: Upgrade Flink version to 
1.12.0)

> Bump Flink version to 1.12.0
> 
>
> Key: HUDI-1495
> URL: https://issues.apache.org/jira/browse/HUDI-1495
> Project: Apache Hudi
>  Issue Type: Task
>  Components: newbie
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: easyfix, pull-request-available
> Fix For: 0.9.0
>
>
> Apache Flink 1.12.0 has been released; upgrade the version to 1.12.0 in
> order to adapt to the new Flink interfaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 opened a new pull request #2718: [HUDI-1495] Bump Flink version to 1.12.2

2021-03-25 Thread GitBox


danny0405 opened a new pull request #2718:
URL: https://github.com/apache/hudi/pull/2718


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] MyLanPangzi opened a new issue #2717: [SUPPORT] run_sync_tool support hive3.1.2 on hadoop3.1.4

2021-03-25 Thread GitBox


MyLanPangzi opened a new issue #2717:
URL: https://github.com/apache/hudi/issues/2717


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   **Can we modify run_sync_tool.sh?**
   
   **I have a Hudi table t1 on Hadoop, and I want to import it into Hive.
   When I execute the import shell script, I get an error:**
   
   ./run_sync_tool.sh  \
   --user hive \
   --pass '' \
   --jdbc-url jdbc:hive2:\/\/yh001:1 \
   --partitioned-by partition \
   --base-path hdfs://yh001:9820/hudi/t1 \
   --database default \
   --table t1
   
   2021-03-25 18:28:27,011 INFO  [main] hive.HoodieHiveClient 
(HoodieHiveClient.java:(87)) - Creating hive connection 
jdbc:hive2://yh001:1
   2021-03-25 18:28:27,283 INFO  [main] hive.HoodieHiveClient 
(HoodieHiveClient.java:createHiveConnection(437)) - Successfully established 
Hive connection to  jdbc:hive2://yh001:1
   Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/calcite/plan/RelOptRule
   at 
org.apache.hudi.hive.HoodieHiveClient.(HoodieHiveClient.java:91)
   at org.apache.hudi.hive.HiveSyncTool.(HiveSyncTool.java:69)
   at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:249)
   Caused by: java.lang.ClassNotFoundException: 
org.apache.calcite.plan.RelOptRule
   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
   ... 3 more
   
   **After updating the shell script:**
   HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
   
   
HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON:$HIVE_CALCITE
   
   **I get another error:**
   
   1-03-25 18:30:51,256 INFO  [main] hive.HoodieHiveClient 
(HoodieHiveClient.java:createHiveConnection(437)) - Successfully established 
Hive connection to  jdbc:hive2://yh001:1
   Exception in thread "main" java.lang.NoClassDefFoundError: 
com/facebook/fb303/FacebookService$Iface
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
   at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
   at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:348)
   at 
org.apache.hadoop.hive.metastore.utils.JavaUtils.getClass(JavaUtils.java:52)
   at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:146)
   at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:119)
   at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:4299)
   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4367)
   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4347)
   at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4603)
   at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:291)
   at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:274)
   at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:435)
   at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:375)
   at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:355)
   at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:331)
   at 
org.apache.hudi.hive.HoodieHiveClient.(HoodieHiveClient.java:91)
   at org.apache.hudi.hive.HiveSyncTool.(HiveSyncTool.java:69)
   at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:249)
   Caused by: java.lang.ClassNotFoundException: 
com.facebook.fb303.FacebookService$Iface
   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
   at 

[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread GitBox


xiarixiaoyao edited a comment on pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#issuecomment-806535452


   test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
   before patch:
   step1:
   
   val df = spark.range(0, 1).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(6))
   .withColumn("a1", lit( Array[String] ("sb1", "rz") ) )
   .withColumn("a2", lit( Array[String] ("sb1", "rz") ) )
   
   // bulk_insert df,   partition by p,p1,p2
   
 merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   
   step2:
   
   val df = spark.range(0, 1).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit( Array[String] ("sb1", "rz") ) )
   .withColumn("a2", lit( Array[String] ("sb1", "rz") ) )
   
   // upsert table hive8b
   
   merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
   
   step3:
   
   start hive beeline:
   
   set 
hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
   
   set hoodie.hive_8b.consume.mode=INCREMENTAL;
   
   set hoodie.hive_8b.consume.max.commits=3;
   
   set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this
   timestamp is earlier than the earliest commit, so we can query all commits
   
   select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300'
   
   2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
**p,p1,p2** 2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
**p,p1,p2** at 
org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at 
org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
 at 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
 at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98) at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
 at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
   
   after patch:
++-+-++
| p  | p1  | p2  | keyid  |
++-+-++
| 0  | 0   | 6   | 5068   |
| 0  | 0   | 6   | 6058   |
| 0  | 0   | 6   | 823|
| 0  | 0   | 6   | 5031   |
| 0  | 0   | 6   | 4445   |
| 0  | 0   | 6   | 5082   |
   
   
   merge function:
 def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, 
tableName: String,
   tableType: String = 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
   hivePartitionExtract: String = 
"org.apache.hudi.hive.MultiPartKeysValueExtractor", op: String = "upsert"): 
Unit = {
   val mode = if (op.equals("bulk_insert")) {
 Overwrite
   } else {
 Append
   }
   df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
 option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
   

[GitHub] [hudi] xiarixiaoyao commented on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#issuecomment-806535452


   test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
   before patch:
   step1:
   
   val df = spark.range(0, 1).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(6))
   .withColumn("a1", lit(Array[String]("sb1", "rz")))
   .withColumn("a2", lit(Array[String]("sb1", "rz")))
   
   // bulk_insert df,   partition by p,p1,p2
   
 merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
   
   step2:
   
   val df = spark.range(0, 1).toDF("keyid")
   .withColumn("col3", expr("keyid + 1000"))
   .withColumn("p", lit(0))
   .withColumn("p1", lit(0))
   .withColumn("p2", lit(7))
   .withColumn("a1", lit(Array[String]("sb1", "rz")))
   .withColumn("a2", lit(Array[String]("sb1", "rz")))
   
   // upsert table hive8b
   
   merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
   
   step3:
   
   start hive beeline:
   
   set 
hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
   
   set hoodie.hive_8b.consume.mode=INCREMENTAL;
   
   set hoodie.hive_8b.consume.max.commits=3;
   
   set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this
   timestamp is earlier than the earliest commit, so we can query all commits
   
   select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300'
   
   2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
**p,p1,p2** 2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
**p,p1,p2** at 
org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at 
org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
 at 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
 at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98) at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
 at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
   
   after patch:
++-+-++
| p  | p1  | p2  | keyid  |
++-+-++
| 0  | 0   | 6   | 5068   |
| 0  | 0   | 6   | 6058   |
| 0  | 0   | 6   | 823|
| 0  | 0   | 6   | 5031   |
| 0  | 0   | 6   | 4445   |
| 0  | 0   | 6   | 5082   |
   
   
   merge function:
 def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, 
tableName: String,
   tableType: String = 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
   hivePartitionExtract: String = 
"org.apache.hudi.hive.MultiPartKeysValueExtractor", op: String = "upsert"): 
Unit = {
   val mode = if (op.equals("bulk_insert")) {
 Overwrite
   } else {
 Append
   }
   df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
 option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
 

[jira] [Updated] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1718:
-
Labels: pull-request-available  (was: )

> when query incr view of  mor table which has Multi level partitions, the 
> query failed
> -
>
> Key: HUDI-1718
> URL: https://issues.apache.org/jira/browse/HUDI-1718
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.7.0, 0.8.0
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> HoodieCombineHiveInputFormat uses "," to join multiple partition fields, however Hive
> uses "/" to join them. There is a gap between the two, so
> HoodieCombineHiveInputFormat's logic needs to be modified.
>  test env
> spark2.4.5, hadoop 3.1.1, hive 3.1.1
>  
> step1:
> val df = spark.range(0, 1).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(6))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // bulk_insert df,   partition by p,p1,p2
>   merge(df, 4, "default", "hive_8b", 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> step2:
> val df = spark.range(0, 1).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // upsert table hive8b
> merge(df, 4, "default", "hive_8b", 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
> step3:
> start hive beeline:
> set 
> hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
> set hoodie.hive_8b.consume.mode=INCREMENTAL;
> set hoodie.hive_8b.consume.max.commits=3;
> set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this timestamp
> is earlier than the earliest commit, so we can query all commits
> select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
> `_hoodie_commit_time`>'20210325141300'
>  
> 2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
> report from attempt_1615883368881_0028_m_00_3: Error: 
> org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
> p,p1,p2 2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
> Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
> org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
> p,p1,p2 at 
> org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
> org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
> org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at 
> org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at 
> org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
>  at 
> org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
>  at 
> org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
>  at 
> org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
>  at 
> org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
>  at 
> org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
>  at 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
>  at 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
>  at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) 
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
> org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xiarixiaoyao opened a new pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread GitBox


xiarixiaoyao opened a new pull request #2716:
URL: https://github.com/apache/hudi/pull/2716


   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   HoodieCombineHiveInputFormat uses "," to join multiple partitions, however Hive 
uses "/" to join multiple partitions. There is a gap between the two, so this PR 
modifies HoodieCombineHiveInputFormat's logic.
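
   As a rough illustration of the root cause only (this is not the patch itself; the 
class and names below are made up for the repro), Avro rejects the comma-joined 
partition string when it is later treated as a single field name, while the 
individual names pass validation once the string is split on the delimiter that was 
actually used:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaParseException;

// Minimal repro of the failure mode: the partition names are joined with ","
// but later treated as one field name, which Avro's name validation rejects.
public class PartitionFieldNameRepro {
  public static void main(String[] args) {
    String joined = String.join(",", "p", "p1", "p2"); // "p,p1,p2"

    try {
      SchemaBuilder.record("row").fields()
          .name(joined).type().stringType().noDefault() // one field named "p,p1,p2"
          .endRecord();
    } catch (SchemaParseException e) {
      System.out.println("rejected by Avro: " + e.getMessage()); // Illegal character in: p,p1,p2
    }

    // Splitting on the delimiter that was actually used yields valid field names.
    SchemaBuilder.FieldAssembler<Schema> fields = SchemaBuilder.record("row").fields();
    for (String partition : joined.split(",")) {
      fields = fields.name(partition).type().stringType().noDefault();
    }
    System.out.println(fields.endRecord().toString(true));
  }
}
```

   The actual change in this PR adjusts how HoodieCombineHiveInputFormat joins and 
splits the partition fields; the snippet above only reproduces the 
SchemaParseException quoted in the JIRA description.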
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   This pull request is already covered by existing tests
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1718:
---
Description: 
HoodieCombineHiveInputFormat uses "," to join multiple partitions, however Hive uses 
"/" to join multiple partitions. There is a gap between the two, so 
HoodieCombineHiveInputFormat's logic needs to be modified.
 test env

spark2.4.5, hadoop 3.1.1, hive 3.1.1

 

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert df,   partition by p,p1,p2

  merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert table hive8b

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step3:

start hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this timestamp 
is smaller than the earliest commit, so we can query all commits

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300'

 

2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
p,p1,p2 2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
p,p1,p2 at 
org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at 
org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
 at 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
 at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

 

  was:
HoodieCombineHiveInputFormat uses "," to join multiple partitions, however Hive uses 
"/" to join multiple partitions. There is a gap between the two, so 
HoodieCombineHiveInputFormat's logic needs to be modified.
test env

spark2.4.5, hadoop 3.1.1, hive 3.1.1

 

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert df,   partition by p,p1,p2

  merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))

[jira] [Created] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1720:
--

 Summary: when query incr view of  mor table which has many delete 
records use sparksql/hive-beeline,  StackOverflowError
 Key: HUDI-1720
 URL: https://issues.apache.org/jira/browse/HUDI-1720
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration, Spark Integration
Affects Versions: 0.7.0, 0.8.0
Reporter: tao meng
 Fix For: 0.9.0


 Currently, RealtimeCompactedRecordReader.next handles delete records by recursion, 
see:

[https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]

However, when the log file contains many delete records, this recursive logic in 
RealtimeCompactedRecordReader.next leads to a StackOverflowError.
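
The usual way out is to turn the tail recursion into a loop. A self-contained toy 
sketch of that pattern (plain Java collections stand in for the parquet reader and 
the merged log records; this is not the Hudi class itself):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: baseRecords stands in for the parquet reader, deleteKeys for the keys
// whose merged log record is a delete. Skipping deletes in a while loop keeps the
// stack depth constant, unlike one recursive next() call per deleted record.
public class IterativeSkipReader {
  private final Iterator<Map.Entry<String, String>> baseRecords;
  private final Set<String> deleteKeys;

  IterativeSkipReader(List<Map.Entry<String, String>> base, Set<String> deletes) {
    this.baseRecords = base.iterator();
    this.deleteKeys = deletes;
  }

  /** Returns the next non-deleted record value, or null when exhausted. */
  String next() {
    while (baseRecords.hasNext()) {
      Map.Entry<String, String> record = baseRecords.next();
      if (deleteKeys.contains(record.getKey())) {
        continue; // deleted in the log: keep looping instead of recursing
      }
      return record.getValue();
    }
    return null;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> base = List.of(
        Map.entry("k1", "v1"), Map.entry("k2", "v2"), Map.entry("k3", "v3"));
    IterativeSkipReader reader = new IterativeSkipReader(base, Set.of("k1", "k2"));
    System.out.println(reader.next()); // v3
    System.out.println(reader.next()); // null
  }
}
```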

test step:

step1:

val df = spark.range(0, 100).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert 100w row (keyid from 0 to 100)

merge(df, 4, "default", "hive_9b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 90).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// delete 90w row (keyid from 0 to 90)

delete(df, 4, "default", "hive_9b")

step3:

query on beeline/spark-sql :  select count(col3)  from hive_9b_rt

2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, 
RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:10382021-03-25 15:33:29,029 | INFO 
 | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | 
Operator.java:10382021-03-25 15:33:29,029 | ERROR | main | Error running child 
: java.lang.StackOverflowError at 
org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83) at 
org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
 at 
org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
 at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601290387



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -193,6 +195,14 @@ public String getPreCombineField() {
 return props.getProperty(HOODIE_TABLE_PRECOMBINE_FIELD);
   }
 
+  public Option getPartitionColumns() {

Review comment:
   For previous versions of Hudi, there are no partition columns stored 
in the hoodie.properties, so it returns an `Option#empty` for this case. This lets us 
distinguish between a table with no partition columns and a table that simply did not 
store the partition columns. Other properties like `getBootstrapBasePath` also use the Option.
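
   A compilable sketch of that distinction, with java.util.Optional and a plain 
Properties object standing in for Hudi's Option and HoodieTableConfig (the property 
key is the one proposed in this PR; everything else is illustrative only):

```java
import java.util.Optional;
import java.util.Properties;

public class PartitionColumnsOptionDemo {
  static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns";

  // Optional.empty()          -> property never written (table from an older Hudi release)
  // Optional.of(empty array)  -> property written, table simply has no partition columns
  static Optional<String[]> getPartitionColumns(Properties props) {
    if (!props.containsKey(HOODIE_TABLE_PARTITION_COLUMNS)) {
      return Optional.empty();
    }
    String value = props.getProperty(HOODIE_TABLE_PARTITION_COLUMNS, "");
    return Optional.of(value.isEmpty() ? new String[0] : value.split(","));
  }

  public static void main(String[] args) {
    Properties oldTable = new Properties();               // property absent
    Properties newTable = new Properties();
    newTable.setProperty(HOODIE_TABLE_PARTITION_COLUMNS, "p,p1,p2");

    System.out.println(getPartitionColumns(oldTable).isPresent());              // false
    System.out.println(getPartitionColumns(newTable).map(c -> c.length).get()); // 3
  }
}
```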




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1719:
--

 Summary: hive on spark/mr,Incremental query of the mor table, the 
partition field is incorrect
 Key: HUDI-1719
 URL: https://issues.apache.org/jira/browse/HUDI-1719
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.7.0, 0.8.0
 Environment: spark2.4.5, hadoop 3.1.1, hive 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


Hudi currently uses HoodieCombineHiveInputFormat to implement incremental queries of 
the MOR table.

When there are small files in different partitions, HoodieCombineHiveInputFormat 
combines those small-file readers. It builds the partition field based on the first 
file reader it holds, even though the other file readers it holds may come from 
different partitions.

When switching readers, we should update the IO context, as illustrated below.
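
A toy sketch of the fix direction (the paths and the fixed partition depth below are 
made up; this is not the Hive/Hudi API): when the combined reader switches to the 
next underlying file, re-derive the partition values from that file's path instead of 
reusing the values computed for the first file.

```java
import java.util.Arrays;
import java.util.List;

// Two files from different partitions (.../p/p1/p2/<file>), as in the repro below.
public class CombineReaderPartitionDemo {
  public static void main(String[] args) {
    List<String> combinedFiles = Arrays.asList(
        "/warehouse/hive_8b/0/0/6/part-00000.parquet",
        "/warehouse/hive_8b/0/0/7/part-00001.parquet");

    for (String file : combinedFiles) {
      // Re-compute the partition values for *this* file on every reader switch.
      String[] segments = file.split("/");
      String[] partitionValues =
          Arrays.copyOfRange(segments, segments.length - 4, segments.length - 1);
      System.out.println(file + " -> partition " + String.join("/", partitionValues));
    }
  }
}
```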

test env:

spark2.4.5, hadoop 3.1.1, hive 3.1.1

test step:

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// create hudi table which has three  level partitions p,p1,p2

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

 

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert current table

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;

query result:

+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 1      |
+----+-----+-----+--------+

This result is wrong: in the second step we inserted new data with p2=7, yet the 
query result contains no rows with p2=7; every row shows p2=6.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1718:
--

 Summary: when query incr view of  mor table which has Multi level 
partitions, the query failed
 Key: HUDI-1718
 URL: https://issues.apache.org/jira/browse/HUDI-1718
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.7.0, 0.8.0
Reporter: tao meng
 Fix For: 0.9.0


HoodieCombineHiveInputFormat uses "," to join multiple partitions, however Hive uses 
"/" to join multiple partitions. There is a gap between the two, so 
HoodieCombineHiveInputFormat's logic needs to be modified.
test env

spark2.4.5, hadoop 3.1.1, hive 3.1.1

 

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert df,   partition by p,p1,p2

  merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert table hive8b

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step3:

start hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this timestamp 
is smaller than the earliest commit, so we can query all commits

2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
p,p1,p2 2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | 
Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: 
p,p1,p2 at 
org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at 
org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at 
org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
 at 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
 at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300'

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806486974


   > @pengzhiwei2018 Thank you for your reply.
   > spark3 has its own analysis logic and optimization logic, and those logical 
rules are continually enhanced.
   > for example:
   > 
[apache/spark@d99135b](https://github.com/apache/spark/commit/d99135b66ab4ebab69973800990e12845c4ffeaa)
   > 
[apache/spark@1ad3432](https://github.com/apache/spark/commit/1ad343238cb82a79e51ae9d46ae704bc482ff437)
   > 
[apache/spark@4d56d43](https://github.com/apache/spark/commit/4d56d438386049b5f481ec83b69e3c89807be201)
   > and so on.
   > I think it would be better to respect spark3's logical plans.
   
   Thanks for the input @xiarixiaoyao . I will take a look at this~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


xiarixiaoyao commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806485150


   @pengzhiwei2018 Thank you for your reply.
   spark3 has its own analysis logic and optimization logic, and those logical rules 
are continually enhanced.
   for example:
   
https://github.com/apache/spark/commit/d99135b66ab4ebab69973800990e12845c4ffeaa
   
https://github.com/apache/spark/commit/1ad343238cb82a79e51ae9d46ae704bc482ff437
   
https://github.com/apache/spark/commit/4d56d438386049b5f481ec83b69e3c89807be201
   and so on.
 I think it would be better to respect spark3's logical plans.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601225876



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -79,39 +81,52 @@ class DefaultSource extends RelationProvider
 val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths
 
 val fs = FSUtils.getFs(allPaths.head, 
sqlContext.sparkContext.hadoopConfiguration)
-val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
-
-val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
+// Use the HoodieFileIndex only if the 'path' has specified with no "*" 
contains.
+// And READ_PATHS_OPT_KEY has not specified.
+// Or else we use the original way to read hoodie table.
+val useHoodieFileIndex = path.isDefined && !path.get.contains("*") &&

Review comment:
   Hi @vinothchandar , currently a user querying the hoodie table must specify 
some stars in the path. If the path does not contain stars, the old way of querying 
the hoodie table may not work (it throws an exception or returns no data). So we 
must use the `HoodieFileIndex` for this case by default.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] aditiwari01 edited a comment on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution

2021-03-25 Thread GitBox


aditiwari01 edited a comment on issue #2675:
URL: https://github.com/apache/hudi/issues/2675#issuecomment-806450032


   > * do you use the RowBasedSchemaProvider and hence can't explicitly provide 
schema? If you were to use your own schema registry, you might as well provide 
an updated schema to hudi while writing.
   > * got it. would be nice to have some contribution. I can help review the 
patch.
   >   In the mean time, I will give it a try schema evolution on my end with 
some local set up.
   
   1. Yes, RowBasedSchemaProvider is used in creating the GenericRecord. I am not sure 
how to pass a row schema with defaults to the Spark writer (if that is possible at 
all). I also can't find any way to pass a custom SchemaProvider class in the 
writer options.
   
   2. The idea for the patch is to update the Avro schema with a default of null for 
all nullable fields (a rough sketch of that idea follows below). If you wish, you can 
refer to the function **getAvroSchemaWithDefs** in my local branch: 
https://github.com/aditiwari01/hudi/commit/68a7ad20cc570f88940f2481d0ae19986254f1d8#diff-21ccee08cebace9de801e104f356bf4333c017d5d5c0cac4e3de63c7861f3c13R58
   
   This is a bit unstructured though. I will try to have a formal PR with tests and 
all by next week.
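
   A rough, self-contained sketch of that idea using Avro's SchemaBuilder (this is 
not the getAvroSchemaWithDefs code from the branch above; the record and field names 
are made up):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Newly added nullable columns get an explicit null default, so rows written with
// the older schema (which lack the column) can still be read after evolution.
public class NullDefaultSchemaDemo {
  public static void main(String[] args) {
    Schema evolved = SchemaBuilder.record("trip").fields()
        .requiredString("uuid")
        .name("new_col").type().unionOf().nullType().and().stringType().endUnion()
        .nullDefault()                                    // default: null
        .endRecord();
    System.out.println(evolved.toString(true));
  }
}
```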


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601225876



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -79,39 +81,52 @@ class DefaultSource extends RelationProvider
 val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths
 
 val fs = FSUtils.getFs(allPaths.head, 
sqlContext.sparkContext.hadoopConfiguration)
-val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
-
-val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
+// Use the HoodieFileIndex only if the 'path' has specified with no "*" 
contains.
+// And READ_PATHS_OPT_KEY has not specified.
+// Or else we use the original way to read hoodie table.
+val useHoodieFileIndex = path.isDefined && !path.get.contains("*") &&

Review comment:
   Hi @vinothchandar , currently a user querying the hoodie table must specify 
some "*" in the path. If the path does not contain "*", the old way of querying 
the hoodie table may not work (it throws an exception or returns no data). So we 
must use the `HoodieFileIndex` for this case by default.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601220471



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter {
 val archiveLogFolder = parameters.getOrElse(
   HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived")
 
+val partitionColumns = 
parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null)

Review comment:
   Hi @vinothchandar , we need to get the partition schema and pass it to the 
`HadoopFsRelation`, just like this:
   
   HadoopFsRelation(
   fileIndex,
   fileIndex.partitionSchema,
   fileIndex.dataSchema,
   bucketSpec = None,
   fileFormat = new ParquetFileFormat,
   optParams)(sqlContext.sparkSession)
   
   From the key generator alone, it is hard to construct the partition schema, so we 
need the partition columns.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601220471



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter {
 val archiveLogFolder = parameters.getOrElse(
   HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived")
 
+val partitionColumns = 
parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null)

Review comment:
   Hi @vinothchandar , we need to get the partition schema and pass it to the 
`HadoopFsRelation`, like this:
   
   HadoopFsRelation(
   fileIndex,
   fileIndex.partitionSchema,
   fileIndex.dataSchema,
   bucketSpec = None,
   fileFormat = new ParquetFileFormat,
   optParams)(sqlContext.sparkSession)
   
   From the key generator alone, it is hard to construct the partition schema, so we 
need the partition columns.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601214645



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -57,6 +58,7 @@
   public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type";
   public static final String HOODIE_TABLE_VERSION_PROP_NAME = 
"hoodie.table.version";
   public static final String HOODIE_TABLE_PRECOMBINE_FIELD = 
"hoodie.table.precombine.field";
+  public static final String HOODIE_TABLE_PARTITION_COLUMNS = 
"hoodie.table.partition.columns";

Review comment:
   Hi @vinothchandar , we need the partition schema for partition pruning in Spark 
SQL, so the partition columns are needed for that. Otherwise we cannot derive the 
partition schema.
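
   For illustration, a minimal sketch of why the persisted column names are enough to 
build the partition schema used for pruning (the table and column names are made up):

```java
import java.util.Arrays;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// The partition schema is just the table schema restricted to the persisted
// partition column names; without those names it cannot be derived.
public class PartitionSchemaDemo {
  public static void main(String[] args) {
    StructType tableSchema = new StructType()
        .add("keyid", DataTypes.LongType)
        .add("col3", DataTypes.LongType)
        .add("dt", DataTypes.StringType)
        .add("hh", DataTypes.StringType);

    String[] partitionColumns = {"dt", "hh"};   // what hoodie.properties would store

    StructType partitionSchema = new StructType(
        Arrays.stream(partitionColumns)
            .map(name -> tableSchema.fields()[tableSchema.fieldIndex(name)])
            .toArray(StructField[]::new));

    System.out.println(partitionSchema.treeString());
  }
}
```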




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601213171



##
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
##
@@ -276,6 +276,13 @@ public static void processFiles(FileSystem fs, String 
basePathStr, Function

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601212827



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
@@ -0,0 +1,349 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, 
SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, FileStatusCache, 
NoopCache, PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+import scala.collection.mutable
+
+/**
+  * A File Index which support partition prune for hoodie snapshot and 
read-optimized
+  * query.
+  * Main steps to get the file list for query:
+  * 1、Load all files and partition values from the table path.
+  * 2、Do the partition prune by the partition filter condition.
+  *
+  * There are 3 cases for this:
+  * 1、If the partition columns size is equal to the actually partition path 
level, we
+  * read it as partitioned table.(e.g partition column is "dt", the partition 
path is "2021-03-10")
+  *
+  * 2、If the partition columns size is not equal to the partition path level, 
but the partition
+  * column size is "1" (e.g. partition column is "dt", but the partition path 
is "2021/03/10"
+  * who'es directory level is 3).We can still read it as a partitioned table. 
We will mapping the
+  * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt").
+  *
+  * 3、Else the the partition columns size is not equal to the partition 
directory level and the
+  * size is great than "1" (e.g. partition column is "dt,hh", the partition 
path is "2021/03/10/12")
+  * , we read it as a None Partitioned table because we cannot know how to 
mapping the partition
+  * path with the partition columns in this case.
+  */
+case class HoodieFileIndex(
+ spark: SparkSession,
+ metaClient: HoodieTableMetaClient,
+ schemaSpec: Option[StructType],
+ options: Map[String, String],
+ @transient fileStatusCache: FileStatusCache = NoopCache)
+  extends FileIndex with Logging {
+
+  private val basePath = metaClient.getBasePath
+
+  @transient private val queryPath = new Path(options.getOrElse("path", 
"'path' option required"))
+  /**
+* Get the schema of the table.
+*/
+  lazy val schema: StructType = schemaSpec.getOrElse({
+val schemaUtil = new TableSchemaResolver(metaClient)
+SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema)
+  .dataType.asInstanceOf[StructType]
+  })
+
+  /**
+* Get the partition schema from the hoodie.properties.
+*/
+  private lazy val _partitionSchemaFromProperties: StructType = {
+val tableConfig = metaClient.getTableConfig
+val partitionColumns = tableConfig.getPartitionColumns
+val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap
+
+if (partitionColumns.isPresent) {
+  val partitionFields = partitionColumns.get().map(column =>
+nameFieldMap.getOrElse(column, throw new 
IllegalArgumentException(s"Cannot find column: '" +
+ 

[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806465222


   > Let me see how/if we can simplify the inputSchema vs writeSchema thing.
   > 
   > I went over the PR now. LGTM at a high level.
   > Few questions though
   > 
   > * I see we are introducing some antlr parsing and inject a custom parse 
for spark 2.x. Is this done for backwards compat with Spark 2 and will be 
eventually removed?
   > * Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 
2 syntax different. Can you comment on how we are approaching all this.
   > * Have you done any production testing of this PR?
   > 
   > cc @kwondw could you also please chime in. We would like to land something 
basic and iterate and get this out for 0.9.0 next month.
   
   Thanks for your review @vinothchandar !
   
   > I see we are introducing some antlr parsing and inject a custom parse for 
spark 2.x. Is this done for backwards compat with Spark 2 and will be 
eventually removed?
   
   Yes, it is for backwards compatibility with Spark2 and will eventually be removed 
for spark3 if there are no other syntax extensions for spark3.
   
   > Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 2 
syntax different. Can you comment on how we are approaching all this.
   
   Yes, I reused the extended syntax (MERGE) from spark 3, so the syntax is the same 
between spark2 and spark3. 
   For spark3, Spark can recognize the MERGE/DELETE syntax and parse it to a 
LogicalPlan. For spark2, our extended sql parser will also parse it to the 
same LogicalPlan. After the parser, the LogicalPlan goes through the same 
rules (in `HoodieAnalysis`) to be resolved and rewritten to a Hoodie Command. The 
Hoodie Command translates the logical plan to the hoodie api calls, and the Hoodie 
Commands are shared between spark2 & spark 3.
   So except for the sql parser for spark2, the other parts are shared between spark2 & 
spark3.
   
   > Have you done any production testing of this PR?
   
   Yes, I have tested it in Aliyun's EMR cluster, and more test cases will be 
done this week.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806471861


   > @pengzhiwei2018 it's a great work, thanks. MERGE/DELETE 
expressions (MergeIntoTable, MergeAction...) in your pr are copied from 
v2Commands.scala in spark 3.0, which will lead to class conflicts if we implement 
sql support for spark3 later. could you shade those keywords
   > 
   > @vinothchandar
   > MERGE/DELETE keywords from Spark 3 are incompatible with this pr, however 
it's not a problem, we can introduce an extra module hudi-spark3-extensions to 
resolve those incompatibilities. i will put forward a new pr for spark3 in the next 
few days.
   
   Hi @xiarixiaoyao , thanks for the review!
   There already exists a `hudi-spark3` module in the hudi project, so I think 
we should not need to add a hudi-spark3-extensions module.
   The `MergeIntoTable` has the same package and class name as the one in 
spark3, so that both can use the same `HoodieAnalysis` rules for resolving and 
rewriting. I think it will not conflict with spark3 because we will not 
build the spark2 module together with spark3; only one `MergeIntoTable` 
will be packaged in the bundle jar.
   
   > MERGE/DELETE keywords from Spark 3 is incompatible with this pr
   
   I do not understand the incompatibility. They have the same syntax and logical 
plan between spark2 and spark3. Can you help me out?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806465222


   > Let me see how/if we can simplify the inputSchema vs writeSchema thing.
   > 
   > I went over the PR now. LGTM at a high level.
   > Few questions though
   > 
   > * I see we are introducing some antlr parsing and inject a custom parse 
for spark 2.x. Is this done for backwards compat with Spark 2 and will be 
eventually removed?
   > * Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 
2 syntax different. Can you comment on how we are approaching all this.
   > * Have you done any production testing of this PR?
   > 
   > cc @kwondw could you also please chime in. We would like to land something 
basic and iterate and get this out for 0.9.0 next month.
   
   Thanks for your review @vinothchandar !
   
   > I see we are introducing some antlr parsing and inject a custom parse for 
spark 2.x. Is this done for backwards compat with Spark 2 and will be 
eventually removed?
   
   Yes, it is for backwards compatibility with Spark2 and will eventually be removed 
for spark3 if there are no other syntax extensions for spark3.
   
   > Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 2 
syntax different. Can you comment on how we are approaching all this.
   
   Yes, I reused the extended syntax (MERGE) from spark 3, so the syntax is the same 
between spark2 and spark3. 
   For spark3, Spark can recognize the MERGE/DELETE syntax and parse it to a 
LogicalPlan. For spark2, our extended sql parser will also parse it to the 
same LogicalPlan. After the parser, the LogicalPlan goes through the same 
rules (in `HoodieAnalysis`) to be resolved and rewritten to a Hoodie Command. The 
Hoodie Command translates the logical plan to the hoodie api calls, and the Hoodie 
Commands are shared between spark2 & spark 3.
   So except for the sql parser for spark2, the other parts are shared between spark2 & 
spark3.
   
   > Have you done any production testing of this PR?
   
   Yes, I have tested it in the EMR cluster, and more test cases will be done this 
week.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601159174



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/HoodieBaseSqlTest.scala
##
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.io.File
+
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.util.Utils
+import org.scalactic.source
+import org.scalatest.{BeforeAndAfterAll, FunSuite, Tag}
+
+class HoodieBaseSqlTest extends FunSuite with BeforeAndAfterAll {

Review comment:
   Yes, good suggestions!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2710: [RFC-20][HUDI-648] Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2021-03-25 Thread GitBox


codecov-io edited a comment on pull request #2710:
URL: https://github.com/apache/hudi/pull/2710#issuecomment-806393474


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=h1) Report
   > Merging 
[#2710](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=desc) (f9e9a12) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/d7b18783bdd6edd6355ee68714982401d3321f86?el=desc)
 (d7b1878) will **decrease** coverage by `0.12%`.
   > The diff coverage is `6.25%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2710/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2710  +/-   ##
   
   - Coverage 51.76%   51.63%   -0.13% 
   - Complexity 3601 3603   +2 
   
 Files   476  478   +2 
 Lines 2258322641  +58 
 Branches   2408 2417   +9 
   
   + Hits  1168911691   +2 
   - Misses 9877 9928  +51 
   - Partials   1017 1022   +5 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.73% <2.22%> (-0.21%)` | `0.00 <1.00> (ø)` | |
   | hudiflink | `54.08% <ø> (-0.20%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.42% <66.66%> (-0.02%)` | `0.00 <0.00> (ø)` | |
   | hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `45.58% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...che/hudi/common/config/HoodieErrorTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVFcnJvclRhYmxlQ29uZmlnLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...on/model/OverwriteWithLatestAvroSchemaPayload.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvU2NoZW1hUGF5bG9hZC5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...va/org/apache/hudi/common/util/TablePathUtils.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvVGFibGVQYXRoVXRpbHMuamF2YQ==)
 | `59.52% <0.00%> (-8.05%)` | `17.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh)
 | `57.79% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/hadoop/HoodieROTablePathFilter.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZVJPVGFibGVQYXRoRmlsdGVyLmphdmE=)
 | `64.19% <66.66%> (-0.81%)` | `14.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `68.44% <100.00%> (+0.12%)` | `43.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==)
 | `72.72% <0.00%> (-5.33%)` | `11.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh)
 | `61.44% <0.00%> (-3.53%)` | `28.00% <0.00%> (ø%)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
   | ... and [2 
more](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree-more) | |
   



[GitHub] [hudi] vinothchandar commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


vinothchandar commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601183166



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.{HoodieMetadataConfig, 
SerializableConfiguration}
+import org.apache.hudi.common.engine.HoodieLocalEngineContext
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.HoodieBaseFile
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.{InternalRow, expressions}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.SchemaConverters
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, 
BoundReference, Expression, InterpretedPredicate}
+import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils}
+import org.apache.spark.sql.execution.datasources.{FileIndex, 
PartitionDirectory, PartitionUtils}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+  * A FileIndex which supports partition pruning for hoodie snapshot and read-optimized queries.
+  * Main steps to get the file list for a query:
+  * 1. Load all files and partition values from the table path.
+  * 2. Prune the partitions by the partition filter condition.
+  *
+  * There are 3 cases for this:
+  * 1. If the number of partition columns equals the actual partition path level, we
+  * read it as a partitioned table (e.g. the partition column is "dt", the partition path is "2021-03-10").
+  *
+  * 2. If the number of partition columns does not equal the partition path level, but there is
+  * exactly one partition column (e.g. the partition column is "dt" but the partition path is
+  * "2021/03/10", whose directory level is 3), we can still read it as a partitioned table. We map the
+  * partition path (e.g. 2021/03/10) to the single partition column (e.g. "dt").
+  *
+  * 3. Otherwise, if the number of partition columns does not equal the partition directory level and
+  * there is more than one partition column (e.g. the partition columns are "dt,hh", the partition
+  * path is "2021/03/10/12"), we read it as a non-partitioned table because we cannot know how to map
+  * the partition path to the partition columns in this case.
+  */
+case class HoodieFileIndex(
+ spark: SparkSession,
+ basePath: String,
+ schemaSpec: Option[StructType],
+ options: Map[String, String])
+  extends FileIndex with Logging {
+
+  @transient private val hadoopConf = spark.sessionState.newHadoopConf()
+  private lazy val metaClient = HoodieTableMetaClient
+.builder().setConf(hadoopConf).setBasePath(basePath).build()
+
+  @transient private val queryPath = new Path(options.getOrElse("path", 
"'path' option required"))
+  /**
+* Get the schema of the table.
+*/
+  lazy val schema: StructType = schemaSpec.getOrElse({
+val schemaUtil = new TableSchemaResolver(metaClient)
+SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema)
+  .dataType.asInstanceOf[StructType]
+  })
+
+  /**
+* Get the partition schema from the hoodie.properties.
+*/
+  private lazy val _partitionSchemaFromProperties: StructType = {
+val tableConfig = metaClient.getTableConfig
+val partitionColumns = tableConfig.getPartitionColumns
+val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap
+
+if (partitionColumns.isPresent) {
+  val partitionFields = partitionColumns.get().map(column =>
+nameFieldMap.getOrElse(column, throw new 
IllegalArgumentException(s"Cannot find column " +

[GitHub] [hudi] vinothchandar commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


vinothchandar commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r601170320



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -57,6 +58,7 @@
   public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type";
   public static final String HOODIE_TABLE_VERSION_PROP_NAME = 
"hoodie.table.version";
   public static final String HOODIE_TABLE_PRECOMBINE_FIELD = 
"hoodie.table.precombine.field";
+  public static final String HOODIE_TABLE_PARTITION_COLUMNS = 
"hoodie.table.partition.columns";

Review comment:
   I think we should persist the key generator class, and not the partition columns themselves? Let me think over this more. 

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -193,6 +195,14 @@ public String getPreCombineField() {
 return props.getProperty(HOODIE_TABLE_PRECOMBINE_FIELD);
   }
 
+  public Option getPartitionColumns() {

Review comment:
   why not use an empty array to signify no partition columns?

##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -79,39 +81,52 @@ class DefaultSource extends RelationProvider
 val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths
 
 val fs = FSUtils.getFs(allPaths.head, 
sqlContext.sparkContext.hadoopConfiguration)
-val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
-
-val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
+// Use the HoodieFileIndex only if 'path' is specified and contains no "*",
+// and READ_PATHS_OPT_KEY is not specified.
+// Otherwise we fall back to the original way of reading the hoodie table.
+val useHoodieFileIndex = path.isDefined && !path.get.contains("*") &&

Review comment:
   can we guard this feature itself with a data source option as well? i.e. 
   
   `val useHoodieFileIndex = hoodieFileIndexEnabled && path.isDefined && 
!path.get.contains("*") && ..` 
   

##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter {
 val archiveLogFolder = parameters.getOrElse(
   HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived")
 
+val partitionColumns = 
parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null)

Review comment:
   the partition path can also be generated from a key generator, right? we 
need to think this case through more. 

##
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
##
@@ -276,6 +276,13 @@ public static void processFiles(FileSystem fs, String 
basePathStr, Function

[GitHub] [hudi] codecov-io edited a comment on pull request #2710: [RFC-20][HUDI-648] Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2021-03-25 Thread GitBox


codecov-io edited a comment on pull request #2710:
URL: https://github.com/apache/hudi/pull/2710#issuecomment-806393474


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=h1) Report
   > Merging 
[#2710](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=desc) (f9e9a12) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/d7b18783bdd6edd6355ee68714982401d3321f86?el=desc)
 (d7b1878) will **decrease** coverage by `0.12%`.
   > The diff coverage is `6.25%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2710/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2710  +/-   ##
   
   - Coverage 51.76%   51.63%   -0.13% 
   - Complexity 3601 3603   +2 
   
 Files   476  478   +2 
 Lines 2258322641  +58 
 Branches   2408 2417   +9 
   
   + Hits  1168911691   +2 
   - Misses 9877 9928  +51 
   - Partials   1017 1022   +5 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.73% <2.22%> (-0.21%)` | `0.00 <1.00> (ø)` | |
   | hudiflink | `54.08% <ø> (-0.20%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.42% <66.66%> (-0.02%)` | `0.00 <0.00> (ø)` | |
   | hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `45.58% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...che/hudi/common/config/HoodieErrorTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVFcnJvclRhYmxlQ29uZmlnLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...on/model/OverwriteWithLatestAvroSchemaPayload.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvU2NoZW1hUGF5bG9hZC5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...va/org/apache/hudi/common/util/TablePathUtils.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvVGFibGVQYXRoVXRpbHMuamF2YQ==)
 | `59.52% <0.00%> (-8.05%)` | `17.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh)
 | `57.79% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/hadoop/HoodieROTablePathFilter.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZVJPVGFibGVQYXRoRmlsdGVyLmphdmE=)
 | `64.19% <66.66%> (-0.81%)` | `14.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `68.44% <100.00%> (+0.12%)` | `43.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==)
 | `72.72% <0.00%> (-5.33%)` | `11.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh)
 | `61.44% <0.00%> (-3.53%)` | `28.00% <0.00%> (ø%)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | |
   | ... and [2 
more](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree-more) | |
   



[GitHub] [hudi] aditiwari01 edited a comment on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution

2021-03-25 Thread GitBox


aditiwari01 edited a comment on issue #2675:
URL: https://github.com/apache/hudi/issues/2675#issuecomment-806450032


   > * do you use the RowBasedSchemaProvider and hence can't explicitly provide 
schema? If you were to use your own schema registry, you might as well provide 
an updated schema to hudi while writing.
   > * got it. would be nice to have some contribution. I can help review the 
patch.
   >   In the mean time, I will give it a try schema evolution on my end with 
some local set up.
   
   1. Not sure how to pass a row schema with defaults to the spark writer (if at all it's possible).
   
   2. The idea for the patch is to update the avro schema with a default of null for all nullable fields. If you wish you can refer to the function 
**getAvroSchemaWithDefs** in my local branch: 
https://github.com/aditiwari01/hudi/commit/68a7ad20cc570f88940f2481d0ae19986254f1d8#diff-21ccee08cebace9de801e104f356bf4333c017d5d5c0cac4e3de63c7861f3c13R58
   
   This is a bit unstructured though. Will try to have a formal PR with tests and all by next week.
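   
   A minimal sketch of the idea in point 2 — rebuilding an Avro record schema so that every nullable field carries an explicit default of null. This is not the actual **getAvroSchemaWithDefs** from the branch above; the names are illustrative, and it assumes the usual `["null", type]` union ordering (null listed first), which Avro requires for a null default.
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;
   
   public class SchemaDefaultsSketch {
   
     // Rebuild the record so nullable fields get an explicit null default;
     // non-nullable fields are copied over unchanged (no default).
     public static Schema withNullDefaults(Schema record) {
       SchemaBuilder.FieldAssembler<Schema> fields =
           SchemaBuilder.record(record.getFullName()).fields();
       for (Schema.Field f : record.getFields()) {
         if (isNullable(f.schema())) {
           fields = fields.name(f.name()).type(f.schema()).withDefault(null);
         } else {
           fields = fields.name(f.name()).type(f.schema()).noDefault();
         }
       }
       return fields.endRecord();
     }
   
     // A field is nullable when its schema is a union containing the null type.
     private static boolean isNullable(Schema s) {
       return s.getType() == Schema.Type.UNION
           && s.getTypes().stream().anyMatch(t -> t.getType() == Schema.Type.NULL);
     }
   }
   ```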


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] aditiwari01 commented on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution

2021-03-25 Thread GitBox


aditiwari01 commented on issue #2675:
URL: https://github.com/apache/hudi/issues/2675#issuecomment-806450032


   > * do you use the RowBasedSchemaProvider and hence can't explicitly provide 
schema? If you were to use your own schema registry, you might as well provide 
an updated schema to hudi while writing.
   > * got it. would be nice to have some contribution. I can help review the 
patch.
   >   In the mean time, I will give it a try schema evolution on my end with 
some local set up.
   
   1. Not sure how to pass a row schema with defaults to the spark writer (if at all it's possible).
   
   2. The idea for the patch is to update the avro schema with a default of null for all nullable fields. If you wish you can refer to the function 
**getAvroSchemaWithDefs** in my local branch: 
https://github.com/aditiwari01/hudi/commit/68a7ad20cc570f88940f2481d0ae19986254f1d8#diff-21ccee08cebace9de801e104f356bf4333c017d5d5c0cac4e3de63c7861f3c13R58
   
   This is a bit unstructured though. Will try to have a formal PR with tests and all by next week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-806444132


   > Can we also handle DELETE in this PR itself? That way, we have some basic 
support for all of the major DMLs
   
   
   
   Yes, I will add the implementation of DELETE & UPDATE to the PR too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601154948



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##
@@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) {
   return this;
 }
 
+public PropertyBuilder setTableSchema(String tableSchema) {
+  this.tableSchema = tableSchema;
+  return this;
+}
+
+public PropertyBuilder setRowKeyFields(String rowKeyFields) {

Review comment:
   Hi @vinothchandar, I am afraid we cannot do this for sql. The common way to specify the primary key in sql is by the row key fields, just like most databases do. I think we should not provide a generator class for the user to specify the primary key in sql.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601160434



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSparkSessionExtension.scala
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import org.apache.spark.SPARK_VERSION
+import org.apache.spark.sql.SparkSessionExtensions
+import org.apache.spark.sql.hudi.analysis.HoodieAnalysis
+import org.apache.spark.sql.hudi.parser.HoodieSqlParser
+
+/**
+  * The Hoodie SparkSessionExtension for extending the syntax and add the 
rules.
+  */
+class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) {
+  override def apply(extensions: SparkSessionExtensions): Unit = {
+if (SPARK_VERSION.startsWith("2.")) {

Review comment:
   Hi @vinothchandar, yes, currently we only add the parser for spark2, because spark3 already supports the merge/delete syntax.
   Hi @xiarixiaoyao, thanks for your suggestion. I will review this for spark3.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601158842



##
File path: hudi-spark-datasource/hudi-spark2/src/main/antlr4/imports/SqlBase.g4
##
@@ -0,0 +1,1099 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * This file is an adaptation of Presto's 
presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4 grammar.

Review comment:
   Hi @vinothchandar, currently the sql parser is only for spark2; I add the sql parser in the `hudi-spark2` project only. For spark3, as you have mentioned, it already supports the delete/merge syntax, so we do not need to extend the sql parser.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601154948



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##
@@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) {
   return this;
 }
 
+public PropertyBuilder setTableSchema(String tableSchema) {
+  this.tableSchema = tableSchema;
+  return this;
+}
+
+public PropertyBuilder setRowKeyFields(String rowKeyFields) {

Review comment:
   Hi @vinothchandar, I am afraid we cannot do this for sql. The common way to specify the primary key in sql is by the row key fields. I think we should not provide a generator class for the user to specify the primary key in sql.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601149509



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/UuidKeyGenerator.java
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+
+/**
+ * A KeyGenerator which use the uuid as the record key.
+ */
+public class UuidKeyGenerator extends BuiltinKeyGenerator {

Review comment:
   This is used for `InsertInto` on a hudi table without a primary key. Currently, we must return a `RecordKey` for each record when writing to hudi. If the user has not specified a primary key in the `DDL`, I provide a default `UuidKeyGenerator` which generates a uuid for the record.
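   
   A rough sketch of that idea, assuming Hudi's `KeyGenerator` base API (`getKey(GenericRecord)` returning a `HoodieKey`); the actual `UuidKeyGenerator` in this PR extends `BuiltinKeyGenerator` and also derives the partition path from the configured partition fields, which is simplified away here.
   
   ```java
   import java.util.UUID;
   
   import org.apache.avro.generic.GenericRecord;
   import org.apache.hudi.common.config.TypedProperties;
   import org.apache.hudi.common.model.HoodieKey;
   import org.apache.hudi.keygen.KeyGenerator;
   
   // Sketch only: every record gets a fresh uuid as its record key, so inserts
   // into a table without a declared primary key never collide on the key.
   public class UuidKeyGeneratorSketch extends KeyGenerator {
   
     public UuidKeyGeneratorSketch(TypedProperties props) {
       super(props);
     }
   
     @Override
     public HoodieKey getKey(GenericRecord record) {
       // partition path handling omitted; the real generator resolves it from config
       return new HoodieKey(UUID.randomUUID().toString(), "");
     }
   }
   ```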




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601152600



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##
@@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) {
   return this;
 }
 
+public PropertyBuilder setTableSchema(String tableSchema) {

Review comment:
   The `tableSchema` saved to `hoodie.properties` is the initial schema when we create the table with the DDL. After that, the schema can still change with each commit; that schema is currently stored in the commit files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hddong opened a new pull request #1946: [HUDI-1176] Upgrade to log4j2

2021-03-25 Thread GitBox


hddong opened a new pull request #1946:
URL: https://github.com/apache/hudi/pull/1946


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Now, some modules (like cli, utilities) use log4j2, and they cannot correctly load the config file, failing with the error log: `ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console`.*
   
   ## Brief change log
   
   *(for example:)*
 - *Support log4j2 config*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601149509



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/UuidKeyGenerator.java
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+
+/**
+ * A KeyGenerator which use the uuid as the record key.
+ */
+public class UuidKeyGenerator extends BuiltinKeyGenerator {

Review comment:
   This is used for `InsertInto` on a hudi table without a primary key. Currently, we must return a `RecordKey` for each record when writing to hudi. If the user has not specified a primary key in the `DDL`, I provide a default `UuidKeyGenerator` which generates a uuid for the record.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601147731



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java
##
@@ -54,9 +53,18 @@
 public abstract class HoodieWriteHandle extends HoodieIOHandle {
 
   private static final Logger LOG = 
LogManager.getLogger(HoodieWriteHandle.class);
+  /**
+   * The input schema of the incoming dataframe.
+   */
+  protected final Schema inputSchema;

Review comment:
   For the case of MergeInto with a custom `HoodiePayload`, the input 
schema may be different from the write schema as the are transformation logical 
in the `payload`. So I distinguish between input and write schema.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

2021-03-25 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r601145258



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##
@@ -394,4 +405,36 @@ public IOType getIOType() {
   public HoodieBaseFile baseFileForMerge() {
 return baseFileToMerge;
   }
+
+  /**
+   * A special record returned by {@link HoodieRecordPayload}, which means
+   * {@link HoodieMergeHandle} should just skip this record.
+   */
+  private static class IgnoreRecord implements GenericRecord {

Review comment:
   For the `MergeInto` statement, if a record does not match the conditions, we should filter it out in the `ExpressionPayload`, e.g.
   
   Merge Into d0
   using ( select 1 as id, 'a1' as name ) s0
   on d0.id = s0.id
   when matched and s0.id % 2 = 0 then update set *
   
   The input `(1, 'a1')` will be filtered out by the match condition `id % 2 = 0`. In our implementation, we push all the conditions and update expressions into the `ExpressionPayload`, so the `Payload` must have the ability to filter records. Currently, returning `Option.empty` means "DELETE" for a record, so I add a special record `IgnoreRecord` to represent a filtered record.
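   
   To make the sentinel pattern concrete, here is a self-contained sketch using plain Avro and `java.util.Optional` — it is not the PR's `ExpressionPayload`/`IgnoreRecord` code, just an illustration of the three-way outcome (update / delete / ignore) described above.
   
   ```java
   import java.util.Optional;
   
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.IndexedRecord;
   
   public class IgnoreRecordSketch {
   
     // A field-less record type used purely as an in-memory marker value.
     private static final Schema IGNORE_SCHEMA =
         SchemaBuilder.record("HoodieIgnoreRecord").fields().endRecord();
     public static final IndexedRecord IGNORE = new GenericData.Record(IGNORE_SCHEMA);
   
     // Payload side: empty = delete, IGNORE = no clause matched (keep old record),
     // anything else = the updated record to write.
     public static Optional<IndexedRecord> combine(IndexedRecord newRecord,
                                                   boolean matchConditionHolds,
                                                   boolean isDelete) {
       if (!matchConditionHolds) {
         return Optional.of(IGNORE);
       }
       return isDelete ? Optional.empty() : Optional.of(newRecord);
     }
   
     // Merge-handle side: interpret the payload result for one existing record.
     public static String handle(Optional<IndexedRecord> combined) {
       if (!combined.isPresent()) {
         return "delete";                 // drop the old record
       } else if (combined.get() == IGNORE) {
         return "ignore";                 // rewrite the old record unchanged
       } else {
         return "update";                 // write the combined record
       }
     }
   }
   ```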
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hddong closed pull request #1946: [HUDI-1176] Upgrade to log4j2

2021-03-25 Thread GitBox


hddong closed pull request #1946:
URL: https://github.com/apache/hudi/pull/1946


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #2325: [HUDI-699]Fix CompactionCommand and add unit test for CompactionCommand

2021-03-25 Thread GitBox


wangxianghu commented on a change in pull request #2325:
URL: https://github.com/apache/hudi/pull/2325#discussion_r601092940



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
##
@@ -254,4 +288,19 @@ private int getArchivedFileSuffix(FileStatus f) {
   return 0;
 }
   }
+
+  @Override
+  public HoodieDefaultTimeline getCommitsAndCompactionTimeline() {
+// filter in-memory instants
+Set validActions = CollectionUtils.createSet(COMMIT_ACTION, 
DELTA_COMMIT_ACTION, COMPACTION_ACTION, REPLACE_COMMIT_ACTION);
+return new HoodieDefaultTimeline(getInstants().filter(i ->
+readCommits.keySet().contains(i.getTimestamp()))
+.filter(s -> validActions.contains(s.getAction())), details);
+  }
+
+  public HoodieArchivedTimeline filterArchivedCompactionInstant() {
+// filter INFLIGHT compaction instants
+return new HoodieArchivedTimeline(this.metaClient, getInstants().filter(i 
->
+i.isInflight() && 
i.getAction().equals(HoodieTimeline.COMPACTION_ACTION)));
+  }

Review comment:
   This method does not seem to be used anywhere?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
##
@@ -108,6 +120,14 @@ public void loadInstantDetailsInMemory(String startTs, 
String endTs) {
 loadInstants(startTs, endTs);
   }
 
+  public void loadCompactionDetailsInMemory(String startTs, String endTs) {
+// load compactionPlan
+loadInstants(new TimeRangeFilter(startTs, endTs), true, record ->
+
record.get(ACTION_TYPE_KEY).toString().equals(HoodieTimeline.COMPACTION_ACTION)
+&& 
HoodieInstant.State.INFLIGHT.toString().equals(record.get("actionState").toString())

Review comment:
   "actionState" -> ACTION_STATE

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
##
@@ -175,25 +174,26 @@ public String compactionShowArchived(
 HoodieArchivedTimeline archivedTimeline = client.getArchivedTimeline();
 HoodieInstant instant = new HoodieInstant(HoodieInstant.State.COMPLETED,
 HoodieTimeline.COMPACTION_ACTION, compactionInstantTime);
-String startTs = CommitUtil.addHours(compactionInstantTime, -1);
-String endTs = CommitUtil.addHours(compactionInstantTime, 1);

Review comment:
   if we want to load a `ts` equal to `compactionInstantTime`, can we add a new method that takes only one `instantTime` as the input param? WDYT
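   
   A sketch of the overload being suggested (hypothetical, not in the codebase), assuming the underlying `TimeRangeFilter` treats its bounds as inclusive:
   
   ```java
   // Load the details of exactly one archived instant by collapsing the range
   // to a single timestamp; delegates to the existing two-argument method.
   public void loadCompactionDetailsInMemory(String instantTime) {
     loadCompactionDetailsInMemory(instantTime, instantTime);
   }
   ```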

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
##
@@ -175,25 +174,26 @@ public String compactionShowArchived(
 HoodieArchivedTimeline archivedTimeline = client.getArchivedTimeline();
 HoodieInstant instant = new HoodieInstant(HoodieInstant.State.COMPLETED,
 HoodieTimeline.COMPACTION_ACTION, compactionInstantTime);
-String startTs = CommitUtil.addHours(compactionInstantTime, -1);
-String endTs = CommitUtil.addHours(compactionInstantTime, 1);

Review comment:
   why remove this?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineMetadataUtils.java
##
@@ -176,4 +179,13 @@ public static HoodieReplaceCommitMetadata 
deserializeHoodieReplaceMetadata(byte[
 ValidationUtils.checkArgument(fileReader.hasNext(), "Could not deserialize 
metadata of type " + clazz);
 return fileReader.next();
   }
+
+  public static  T 
deserializeAvroRecordMetadata(byte[] bytes, Schema schema, Class clazz)
+  throws IOException {
+return  deserializeAvroRecordMetadata(HoodieAvroUtils.bytesToAvro(bytes, 
schema), schema, clazz);
+  }
+
+  public static  T 
deserializeAvroRecordMetadata(Object object, Schema schema, Class clazz) {

Review comment:
   param `Class clazz` seems redundant

##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
##
@@ -175,25 +174,26 @@ public String compactionShowArchived(
 HoodieArchivedTimeline archivedTimeline = client.getArchivedTimeline();
 HoodieInstant instant = new HoodieInstant(HoodieInstant.State.COMPLETED,
 HoodieTimeline.COMPACTION_ACTION, compactionInstantTime);
-String startTs = CommitUtil.addHours(compactionInstantTime, -1);
-String endTs = CommitUtil.addHours(compactionInstantTime, 1);
 try {
-  archivedTimeline.loadInstantDetailsInMemory(startTs, endTs);
-  HoodieCompactionPlan compactionPlan = 
TimelineMetadataUtils.deserializeCompactionPlan(
-  archivedTimeline.getInstantDetails(instant).get());
+  archivedTimeline.loadCompactionDetailsInMemory(compactionInstantTime, 
compactionInstantTime);
+  HoodieCompactionPlan compactionPlan = 
TimelineMetadataUtils.deserializeAvroRecordMetadata(
+  archivedTimeline.getInstantDetails(instant).get(), 
HoodieCompactionPlan.getClassSchema(),
+  HoodieCompactionPlan.class);
 

[GitHub] [hudi] prashantwason commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-25 Thread GitBox


prashantwason commented on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-806413911


   So to rephrase the description: this solves the case where the input data has a compatible field (int) to be written to the table (long field). Can this issue not be solved at the input record level by converting the "int" data into a "long" before writing into HUDI?
   
   hoodieTable.getTableSchema() always returns the "latest" schema, which is the schema used by the last HoodieWriteClient (saved into commit instants). So when the "int" based RDD is written, the table schema will no longer have a "long" field. When this table schema is used to read an older file in the table (merge during updatehandle), then the reading should fail, as a long (from parquet) cannot be converted to an int (from schema). This is actually a backward incompatible schema change and hence is not allowed by HUDI.
   
   @pengzhiwei2018 Can you add a test to verify my hypothesis? In your existing 
test in TestCOWDataSource, can you write a long to the table in the next write? 
Also, can you read all the data using the "int" schema, even the older records 
which contain a long?
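   
   The asymmetry described above can be checked with plain Avro (not Hudi-specific code): a file written with a `long` field cannot be resolved by an `int` reader schema, while the reverse promotion is allowed.
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;
   import org.apache.avro.SchemaCompatibility;
   
   public class IntLongCompatibilityCheck {
     public static void main(String[] args) {
       Schema intRecord = SchemaBuilder.record("rec").fields().requiredInt("value").endRecord();
       Schema longRecord = SchemaBuilder.record("rec").fields().requiredLong("value").endRecord();
   
       // reader=int, writer=long -> INCOMPATIBLE (the backwards-incompatible case)
       System.out.println(SchemaCompatibility
           .checkReaderWriterCompatibility(intRecord, longRecord).getType());
   
       // reader=long, writer=int -> COMPATIBLE (int promotes to long on read)
       System.out.println(SchemaCompatibility
           .checkReaderWriterCompatibility(longRecord, intRecord).getType());
     }
   }
   ```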
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



