[jira] [Closed] (HUDI-982) Make flink engine support MOR table
[ https://issues.apache.org/jira/browse/HUDI-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianghu Wang closed HUDI-982.
    Resolution: Duplicate

> Make flink engine support MOR table
> -----------------------------------
>
>          Key: HUDI-982
>          URL: https://issues.apache.org/jira/browse/HUDI-982
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: wangxianghu#1
>     Assignee: liujinhui
>     Priority: Major
>       Labels: pull-request-available
>
> Make flink engine support MOR table

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [hudi] codecov-io edited a comment on pull request #2718: [HUDI-1495] Bump Flink version to 1.12.2
codecov-io edited a comment on pull request #2718:
URL: https://github.com/apache/hudi/pull/2718#issuecomment-807923018

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=h1) Report
> Merging [#2718](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=desc) (f6f0f75) into [master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc) (6e803e0) will **increase** coverage by `0.02%`.
> The diff coverage is `32.69%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2718/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree)

```diff
@@            Coverage Diff             @@
##            master    #2718     +/-  ##
==========================================
+ Coverage    51.72%   51.74%   +0.02%
- Complexity    3601     3606       +5
==========================================
  Files          476      476
  Lines        22595    22611      +16
  Branches      2409     2410       +1
==========================================
+ Hits         11687    11701      +14
+ Misses        9889     9888       -1
- Partials      1019     1022       +3
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
| hudicommon | `50.94% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `54.18% <32.69%> (+0.09%)` | `0.00 <13.00> (ø)` | |
| hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisync | `45.58% <ø> (ø)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh) | `68.94% <0.00%> (-0.44%)` | `32.00 <0.00> (ø)` | |
| [...rg/apache/hudi/streamer/HoodieFlinkStreamerV2.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zdHJlYW1lci9Ib29kaWVGbGlua1N0cmVhbWVyVjIuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...in/java/org/apache/hudi/table/HoodieTableSink.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNpbmsuamF2YQ==) | `12.19% <2.77%> (-2.10%)` | `2.00 <1.00> (ø)` | |
| [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `60.97% <14.70%> (-0.48%)` | `25.00 <2.00> (-3.00)` | |
| [...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh) | `84.05% <76.92%> (-0.56%)` | `11.00 <4.00> (+4.00)` | :arrow_down: |
| [...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==) | `88.09% <94.11%> (+15.36%)` | `14.00 <6.00> (+3.00)` | |
| [...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh) | `49.53% <100.00%> (ø)` | `17.00 <0.00> (+1.00)` | |
| [...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==) | `79.68% <0.00%> (+1.56%)` | `26.00% <0.00%> (ø%)` | |

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] codecov-io commented on pull request #2718: [HUDI-1495] Bump Flink version to 1.12.2
codecov-io commented on pull request #2718:
URL: https://github.com/apache/hudi/pull/2718#issuecomment-807923018

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=h1) Report
> Merging [#2718](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=desc) (f6f0f75) into [master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc) (6e803e0) will **increase** coverage by `0.09%`.
> The diff coverage is `32.69%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2718/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree)

```diff
@@            Coverage Diff             @@
##            master    #2718     +/-  ##
==========================================
+ Coverage    51.72%   51.81%   +0.09%
+ Complexity    3601     3416     -185
==========================================
  Files          476      454      -22
  Lines        22595    21006    -1589
  Branches      2409     2255     -154
==========================================
- Hits         11687    10885     -802
+ Misses        9889     9174     -715
+ Partials      1019      947      -72
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
| hudicommon | `50.94% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
| hudiflink | `54.18% <32.69%> (+0.09%)` | `0.00 <13.00> (ø)` | |
| hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2718?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh) | `68.94% <0.00%> (-0.44%)` | `32.00 <0.00> (ø)` | |
| [...rg/apache/hudi/streamer/HoodieFlinkStreamerV2.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zdHJlYW1lci9Ib29kaWVGbGlua1N0cmVhbWVyVjIuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...in/java/org/apache/hudi/table/HoodieTableSink.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNpbmsuamF2YQ==) | `12.19% <2.77%> (-2.10%)` | `2.00 <1.00> (ø)` | |
| [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `60.97% <14.70%> (-0.48%)` | `25.00 <2.00> (-3.00)` | |
| [...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh) | `84.05% <76.92%> (-0.56%)` | `11.00 <4.00> (+4.00)` | :arrow_down: |
| [...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==) | `88.09% <94.11%> (+15.36%)` | `14.00 <6.00> (+3.00)` | |
| [...c/main/java/org/apache/hudi/util/StreamerUtil.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1N0cmVhbWVyVXRpbC5qYXZh) | `49.53% <100.00%> (ø)` | `17.00 <0.00> (+1.00)` | |
| [.../org/apache/hudi/hive/NonPartitionedExtractor.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvTm9uUGFydGl0aW9uZWRFeHRyYWN0b3IuamF2YQ==) | | | |
| [...apache/hudi/timeline/service/handlers/Handler.java](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvSGFuZGxlci5qYXZh) | | | |
| ... and [21 more](https://codecov.io/gh/apache/hudi/pull/2718/diff?src=pr=tree-more) | |
[GitHub] [hudi] pengzhiwei2018 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807917549

> LGTM now. I mostly have just minor comments now which you can address.
>
> Can you run these unit tests you added once with `-Pspark3` to make sure this is running seamlessly for Spark 3? The travis tests right now don't run the tests with Spark 3.

Thanks for your review @umehrot2! I will address these comments soon and test the build with `-Pspark3`.
[GitHub] [hudi] danny0405 commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3
danny0405 commented on a change in pull request #2719:
URL: https://github.com/apache/hudi/pull/2719#discussion_r601994200

## File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh

```diff
@@ -49,6 +49,15 @@ fi
 HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'`
 HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON
+HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_CALCITE" ]; then
+  HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
+fi
+HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'`
+if [ -n "$HIVE_LIBFB303" ]; then
+  HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303
```

Review comment: Is it possible to distinguish between Hive2 and Hive3 and use different library dependencies?
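Danny's question above comes down to "add a jar group to the classpath only when it is present", since Hive3 installs ship jars (calcite, libfb303) that many Hive2 installs do not. A minimal Java sketch of that pattern — the class name and directory layout are hypothetical, not part of the actual `run_sync_tool.sh`:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch: collect jars by filename prefix from a Hive lib directory and
// append optional groups (calcite, libfb303) only when they exist, mirroring
// the conditional `-n` checks added to run_sync_tool.sh.
public class HiveClasspathBuilder {

  // Returns absolute paths of jars in libDir whose names start with the prefix.
  static List<String> jarsWithPrefix(File libDir, String prefix) {
    File[] files = libDir.listFiles((d, name) -> name.startsWith(prefix) && name.endsWith(".jar"));
    List<String> out = new ArrayList<>();
    if (files != null) {
      for (File f : files) {
        out.add(f.getAbsolutePath());
      }
    }
    return out;
  }

  // Builds a ':'-joined classpath; optional groups contribute nothing if absent.
  public static String build(File libDir) {
    List<String> jars = new ArrayList<>();
    jars.addAll(jarsWithPrefix(libDir, "hive-metastore"));
    jars.addAll(jarsWithPrefix(libDir, "hive-exec"));
    // Optional in Hive2 layouts, present in Hive3 layouts:
    jars.addAll(jarsWithPrefix(libDir, "calcite-"));
    jars.addAll(jarsWithPrefix(libDir, "libfb303-"));
    return jars.stream().collect(Collectors.joining(":"));
  }
}
```

Because missing jar groups simply contribute nothing, the same code path works for both Hive versions without an explicit version check.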
[GitHub] [hudi] Xoln commented on a change in pull request #2520: [HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init
Xoln commented on a change in pull request #2520:
URL: https://github.com/apache/hudi/pull/2520#discussion_r601986858

## File path: hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java

```diff
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.List;
+
+/**
+ * No Op Bootstrap Index , which is a emtpy implement and not do anything.
```

Review comment: done

## File path: hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/NoOpBootstrapIndex.java

```diff
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.List;
+
+/**
+ * No Op Bootstrap Index , which is a emtpy implement and not do anything.
+ */
+public class NoOpBootstrapIndex extends BootstrapIndex {
+
+  public NoOpBootstrapIndex(HoodieTableMetaClient metaClient) {
+    super(metaClient);
+  }
+
+  @Override
+  public IndexReader createReader() {
+    return null;
```

Review comment: done
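The `NoOpBootstrapIndex` under review is an instance of the Null Object pattern: a do-nothing implementation that lets the file-system view skip bootstrap handling without null checks or conditional branches. A self-contained sketch — the interface here is deliberately simplified and is not Hudi's actual `BootstrapIndex` API:

```java
import java.util.Collections;
import java.util.List;

// Simplified Null Object pattern: the no-op index reports itself unused and
// returns empty results, so callers never need a special "no bootstrap" branch.
interface BootstrapIndex {
  boolean useIndex();
  List<String> getMappedFileIds(String partition);
}

class NoOpBootstrapIndex implements BootstrapIndex {
  @Override
  public boolean useIndex() {
    return false;
  }

  @Override
  public List<String> getMappedFileIds(String partition) {
    // Nothing is bootstrapped, so every partition maps to nothing.
    return Collections.emptyList();
  }
}
```

Returning empty collections rather than `null` is what makes the pattern safe: callers can iterate the result unconditionally.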
[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
xiarixiaoyao commented on a change in pull request #2721:
URL: https://github.com/apache/hudi/pull/2721#discussion_r601949535

## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java

```diff
@@ -95,15 +103,24 @@ public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOEx
     // TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the
     // deltaRecord may not be a full record and needs values of columns from the parquet
     Option rec;
-    if (usesCustomPayload) {
-      rec = deltaRecordMap.get(key).getData().getInsertValue(getWriterSchema());
-    } else {
-      rec = deltaRecordMap.get(key).getData().getInsertValue(getReaderSchema());
+    rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
+    // If the record is not present, this is a delete record using an empty payload so skip this base record
+    // and move to the next record
+    while (!rec.isPresent()) {
+      // if current parquet reader has no record, return false
+      if (!this.parquetReader.next(aVoid, arrayWritable)) {
```

Review comment: @garyli1019 thanks for your reply. No, we will not miss a record. When we call `this.parquetReader.next(aVoid, arrayWritable)` and it returns true, the parquet reader will fill the new record into `arrayWritable` automatically (see the parquet reader source code):

```java
public boolean next(final NullWritable key, final ArrayWritable value) throws IOException {
  if (value != null && arrValue.length == arrCurrent.length) {
    System.arraycopy(arrCurrent, 0, arrValue, 0, arrCurrent.length);
  } else {
    if (arrValue.length != arrCurrent.length) {
      throw new IOException("DeprecatedParquetHiveInput : size of object differs. Value"
          + " size : " + arrValue.length + ", Current Object size : " + arrCurrent.length);
    } else {
      throw new IOException("DeprecatedParquetHiveInput can not support RecordReaders that"
          + " don't return same key & value & value is null");
    }
  }
```
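The behavior discussed in this thread — an empty merged payload means a delete, so the reader advances the base-file reader instead of emitting the record — can be sketched with plain collections. Names here (`MergedReader`, `delta`) are illustrative and not Hudi's actual `RealtimeCompactedRecordReader`:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of the merge-on-read "skip deletes" loop: for each base-file key,
// the delta map may resolve to an empty payload (a delete); the reader then
// advances to the next base record instead of emitting it.
class MergedReader {
  private final Iterator<String> baseKeys;           // keys from the parquet base file
  private final Map<String, Optional<String>> delta; // key -> merged value; empty = delete

  MergedReader(List<String> baseKeys, Map<String, Optional<String>> delta) {
    this.baseKeys = baseKeys.iterator();
    this.delta = delta;
  }

  /** Returns the next live record, or empty when the base file is exhausted. */
  Optional<String> next() {
    while (baseKeys.hasNext()) {
      String key = baseKeys.next();
      // Untouched base records pass through; updated ones take the delta value.
      Optional<String> merged = delta.getOrDefault(key, Optional.of(key));
      if (merged.isPresent()) {
        return merged;
      }
      // Empty payload == delete: fall through and read the next base record.
    }
    return Optional.empty();
  }
}
```

Because the loop only ever moves the base iterator forward one record at a time, no base record is skipped except the deleted ones — which matches the reviewer's point that no record is missed.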
[GitHub] [hudi] root18039532923 commented on issue #2623: org.apache.hudi.exception.HoodieDependentSystemUnavailableException:System HBASE unavailable.
root18039532923 commented on issue #2623:
URL: https://github.com/apache/hudi/issues/2623#issuecomment-807857780

no reply
[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601930831 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration} +import org.apache.hudi.common.engine.HoodieLocalEngineContext +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.HoodieBaseFile +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.view.HoodieTableFileSystemView +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.{InternalRow, expressions} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.avro.SchemaConverters +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate} +import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils} +import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory, PartitionUtils} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +/** + * A File Index which support partition prune for hoodie snapshot and read-optimized + * query. + * Main steps to get the file list for query: + * 1、Load all files and partition values from the table path. + * 2、Do the partition prune by the partition filter condition. + * + * There are 3 cases for this: + * 1、If the partition columns size is equal to the actually partition path level, we + * read it as partitioned table.(e.g partition column is "dt", the partition path is "2021-03-10") + * + * 2、If the partition columns size is not equal to the partition path level, but the partition + * column size is "1" (e.g. 
partition column is "dt", but the partition path is "2021/03/10" + * who'es directory level is 3).We can still read it as a partitioned table. We will mapping the + * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt"). + * + * 3、Else the the partition columns size is not equal to the partition directory level and the + * size is great than "1" (e.g. partition column is "dt,hh", the partition path is "2021/03/10/12") + * , we read it as a None Partitioned table because we cannot know how to mapping the partition + * path with the partition columns in this case. + */ +case class HoodieFileIndex( + spark: SparkSession, + basePath: String, + schemaSpec: Option[StructType], + options: Map[String, String]) + extends FileIndex with Logging { + + @transient private val hadoopConf = spark.sessionState.newHadoopConf() + private lazy val metaClient = HoodieTableMetaClient +.builder().setConf(hadoopConf).setBasePath(basePath).build() + + @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required")) + /** +* Get the schema of the table. +*/ + lazy val schema: StructType = schemaSpec.getOrElse({ +val schemaUtil = new TableSchemaResolver(metaClient) +SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema) + .dataType.asInstanceOf[StructType] + }) + + /** +* Get the partition schema from the hoodie.properties. +*/ + private lazy val _partitionSchemaFromProperties: StructType = { +val tableConfig = metaClient.getTableConfig +val partitionColumns = tableConfig.getPartitionColumns +val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap + +if (partitionColumns.isPresent) { + val partitionFields = partitionColumns.get().map(column => +nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column " + +
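The three partition-mapping cases described in the `HoodieFileIndex` scaladoc above can be illustrated with a small resolver. This is a hypothetical Java sketch, not Hudi's actual resolution logic:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Given the declared partition columns and a concrete partition path, decide
// how path segments map onto columns, following the scaladoc's three cases.
class PartitionMapper {

  /** Returns column->value mapping, or an empty map for "treat as non-partitioned". */
  static Map<String, String> map(List<String> partitionColumns, String partitionPath) {
    String[] segments = partitionPath.split("/");
    Map<String, String> result = new LinkedHashMap<>();
    if (partitionColumns.size() == segments.length) {
      // Case 1: one path level per partition column, e.g. "dt" -> "2021-03-10".
      for (int i = 0; i < segments.length; i++) {
        result.put(partitionColumns.get(i), segments[i]);
      }
    } else if (partitionColumns.size() == 1) {
      // Case 2: a single column spread over several levels,
      // e.g. "2021/03/10" maps wholesale to the only column "dt".
      result.put(partitionColumns.get(0), String.join("/", segments));
    }
    // Case 3: sizes disagree and there is more than one column -> empty map,
    // i.e. the table is read as non-partitioned because the mapping is ambiguous.
    return result;
  }
}
```

The empty-map result for case 3 captures why the index falls back to a non-partitioned read: with multiple columns and a mismatched directory depth there is no unambiguous assignment of segments to columns.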
[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601930658 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration} +import org.apache.hudi.common.engine.HoodieLocalEngineContext +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.HoodieBaseFile +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.view.HoodieTableFileSystemView +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.{InternalRow, expressions} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.avro.SchemaConverters +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate} +import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils} +import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory, PartitionUtils} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +/** + * A File Index which support partition prune for hoodie snapshot and read-optimized + * query. + * Main steps to get the file list for query: + * 1、Load all files and partition values from the table path. + * 2、Do the partition prune by the partition filter condition. + * + * There are 3 cases for this: + * 1、If the partition columns size is equal to the actually partition path level, we + * read it as partitioned table.(e.g partition column is "dt", the partition path is "2021-03-10") + * + * 2、If the partition columns size is not equal to the partition path level, but the partition + * column size is "1" (e.g. 
partition column is "dt", but the partition path is "2021/03/10" + * who'es directory level is 3).We can still read it as a partitioned table. We will mapping the + * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt"). + * + * 3、Else the the partition columns size is not equal to the partition directory level and the + * size is great than "1" (e.g. partition column is "dt,hh", the partition path is "2021/03/10/12") + * , we read it as a None Partitioned table because we cannot know how to mapping the partition + * path with the partition columns in this case. + */ +case class HoodieFileIndex( + spark: SparkSession, + basePath: String, + schemaSpec: Option[StructType], + options: Map[String, String]) + extends FileIndex with Logging { + + @transient private val hadoopConf = spark.sessionState.newHadoopConf() + private lazy val metaClient = HoodieTableMetaClient +.builder().setConf(hadoopConf).setBasePath(basePath).build() + + @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required")) + /** +* Get the schema of the table. +*/ + lazy val schema: StructType = schemaSpec.getOrElse({ +val schemaUtil = new TableSchemaResolver(metaClient) +SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema) + .dataType.asInstanceOf[StructType] + }) + + /** +* Get the partition schema from the hoodie.properties. +*/ + private lazy val _partitionSchemaFromProperties: StructType = { +val tableConfig = metaClient.getTableConfig +val partitionColumns = tableConfig.getPartitionColumns +val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap + +if (partitionColumns.isPresent) { + val partitionFields = partitionColumns.get().map(column => +nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column " + +
[GitHub] [hudi] umehrot2 edited a comment on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 edited a comment on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807830171

LGTM now. I mostly have just minor comments now which you can address.

Can you run these unit tests you added once with `-Pspark3` to make sure this is running seamlessly for Spark 3? The travis tests right now don't run the tests with Spark 3.
[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807830171

Can you run these unit tests you added once with `-Pspark3` to make sure this is running seamlessly for Spark 3? The travis tests right now don't run the tests with Spark 3.
[GitHub] [hudi] umehrot2 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601854740 ## File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java ## @@ -276,6 +276,16 @@ public static void processFiles(FileSystem fs, String basePathStr, Functionhttp://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi + +import java.util.Properties + +import scala.collection.JavaConverters._ +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration} +import org.apache.hudi.common.engine.HoodieLocalEngineContext +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.HoodieBaseFile +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.view.HoodieTableFileSystemView +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.{InternalRow, expressions} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.avro.SchemaConverters +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate} +import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils} +import org.apache.spark.sql.execution.datasources.{FileIndex, FileStatusCache, NoopCache, PartitionDirectory, PartitionUtils} +import org.apache.spark.sql.internal.SQLConf +import 
org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +import scala.collection.mutable + +/** + * A File Index which support partition prune for hoodie snapshot and read-optimized + * query. + * Main steps to get the file list for query: + * 1、Load all files and partition values from the table path. + * 2、Do the partition prune by the partition filter condition. + * + * There are 3 cases for this: + * 1、If the partition columns size is equal to the actually partition path level, we + * read it as partitioned table.(e.g partition column is "dt", the partition path is "2021-03-10") + * + * 2、If the partition columns size is not equal to the partition path level, but the partition + * column size is "1" (e.g. partition column is "dt", but the partition path is "2021/03/10" + * who'es directory level is 3).We can still read it as a partitioned table. We will mapping the + * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt"). + * + * 3、Else the the partition columns size is not equal to the partition directory level and the + * size is great than "1" (e.g. partition column is "dt,hh", the partition path is "2021/03/10/12") + * , we read it as a None Partitioned table because we cannot know how to mapping the partition + * path with the partition columns in this case. + */ +case class HoodieFileIndex( + spark: SparkSession, + metaClient: HoodieTableMetaClient, + schemaSpec: Option[StructType], + options: Map[String, String], + @transient fileStatusCache: FileStatusCache = NoopCache) + extends FileIndex with Logging { + + private val basePath = metaClient.getBasePath + + @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required")) + /** +* Get the schema of the table. 
+*/ + lazy val schema: StructType = schemaSpec.getOrElse({ +val schemaUtil = new TableSchemaResolver(metaClient) +SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema) + .dataType.asInstanceOf[StructType] + }) + + /** +* Get the partition schema from hoodie.properties. +*/ + private lazy val _partitionSchemaFromProperties: StructType = { +val tableConfig = metaClient.getTableConfig +val partitionColumns = tableConfig.getPartitionColumns +val nameFieldMap = schema.fields.map(field => field.name -> field).toMap + +if (partitionColumns.isPresent) { + val partitionFields = partitionColumns.get().map(column => +nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" + + s"$column' in the schema[${schema.fields.mkString(",")}]"))) + new StructType(partitionFields) +} else { // If the partition columns are not stored in hoodie.properties (tables created + // with earlier versions), we treat it as a non-partitioned table. + new StructType() +} + } + + @transient @volatile private var fileSystemView:
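The three partition-mapping cases described in the HoodieFileIndex scaladoc above can be sketched as a small standalone function. This is an illustrative sketch only, not Hudi's actual implementation; all names here are made up:

```python
def map_partition_values(partition_cols, partition_path):
    """Toy model of the three cases: (1) one path level per partition column,
    (2) a single partition column spread over several path levels,
    (3) an ambiguous mismatch, treated as non-partitioned."""
    levels = partition_path.split("/")
    if len(partition_cols) == len(levels):
        # Case 1: column count matches path depth, e.g. dt -> "2021-03-10"
        return dict(zip(partition_cols, levels))
    if len(partition_cols) == 1:
        # Case 2: one column, deeper path, e.g. dt -> "2021/03/10"
        return {partition_cols[0]: partition_path}
    # Case 3: several columns but mismatched depth; the mapping is
    # ambiguous, so read the table as non-partitioned
    return {}
```

This mirrors why case 3 falls back to a non-partitioned read: with `["dt", "hh"]` against `"2021/03/10/12"` there is no unambiguous way to assign path segments to columns.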
[GitHub] [hudi] n3nash commented on a change in pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…
n3nash commented on a change in pull request #2607: URL: https://github.com/apache/hudi/pull/2607#discussion_r601926515 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java ## @@ -336,8 +340,12 @@ public void write(GenericRecord oldRecord) { } long fileSizeInBytes = FSUtils.getFileSize(fs, newFilePath); - HoodieWriteStat stat = writeStatus.getStat(); + // record write metrics + executorMetrics.recordWriteMetrics(getIOType(), InetAddress.getLocalHost().getHostName(), Review comment: private Map> { list.add("io_type", getIOType()); .. .. return new Map(CONSTANTS.WRITE_METRICS, list) } -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
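The reviewer's sketch above suggests bundling the individual write metrics into one map keyed by a metrics-group constant, instead of recording each value separately. A hedged illustration of that shape (all names are hypothetical, not Hudi's API):

```python
def build_write_metrics(io_type, hostname, file_size_in_bytes):
    """Gather write metrics into a single map under a group key, so the
    caller can report them in one call rather than field by field."""
    values = {
        "io_type": io_type,
        "host": hostname,
        "file_size_in_bytes": file_size_in_bytes,
    }
    # "WRITE_METRICS" stands in for the CONSTANTS.WRITE_METRICS key in the sketch
    return {"WRITE_METRICS": values}
```

The benefit of this shape is that the reporting layer only needs to know the group key, and new metrics can be added without changing the reporting call signature.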
[GitHub] [hudi] MyLanPangzi commented on pull request #2719: [HUDI-1721] run_sync_tool support hive3
MyLanPangzi commented on pull request #2719: URL: https://github.com/apache/hudi/pull/2719#issuecomment-807731849 Sorry, I forgot to comment on the issue https://github.com/apache/hudi/issues/2717 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rubenssoto commented on pull request #2699: [HUDI-1709] Improving lock config names and adding hive metastore uri config
rubenssoto commented on pull request #2699: URL: https://github.com/apache/hudi/pull/2699#issuecomment-807731807 Hello, how does this change to the hive metastore uri config work? I want to try syncing metadata through the hive metastore instead of hive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #2678: Added support for replace commits in commit showpartitions, commit sh…
satishkotha commented on a change in pull request #2678: URL: https://github.com/apache/hudi/pull/2678#discussion_r594753875 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java ## @@ -431,4 +442,20 @@ public String syncCommits(@CliOption(key = {"path"}, help = "Path of the table t return "Load sync state between " + HoodieCLI.getTableMetaClient().getTableConfig().getTableName() + " and " + HoodieCLI.syncTableMetadata.getTableConfig().getTableName(); } + + /** + * Checks whether a commit or replacecommit action exists in the timeline. + */ + private Option<HoodieInstant> getCommitOrReplaceCommitInstant(HoodieTimeline timeline, String instantTime) { Review comment: consider changing the signature to return an Option of HoodieCommitMetadata and deserializing the instant details inside this method. This would avoid repeating the instant-details lookup in multiple places. You can also do additional validation here; for example, for a replace commit, deserialize using the HoodieReplaceCommitMetadata class. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
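The suggested refactor above (resolve the instant and deserialize its metadata in one helper, with extra validation for replace commits) can be sketched roughly as follows. This is a hypothetical illustration, not Hudi's actual API; the timeline is modeled as a plain dict and metadata as deserialized dicts:

```python
def get_commit_or_replace_commit_metadata(timeline, instant_time):
    """Return deserialized commit metadata for a commit/replacecommit
    instant, or None if no such instant exists. Callers no longer need
    to look up instant details themselves."""
    instant = timeline.get(instant_time)
    if instant is None or instant["action"] not in ("commit", "replacecommit"):
        return None
    # dict() stands in for deserializing the instant's JSON details
    metadata = dict(instant["details"])
    if instant["action"] == "replacecommit":
        # validation only replace commits need (they carry extra fields)
        if "partitionToReplaceFileIds" not in metadata:
            raise ValueError("malformed replacecommit metadata")
    return metadata
```

The design point is that every caller now gets back ready-to-use metadata (or None), so the deserialization and validation logic lives in exactly one place.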
[GitHub] [hudi] n3nash commented on pull request #2678: Added support for replace commits in commit showpartitions, commit sh…
n3nash commented on pull request #2678: URL: https://github.com/apache/hudi/pull/2678#issuecomment-807130229 @jsbali Please file a jira ticket and add it to the heading of this PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
codecov-io edited a comment on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-792430670 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=h1) Report > Merging [#2645](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=desc) (05ef658) into [master](https://codecov.io/gh/apache/hudi/commit/900de34e45b4c1d19c01ea84adc38413f2bd52ff?el=desc) (900de34) will **increase** coverage by `17.96%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2645/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master    #2645       +/-   ##
=============================================
+ Coverage     51.76%   69.73%   +17.96%
+ Complexity     3601      371     -3230
=============================================
  Files           476       54      -422
  Lines        22579     1989    -20590
  Branches      2407      236     -2171
=============================================
- Hits         11688     1387    -10301
+ Misses        9874      471     -9403
+ Partials      1017      131      -886
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2645?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.37% <0.00%> (ø)` | `55.00% <0.00%> (ø%)` | | | [...ache/hudi/common/fs/SizeAwareDataOutputStream.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL1NpemVBd2FyZURhdGFPdXRwdXRTdHJlYW0uamF2YQ==) | | | | | [...rg/apache/hudi/common/model/HoodieRollingStat.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJvbGxpbmdTdGF0LmphdmE=) | | | | | [...rc/main/java/org/apache/hudi/ApiMaturityLevel.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvQXBpTWF0dXJpdHlMZXZlbC5qYXZh) | | | | | [...ache/hudi/source/StreamReadMonitoringFunction.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvU3RyZWFtUmVhZE1vbml0b3JpbmdGdW5jdGlvbi5qYXZh) | | | | | [.../hudi/common/bloom/InternalDynamicBloomFilter.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0ludGVybmFsRHluYW1pY0Jsb29tRmlsdGVyLmphdmE=) | | | | | [...e/hudi/common/util/queue/BoundedInMemoryQueue.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvQm91bmRlZEluTWVtb3J5UXVldWUuamF2YQ==) | | | | | 
[...a/org/apache/hudi/cli/HoodieTableHeaderFields.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVRhYmxlSGVhZGVyRmllbGRzLmphdmE=) | | | | | [...i/table/format/cow/Int64TimestampColumnReader.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvY293L0ludDY0VGltZXN0YW1wQ29sdW1uUmVhZGVyLmphdmE=) | | | | | [...udi/spark3/internal/HoodieWriterCommitMessage.java](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3NwYXJrMy9pbnRlcm5hbC9Ib29kaWVXcml0ZXJDb21taXRNZXNzYWdlLmphdmE=) | | | | | ... and [405 more](https://codecov.io/gh/apache/hudi/pull/2645/diff?src=pr=tree-more) | | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1716) rt view w/ MOR tables fails after schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1716: -- Description: Looks like realtime view w/ MOR table fails if schema present in existing log file is evolved to add a new field. no issues w/ writing. but reading fails More info: [https://github.com/apache/hudi/issues/2675] gist of the stack trace: Caused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedField Caused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedField at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.deserializeRecords(HoodieAvroDataBlock.java:165) at org.apache.hudi.common.table.log.block.HoodieDataBlock.createRecordsFromContentBytes(HoodieDataBlock.java:128) at org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecords(HoodieDataBlock.java:106) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processDataBlock(AbstractHoodieLogRecordScanner.java:289) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:324) at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:252) ... 24 more 21/03/25 11:27:03 WARN TaskSetManager: Lost task 0.0 in stage 83.0 (TID 667, sivabala-c02xg219jgh6.attlocal.net, executor driver): org.apache.hudi.exception.HoodieException: Exception when reading log file at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:261) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:100) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:93) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:75) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:230) at org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:328) at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.<init>(HoodieMergeOnReadRDD.scala:210) at org.apache.hudi.HoodieMergeOnReadRDD.payloadCombineFileIterator(HoodieMergeOnReadRDD.scala:200) at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:77) Logs from local run: [https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198] diff with which above logs were generated: [https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec] Steps to reproduce in spark shell: # create MOR table w/ schema1. # Ingest (with schema1) until log files are created. // verify via hudi-cli. I didn't see log files w/ just 1 batch of updates. If not, do multiple rounds until you see log files. # create a new schema2 with one new additional field. Ingest a batch with schema2 that updates existing records. # read entire dataset. was: Looks like realtime view w/ MOR table fails if schema present in existing log file is evolved to add a new field. no issues w/ writing. 
but reading fails More info: [https://github.com/apache/hudi/issues/2675] Logs from local run: [https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198] diff with which above logs were generated: [https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec] Steps to reproduce in spark shell: # create MOR table w/ schema1. # Ingest (with schema1) until log files are created. // verify via hudi-cli. I didn't see log files w/ just 1 batch of updates. If not, do multiple rounds until you see log files. # create a new schema2 with one new additional field. ingest a batch with schema2 that updates existing records. # read entire dataset. > rt view w/ MOR tables fails after schema evolution > -- > > Key: HUDI-1716 > URL: https://issues.apache.org/jira/browse/HUDI-1716 > Project: Apache Hudi > Issue Type: Bug >
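The AvroTypeException in the issue above comes from Avro's schema-resolution rules: a field that appears in the reader schema but was never written by the writer schema must carry a default value, otherwise decoding the old log blocks fails. The following is a toy model of that one rule (not the real Avro resolver); the field dicts are a simplified stand-in for Avro schema fields:

```python
def check_reader_schema(writer_fields, reader_fields):
    """Raise, like Avro's resolver, when the reader schema adds a field
    the writer never wrote and supplies no default for it."""
    writer_names = {f["name"] for f in writer_fields}
    for f in reader_fields:
        if f["name"] not in writer_names and "default" not in f:
            raise TypeError("missing required field " + f["name"])

# schema1 is what the existing log file was written with;
# schema2 adds "evolvedField", with and without a default
schema1 = [{"name": "uuid"}, {"name": "fare"}]
schema2_no_default = schema1 + [{"name": "evolvedField"}]
schema2_with_default = schema1 + [{"name": "evolvedField", "default": None}]
```

This is why adding the new field with a default (e.g. `"default": null` in the Avro schema) typically makes the realtime read succeed again: resolution can fill the missing value instead of failing.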
[jira] [Updated] (HUDI-1716) rt view w/ MOR tables fails after schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1716: -- Description: Looks like realtime view w/ MOR table fails if schema present in existing log file is evolved to add a new field. no issues w/ writing. but reading fails More info: [https://github.com/apache/hudi/issues/2675] gist of the stack trace: Caused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedField Caused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedField at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.deserializeRecords(HoodieAvroDataBlock.java:165) at org.apache.hudi.common.table.log.block.HoodieDataBlock.createRecordsFromContentBytes(HoodieDataBlock.java:128) at org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecords(HoodieDataBlock.java:106) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processDataBlock(AbstractHoodieLogRecordScanner.java:289) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:324) at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:252) ... 24 more 21/03/25 11:27:03 WARN TaskSetManager: Lost task 0.0 in stage 83.0 (TID 667, sivabala-c02xg219jgh6.attlocal.net, executor driver): org.apache.hudi.exception.HoodieException: Exception when reading log file at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:261) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:100) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:93) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:75) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:230) at org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:328) at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.<init>(HoodieMergeOnReadRDD.scala:210) at org.apache.hudi.HoodieMergeOnReadRDD.payloadCombineFileIterator(HoodieMergeOnReadRDD.scala:200) at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:77) Logs from local run: [https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198] diff with which above logs were generated: [https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec] Steps to reproduce in spark shell: # create MOR table w/ schema1. # Ingest (with schema1) until log files are created. // verify via hudi-cli. It took me 2 batches of updates to see a log file. # create a new schema2 with one new additional field. Ingest a batch with schema2 that updates existing records. # read entire dataset. was: Looks like realtime view w/ MOR table fails if schema present in existing log file is evolved to add a new field. no issues w/ writing. 
but reading fails More info: [https://github.com/apache/hudi/issues/2675] gist of the stack trace: Caused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedFieldCaused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedField at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) at
[GitHub] [hudi] garyli1019 commented on a change in pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
garyli1019 commented on a change in pull request #2721: URL: https://github.com/apache/hudi/pull/2721#discussion_r601540575 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java ## @@ -95,15 +103,24 @@ public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOEx // TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the // deltaRecord may not be a full record and needs values of columns from the parquet Option<GenericRecord> rec; -if (usesCustomPayload) { - rec = deltaRecordMap.get(key).getData().getInsertValue(getWriterSchema()); -} else { - rec = deltaRecordMap.get(key).getData().getInsertValue(getReaderSchema()); +rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key)); +// If the record is not present, this is a delete record using an empty payload so skip this base record +// and move to the next record +while (!rec.isPresent()) { + // if current parquet reader has no record, return false + if (!this.parquetReader.next(aVoid, arrayWritable)) { Review comment: if the parquet reader still has records, this call fetches the next one but we never process it, so I guess we will miss a record here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
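The concern raised above — advancing the base (parquet) reader while skipping delete records can silently drop a row — can be illustrated with a toy merge loop. The sketch below is purely illustrative (not the actual RealtimeCompactedRecordReader, and it only models updates/deletes of existing base keys); the key point is that each newly fetched base record must be re-checked against the delta map rather than discarded:

```python
def merge_base_with_deltas(base_records, delta_map):
    """Merge base records with a delta map where a value of None marks a
    delete. When a delete skips a base record, the next base record fetched
    is re-checked against the delta map instead of being thrown away."""
    out = []
    it = iter(base_records)
    for rec in it:
        key = rec["key"]
        while key in delta_map:
            merged = delta_map[key]  # stands in for getInsertValue()
            if merged is not None:
                out.append(merged)   # delta record wins over the base record
                break
            # delete record: skip this base record, pull the next one, and
            # loop back to re-check it against the delta map
            rec = next(it, None)
            if rec is None:
                return out           # base reader exhausted mid-skip
            key = rec["key"]
        else:
            out.append(rec)          # no delta for this key: emit base record
    return out
```

If the inner loop instead just called `next()` to test for more records and then fell through, the freshly fetched record would be lost, which is exactly the bug the reviewer is asking about.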
[GitHub] [hudi] codecov-io edited a comment on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
codecov-io edited a comment on pull request #2651: URL: https://github.com/apache/hudi/pull/2651#issuecomment-794945140 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=h1) Report > Merging [#2651](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=desc) (baa4876) into [master](https://codecov.io/gh/apache/hudi/commit/ce3e8ec87083ef4cd4f33de39b6697f66ff3f277?el=desc) (ce3e8ec) will **increase** coverage by `0.16%`. > The diff coverage is `74.49%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2651/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2651      +/-   ##
============================================
+ Coverage     51.76%   51.93%   +0.16%
- Complexity     3602     3645      +43
============================================
  Files           476      478       +2
  Lines        22579    22797     +218
  Branches      2408     2447      +39
============================================
+ Hits         11688    11839     +151
- Misses        9874     9906      +32
- Partials      1017     1052      +35
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `50.88% <0.00%> (-0.05%)` | `0.00 <0.00> (ø)` | |
| hudiflink | `54.08% <ø> (-0.20%)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `71.78% <79.73%> (+0.91%)` | `0.00 <26.00> (ø)` | |
| hudisync | `45.58% <ø> (-0.12%)` | `0.00 <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.72% <50.00%> (-0.06%)` | `0.00 <0.00> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2651?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==) | `47.34% <0.00%> (-0.94%)` | `57.00 <0.00> (ø)` | | | [...rg/apache/hudi/common/table/HoodieTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlQ29uZmlnLmphdmE=) | `43.20% <0.00%> (-2.25%)` | `17.00 <0.00> (ø)` | | | [...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh) | `66.66% <0.00%> (-1.65%)` | `43.00 <0.00> (ø)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `71.28% <50.00%> (-0.45%)` | `56.00 <0.00> (ø)` | | | [...src/main/scala/org/apache/hudi/DefaultSource.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0RlZmF1bHRTb3VyY2Uuc2NhbGE=) | `79.38% <68.42%> (-4.77%)` | `31.00 <0.00> (+14.00)` | :arrow_down: | | [...c/main/scala/org/apache/hudi/HoodieFileIndex.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUZpbGVJbmRleC5zY2FsYQ==) | `79.33% <79.33%> (ø)` | `24.00 <24.00> (?)` | | | 
[.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkU25hcHNob3RSZWxhdGlvbi5zY2FsYQ==) | `89.79% <87.50%> (+0.66%)` | `18.00 <1.00> (+1.00)` | | | [...cala/org/apache/hudi/HoodieBootstrapRelation.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUJvb3RzdHJhcFJlbGF0aW9uLnNjYWxh) | `89.01% <100.00%> (+1.51%)` | `18.00 <1.00> (+3.00)` | | | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2651/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | `58.33% <100.00%> (+0.54%)` | `0.00 <0.00> (ø)` | | |
[GitHub] [hudi] codecov-io commented on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
codecov-io commented on pull request #2721: URL: https://github.com/apache/hudi/pull/2721#issuecomment-806823048 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=h1) Report > Merging [#2721](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=desc) (fb529db) into [master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc) (6e803e0) will **decrease** coverage by `42.32%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2721/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2721       +/-   ##
============================================
- Coverage     51.72%   9.40%   -42.33%
+ Complexity     3601      48     -3553
============================================
  Files           476      54      -422
  Lines        22595    1989    -20606
  Branches      2409     236     -2173
============================================
- Hits         11687     187    -11500
+ Misses        9889    1789     -8100
+ Partials      1019      13     -1006
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.40% <ø> (-60.34%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2721?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2721/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[GitHub] [hudi] cdmikechen commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3
cdmikechen commented on a change in pull request #2719: URL: https://github.com/apache/hudi/pull/2719#discussion_r601483903 ## File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh ## @@ -49,6 +49,15 @@ fi HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'` HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON +HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'` +if [ -n "$HIVE_CALCITE" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE +fi +HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'` +if [ -n "$HIVE_LIBFB303" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303 Review comment: I had the same problem: hive3 needs these jars to establish a connection. The connection-related dependency packages changed between hive2 and hive3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3
cdmikechen commented on a change in pull request #2719: URL: https://github.com/apache/hudi/pull/2719#discussion_r601483289 ## File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh ## @@ -49,6 +49,15 @@ fi HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'` HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON +HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'` +if [ -n "$HIVE_CALCITE" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE +fi +HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'` +if [ -n "$HIVE_LIBFB303" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303 Review comment: I can explain. I had the same problem: Hudi lacks the dependency libs needed for a hive3 connection. The connection-related dependency packages changed between hive2 and hive3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] cdmikechen commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3
cdmikechen commented on a change in pull request #2719: URL: https://github.com/apache/hudi/pull/2719#discussion_r601476629 ## File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh ## @@ -49,6 +49,15 @@ fi HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'` HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON +HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'` +if [ -n "$HIVE_CALCITE" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE +fi +HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'` +if [ -n "$HIVE_LIBFB303" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303 Review comment: I had the same problem; Hive 3 needs these jars to establish the connection. The dependency packages that the Hive 2 and Hive 3 connections rely on have changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
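The jar-detection logic from the diff above can be exercised on its own. The sketch below reproduces it against a faked ${HIVE_HOME}; the jar file names and the temp directory are invented for illustration, not taken from a real Hive install:

```shell
# Sketch (not the actual run_sync_tool.sh) of the change under review:
# append Hive 3's extra dependency jars (calcite, libfb303) to the classpath
# only when they are present, so a Hive 2 install without them still works.
# A throwaway HIVE_HOME with made-up jar names keeps the sketch self-contained.
HIVE_HOME=$(mktemp -d)
mkdir -p "$HIVE_HOME/lib"
touch "$HIVE_HOME/lib/calcite-core-1.16.0.jar" "$HIVE_HOME/lib/libfb303-0.9.3.jar"

HIVE_JARS="base.jar"   # stands in for the jars collected earlier in the script

# `ls ... | tr '\n' ':'` yields a ':'-separated list; it is empty when no jar
# matches, so the `-n` test skips the append on a Hive 2 layout.
HIVE_CALCITE=$(ls "$HIVE_HOME"/lib/calcite-*.jar 2>/dev/null | tr '\n' ':')
if [ -n "$HIVE_CALCITE" ]; then
  HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE
fi
HIVE_LIBFB303=$(ls "$HIVE_HOME"/lib/libfb303-*.jar 2>/dev/null | tr '\n' ':')
if [ -n "$HIVE_LIBFB303" ]; then
  HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303
fi
echo "$HIVE_JARS"
```

On a Hive 2 install the two globs match nothing, both `-n` tests fail, and HIVE_JARS is left untouched, which is why the change stays backward compatible.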
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE
xiarixiaoyao edited a comment on pull request #2722: URL: https://github.com/apache/hudi/pull/2722#issuecomment-806721657

Test steps (before patch):

Step 1: create the hoodie table hive_14b with bulk_insert (notice: bulk_insert will produce 4 files in the hoodie table):

    val df = spark.range(0, 10).toDF("keyid")
      .withColumn("col3", expr("keyid"))
      .withColumn("p", lit(0))
      .withColumn("p1", lit(0))
      .withColumn("p2", lit(7))
      .withColumn("a1", lit(Array[String]("sb1", "rz")))
      .withColumn("a2", lit(Array[String]("sb1", "rz")))
    // create hoodie table hive_14b
    merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

Step 2: upsert the table (now we have four base files and one log file in the hoodie table):

    val df = spark.range(9, 12).toDF("keyid")
      .withColumn("col3", expr("keyid"))
      .withColumn("p", lit(0))
      .withColumn("p1", lit(0))
      .withColumn("p2", lit(7))
      .withColumn("a1", lit(Array[String]("sb1", "rz")))
      .withColumn("a2", lit(Array[String]("sb1", "rz")))
    // upsert table
    merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

Step 3: from spark-sql/beeline, run select count(col3) from hive_14b_rt; then the query fails:

    2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
        at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
        at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
        at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
        at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

After patch (spark-sql/hive-beeline):

    select count(col3) from hive_14b_rt;
    +------+
    | _c0  |
    +------+
    | 12   |
    +------+

merge function:

    def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, tableName: String,
              tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
              hivePartitionExtract: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
              op: String = "upsert"): Unit = {
      val mode = if (op.equals("bulk_insert")) { Overwrite } else { Append }
      df.write.format("hudi").
        option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
        option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
        option(PRECOMBINE_FIELD_OPT_KEY, "col3").
        option(RECORDKEY_FIELD_OPT_KEY, "keyid").
        option(PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2").
        option(DataSourceWriteOptions.OPERATION_OPT_KEY, op).
        option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[ComplexKeyGenerator].getName).
        option("hoodie.bulkinsert.shuffle.parallelism", par.toString).
        option("hoodie.metadata.enable", "false").
        option("hoodie.insert.shuffle.parallelism", par.toString).
        option("hoodie.upsert.shuffle.parallelism", par.toString).
        option("hoodie.delete.shuffle.parallelism", par.toString).
        option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
        option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "p,p1,p2").
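The file layout asserted in the steps above (four base files after bulk_insert, one log file after upsert) can be spot-checked on the table's partition directory. The sketch below fakes that directory, since the real path and file names depend on the table location and Hudi's write tokens; the names used here are invented:

```shell
# Hypothetical check of the MOR table layout described above: after bulk_insert
# there should be 4 parquet base files, and after the upsert one log file.
# A fake partition directory stands in for .../hive_14b/p=0/p1=0/p2=7.
PARTITION_PATH=$(mktemp -d)
touch "$PARTITION_PATH"/f1.parquet "$PARTITION_PATH"/f2.parquet \
      "$PARTITION_PATH"/f3.parquet "$PARTITION_PATH"/f4.parquet
touch "$PARTITION_PATH"/.f1.log.1   # MOR log files are typically dot-prefixed

base_files=$(ls "$PARTITION_PATH"/*.parquet | wc -l)
log_files=$(ls "$PARTITION_PATH"/.*.log.* 2>/dev/null | wc -l)
echo "base=$base_files log=$log_files"
```

Because the log files are hidden (dot-prefixed), a plain `ls` on the partition path shows only the base files, which is easy to misread as "no log file was written".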
[GitHub] [hudi] xiarixiaoyao commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE
xiarixiaoyao commented on pull request #2722: URL: https://github.com/apache/hudi/pull/2722#issuecomment-806722454 @garyli1019 could you please help me review this PR? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE
xiarixiaoyao commented on pull request #2722: URL: https://github.com/apache/hudi/pull/2722#issuecomment-806721657

Test steps (before patch):

Step 1: create the hoodie table hive_14b with bulk_insert (notice: bulk_insert will produce 4 files in the hoodie table):

    val df = spark.range(0, 10).toDF("keyid")
      .withColumn("col3", expr("keyid"))
      .withColumn("p", lit(0))
      .withColumn("p1", lit(0))
      .withColumn("p2", lit(7))
      .withColumn("a1", lit(Array[String]("sb1", "rz")))
      .withColumn("a2", lit(Array[String]("sb1", "rz")))
    // create hoodie table hive_14b
    merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

Step 2: upsert the table (now we have four base files and one log file in the hoodie table):

    val df = spark.range(9, 12).toDF("keyid")
      .withColumn("col3", expr("keyid"))
      .withColumn("p", lit(0))
      .withColumn("p1", lit(0))
      .withColumn("p2", lit(7))
      .withColumn("a1", lit(Array[String]("sb1", "rz")))
      .withColumn("a2", lit(Array[String]("sb1", "rz")))
    // upsert table
    merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

Step 3: from spark-sql/beeline, run select count(col3) from hive_14b_rt; then the query fails:

    2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
        at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
        at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
        at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
        at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
        at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
        at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

After patch (spark-sql/hive-beeline):

    select count(col3) from hive_14b_rt;
    +------+
    | _c0  |
    +------+
    | 12   |
    +------+

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io commented on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
codecov-io commented on pull request #2716: URL: https://github.com/apache/hudi/pull/2716#issuecomment-806719923 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=h1) Report > Merging [#2716](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=desc) (9ffe931) into [master](https://codecov.io/gh/apache/hudi/commit/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50?el=desc) (6e803e0) will **decrease** coverage by `42.32%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2716/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=tree)
```diff
@@             Coverage Diff              @@
##             master    #2716       +/-  ##
- Coverage     51.72%    9.40%   -42.33%
+ Complexity     3601       48     -3553
  Files           476       54      -422
  Lines         22595     1989    -20606
  Branches       2409      236     -2173
- Hits          11687      187    -11500
+ Misses         9889     1789     -8100
+ Partials       1019       13     -1006
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.40% <ø> (-60.34%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2716?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2716/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
[ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1722: - Labels: pull-request-available (was: ) > hive beeline/spark-sql query specified field on mor table occur NPE > > > Key: HUDI-1722 > URL: https://issues.apache.org/jira/browse/HUDI-1722 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration, Spark Integration >Affects Versions: 0.7.0 > Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1 >Reporter: tao meng >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > HUDI-892 introduced this problem. > That PR skips adding projection columns if there are no log files in the > hoodieRealtimeSplit, but it does not consider that multiple > getRecordReaders share the same jobConf. > Consider the following scenario: > we have four getRecordReaders: > reader1 (its hoodieRealtimeSplit contains no log files) > reader2 (its hoodieRealtimeSplit contains log files) > reader3 (its hoodieRealtimeSplit contains log files) > reader4 (its hoodieRealtimeSplit contains no log files) > Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in > jobConf is set to true, and no hoodie additional projection columns > are added to jobConf (see > HoodieParquetRealtimeInputFormat.addProjectionToJobConf). > reader2 runs later; since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in > jobConf is already true, no hoodie additional projection columns are > added to jobConf.
(see > HoodieParquetRealtimeInputFormat.addProjectionToJobConf), > which leads to _hoodie_record_key being missing, so the merge > step throws exceptions: > 2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics > report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException > at > org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101) > at > org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) > at > org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) > at > org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) > at > org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92) > at > org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) > at > org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) > at > org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68) > at > org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77) > at > org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205) > at > org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at > org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at > org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at > org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at > org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) > > Obviously, this is an intermittent problem. If reader2 runs first, hoodie > additional projection columns are added to jobConf and in that case the > query is fine. > Spark SQL can avoid this problem by setting spark.hadoop.cloneConf=true, which is > not recommended in Spark; however, Hive has no way to avoid this problem. > Test steps: > step1: > val df = spark.range(0, 10).toDF("keyid") > .withColumn("col3", expr("keyid")) > .withColumn("p", lit(0)) > .withColumn("p1", lit(0)) > .withColumn("p2", lit(7)) > .withColumn("a1", lit(Array[String]("sb1", "rz"))) > .withColumn("a2", lit(Array[String]("sb1", "rz"))) > // create hoodie table hive_14b > merge(df, 4, "default", "hive_14b", >
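The first-reader-wins behavior described above can be mimicked with a toy script. This is not Hudi code, just an analogy: an invented shared variable plays the role of HOODIE_READ_COLUMNS_PROP in the shared jobConf, and the projection string stands in for the hoodie additional projection columns:

```shell
# Toy analogy of the shared-jobConf bug (not Hudi code): the first reader to
# set the flag decides for every later reader, even one that needed the
# projection columns.
JOB_CONF_READ_COLUMNS_SET=""   # stands in for HOODIE_READ_COLUMNS_PROP
PROJECTION=""                  # stands in for the added projection columns

open_reader() {
  has_log_files=$1
  # Mirrors addProjectionToJobConf: only act if the flag is not set yet.
  if [ -z "$JOB_CONF_READ_COLUMNS_SET" ]; then
    JOB_CONF_READ_COLUMNS_SET=true
    if [ "$has_log_files" = "yes" ]; then
      PROJECTION="_hoodie_record_key"
    fi
  fi
}

open_reader no    # reader1: split has no log files, skips the projection
open_reader yes   # reader2: needs _hoodie_record_key, but the flag is set
echo "projection='$PROJECTION'"   # prints projection='' -> merge step NPEs
```

Swapping the two calls (reader2 first) leaves PROJECTION populated, matching the observation above that the failure depends on which split happens to be opened first.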
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE
xiarixiaoyao opened a new pull request #2722: URL: https://github.com/apache/hudi/pull/2722 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Fix the bug introduced by HUDI-892. That PR skips adding projection columns if there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf. Consider the following scenario: we have four getRecordReaders: reader1 (its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3 (its hoodieRealtimeSplit contains log files) reader4 (its hoodieRealtimeSplit contains no log files) Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to true, and no hoodie additional projection columns are added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf). reader2 runs later; since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is already true, no hoodie additional projection columns are added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf), which leads to _hoodie_record_key being missing, so the merge step throws exceptions: 2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) Obviously, this is an intermittent problem. If reader2 runs first, hoodie additional projection columns are added to jobConf and in that case the query is fine. ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request This pull request is already covered by existing tests. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a change in pull request #2719: [HUDI-1721] run_sync_tool support hive3
danny0405 commented on a change in pull request #2719: URL: https://github.com/apache/hudi/pull/2719#discussion_r601445623 ## File path: hudi-sync/hudi-hive-sync/run_sync_tool.sh ## @@ -49,6 +49,15 @@ fi HIVE_JACKSON=`ls ${HIVE_HOME}/lib/jackson-*.jar | tr '\n' ':'` HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON +HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'` +if [ -n "$HIVE_CALCITE" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_CALCITE +fi +HIVE_LIBFB303=`ls ${HIVE_HOME}/lib/libfb303-*.jar | tr '\n' ':'` +if [ -n "$HIVE_LIBFB303" ]; then +HIVE_JARS=$HIVE_JARS:$HIVE_LIBFB303 Review comment: Can you explain why we need these 2 kinds of jars ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
[ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1722: --- Description: HUDI-892 introduced this problem. That PR skips adding projection columns if there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf. Consider the following scenario: we have four getRecordReaders: reader1 (its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3 (its hoodieRealtimeSplit contains log files) reader4 (its hoodieRealtimeSplit contains no log files) Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to true, and no hoodie additional projection columns are added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf). reader2 runs later; since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is already true, no hoodie additional projection columns are added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf), which leads to _hoodie_record_key being missing, so the merge step throws exceptions: 2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) Obviously, this is an intermittent problem. If reader2 runs first, hoodie additional projection columns are added to jobConf and in that case the query is fine. Spark SQL can avoid this problem by setting spark.hadoop.cloneConf=true, which is not recommended in Spark; however, Hive has no way to avoid this problem. Test steps: step1: val df = spark.range(0, 10).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // create hoodie table hive_14b merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") notice: bulk_insert will produce 4 files in the hoodie table step2: val df = spark.range(9, 12).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert table merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") now: we have four base files and one log file in the hoodie table step3: spark-sql/beeline: select count(col3) from hive_14b_rt; then the query fails. was: HUDI-892 introduced this problem. That PR skips adding projection columns if there are no log files in the
[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
[ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1722:
---
Description:
HUDI-892 introduced this problem: that PR skips adding projection columns if there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf.

Consider the following scenario. We have four getRecordReaders:
reader1 (its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3 (its hoodieRealtimeSplit contains log files)
reader4 (its hoodieRealtimeSplit contains no log files)

Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is set to true, and no Hudi additional projection columns are added to the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf). reader2 runs later; since HOODIE_READ_COLUMNS_PROP in the jobConf is already true, no additional projection columns are added either. As a result, _hoodie_record_key is missing and the merge step throws:

Caused by: java.io.IOException: java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
Caused by: java.lang.NullPointerException
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)

Obviously, this is an intermittent problem: if reader2 runs first, the Hudi additional projection columns are added to the jobConf and the query succeeds. Spark SQL can avoid the problem by setting spark.hadoop.cloneConf=true, which is not recommended in Spark; Hive has no way to avoid it.

was: HUDI-892 introduced this problem: that PR skips adding projection columns if there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf. Consider the following scenario: we have four getRecordReaders: reader1 (its hoodieRealtimeSplit contains no log files), reader2 (log files), reader3 (log files), reader4 (no log files). Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is set to true, and no Hudi additional projection columns are added to the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf). reader2 runs later; since HOODIE_READ_COLUMNS_PROP is already true, no additional projection columns are added. As a result, _hoodie_record_key is missing and the merge step throws: Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) ... 24 more Caused by: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93) at
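The shared-jobConf hazard described in this report can be reproduced in miniature. The sketch below is a hedged illustration, not Hudi's actual code: the Map stands in for the shared JobConf, the set-once guard mimics HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP, and the names SharedConfSketch, openReader, and READ_COLUMNS_SET are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Miniature of the reported bug (illustrative, NOT Hudi's real API):
// a shared configuration plus a set-once guard means whichever reader
// opens first decides the projection for every later reader.
public class SharedConfSketch {
    public static final String READ_COLUMNS_SET = "hoodie.read.columns.set";

    // Mimics addProjectionToJobConf: only the FIRST reader ever mutates
    // the shared conf; later readers see the guard and do nothing.
    public static void openReader(Map<String, String> jobConf, boolean splitHasLogFiles) {
        if (!"true".equals(jobConf.get(READ_COLUMNS_SET))) {
            if (splitHasLogFiles) {
                // only splits with log files need _hoodie_record_key projected
                jobConf.put("projection", "_hoodie_record_key");
            }
            jobConf.put(READ_COLUMNS_SET, "true"); // set-once guard on the SHARED conf
        }
    }

    public static void main(String[] args) {
        Map<String, String> sharedConf = new HashMap<>();
        openReader(sharedConf, false); // reader1: no log files, flips the guard
        openReader(sharedConf, true);  // reader2: has log files, but the guard blocks it
        // projection is missing, so the later merge step NPEs on _hoodie_record_key
        System.out.println(sharedConf.containsKey("projection")); // prints false
    }
}
```

Run the two openReader calls in the opposite order and the projection is added, which is why the failure is intermittent and depends only on scheduling.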
[jira] [Created] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
tao meng created HUDI-1722:
--
Summary: hive beeline/spark-sql query specified field on mor table occur NPE
Key: HUDI-1722
URL: https://issues.apache.org/jira/browse/HUDI-1722
Project: Apache Hudi
Issue Type: Bug
Components: Hive Integration, Spark Integration
Affects Versions: 0.7.0
Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1
Reporter: tao meng
Fix For: 0.9.0

HUDI-892 introduced this problem: that PR skips adding projection columns if there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf.

Consider the following scenario. We have four getRecordReaders:
reader1 (its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3 (its hoodieRealtimeSplit contains log files)
reader4 (its hoodieRealtimeSplit contains no log files)

Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is set to true, and no Hudi additional projection columns are added to the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf). reader2 runs later; since HOODIE_READ_COLUMNS_PROP in the jobConf is already true, no additional projection columns are added either. As a result, _hoodie_record_key is missing and the merge step throws:

Caused by: java.io.IOException: java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
Caused by: java.lang.NullPointerException
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)

Obviously, this is an intermittent problem: if reader2 runs first, the Hudi additional projection columns are added to the jobConf and in that case the query succeeds.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] xiarixiaoyao commented on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
xiarixiaoyao commented on pull request #2716: URL: https://github.com/apache/hudi/pull/2716#issuecomment-806584838 @garyli1019 could you help me review this pr, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
xiarixiaoyao commented on pull request #2720: URL: https://github.com/apache/hudi/pull/2720#issuecomment-806584615 @garyli1019 could you help me review this pr, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
xiarixiaoyao commented on pull request #2721: URL: https://github.com/apache/hudi/pull/2721#issuecomment-806584382 @garyli1019 could you help me review this pr, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
xiarixiaoyao edited a comment on pull request #2721: URL: https://github.com/apache/hudi/pull/2721#issuecomment-806582989

test steps, before the patch:

step1:
val df = spark.range(0, 100).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// bulk_insert 100w rows (keyid from 0 to 100)
merge(df, 4, "default", "hive_9b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:
val df = spark.range(0, 90).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// delete 90w rows (keyid from 0 to 90)
delete(df, 4, "default", "hive_9b")

step3: query on beeline/spark-sql: select count(col3) from hive_9b_rt

2021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
2021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
2021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError
 at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
 at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
 at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
 at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
 at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
 at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
 at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
 at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)

After patch: select count(col3) from hive_9b_rt
[jira] [Commented] (HUDI-1721) run_sync_tool support hive3.1.2 on hadoop3.1.4
[ https://issues.apache.org/jira/browse/HUDI-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308606#comment-17308606 ] Danny Chen commented on HUDI-1721: -- Thanks for the contribution, would review soon ~ > run_sync_tool support hive3.1.2 on hadoop3.1.4 > --- > > Key: HUDI-1721 > URL: https://issues.apache.org/jira/browse/HUDI-1721 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.8.0 >Reporter: 谢波 >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/hudi/issues/2717] > run_sync_tool support hive3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
[ https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1720:
-
Labels: pull-request-available (was: )
> when query incr view of mor table which has many delete records use
> sparksql/hive-beeline, StackOverflowError
> ---
>
> Key: HUDI-1720
> URL: https://issues.apache.org/jira/browse/HUDI-1720
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration, Spark Integration
> Affects Versions: 0.7.0, 0.8.0
> Reporter: tao meng
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> RealtimeCompactedRecordReader.next currently handles delete records by
> recursion, see:
> [https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]
> However, when the log file contains many delete records, this recursive logic in
> RealtimeCompactedRecordReader.next leads to a StackOverflowError.
> test steps:
> step1:
> val df = spark.range(0, 100).toDF("keyid")
> .withColumn("col3", expr("keyid + 1000"))
> .withColumn("p", lit(0))
> .withColumn("p1", lit(0))
> .withColumn("p2", lit(7))
> .withColumn("a1", lit(Array[String]("sb1", "rz")))
> .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // bulk_insert 100w rows (keyid from 0 to 100)
> merge(df, 4, "default", "hive_9b",
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> step2:
> val df = spark.range(0, 90).toDF("keyid")
> .withColumn("col3", expr("keyid + 1000"))
> .withColumn("p", lit(0))
> .withColumn("p1", lit(0))
> .withColumn("p2", lit(7))
> .withColumn("a1", lit(Array[String]("sb1", "rz")))
> .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // delete 90w rows (keyid from 0 to 90)
> delete(df, 4, "default", "hive_9b")
> step3:
> query on beeline/spark-sql: select count(col3) from hive_9b_rt
> 2021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1,
> RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
> 2021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
> 2021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError
> at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
> at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
> at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
> at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
> at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
> at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
> at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
> at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
> at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
> at
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2721: [HUDI-1720] when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
xiarixiaoyao opened a new pull request #2721: URL: https://github.com/apache/hudi/pull/2721

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

Fix the StackOverflowError described in HUDI-1720. RealtimeCompactedRecordReader.next currently handles delete records by recursion, see: https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106 However, when the log file contains many delete records, this recursive logic leads to a StackOverflowError. We can use a loop instead of recursion.

## Brief change log

*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

Manually verified the change by running a job locally.

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
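To make the loop-instead-of-recursion direction concrete, here is a hedged sketch. It is not the actual PR diff: Rec, nextRecursive, and nextLoop are illustrative stand-ins for Hudi's record-reader types, showing only why the recursive form overflows on long runs of deletes while the loop form does not.

```java
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class SkipDeletesSketch {
    // Illustrative stand-in for a merged record; not Hudi's API.
    public static final class Rec {
        final long key; final boolean deleted;
        public Rec(long key, boolean deleted) { this.key = key; this.deleted = deleted; }
    }

    // Recursive style (what the report describes): one stack frame per
    // consecutive deleted record, so ~900k deletes overflow the stack.
    public static Rec nextRecursive(Iterator<Rec> it) {
        if (!it.hasNext()) return null;
        Rec r = it.next();
        return r.deleted ? nextRecursive(it) : r;
    }

    // Loop style (the proposed fix): constant stack depth regardless of
    // how many consecutive records were deleted.
    public static Rec nextLoop(Iterator<Rec> it) {
        while (it.hasNext()) {
            Rec r = it.next();
            if (!r.deleted) return r;
        }
        return null;
    }

    public static void main(String[] args) {
        // 900_000 deleted records followed by one live record: the loop
        // version skips them all without growing the call stack.
        List<Rec> recs = LongStream.range(0, 900_001)
                .mapToObj(k -> new Rec(k, k < 900_000))
                .collect(Collectors.toList());
        System.out.println(nextLoop(recs.iterator()).key); // prints 900000
    }
}
```

Calling nextRecursive on the same input would build ~900k stack frames, matching the repeated RealtimeCompactedRecordReader.java:106 frames in the reported trace.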
[GitHub] [hudi] xiarixiaoyao commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
xiarixiaoyao commented on pull request #2720: URL: https://github.com/apache/hudi/pull/2720#issuecomment-806562011

test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1

Before the patch, test steps:

step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// create a hudi table with three partition levels: p, p1, p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert the current table
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is earlier than the earliest commit, so the query covers all commits
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300' and `keyid` < 5;

query result:

+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 1      |
+----+-----+-----+--------+

This result is wrong: in step 2 we inserted new data into partition p2=7, yet the query result contains no rows with p2=7; every row shows p2=6.

After the patch:

+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 7   | 4      |
| 0  | 0   | 7   | 0      |
| 0  | 0   | 7   | 3      |
| 0  | 0   | 7   | 2      |
| 0  | 0   | 7   | 1      |
+----+-----+-----+--------+

This result is correct.

The merge function used above:

def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, tableName: String,
    tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
    hivePartitionExtract: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    op: String = "upsert"): Unit = {
  val mode = if (op.equals("bulk_insert")) { Overwrite } else { Append }
  df.write.format("hudi").
    option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
    option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
    option(PRECOMBINE_FIELD_OPT_KEY, "col3").
    option(RECORDKEY_FIELD_OPT_KEY, "keyid").
    option(PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2").
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, op).
    option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[ComplexKeyGenerator].getName).
    option("hoodie.bulkinsert.shuffle.parallelism", par.toString).
    option("hoodie.metadata.enable", "false").
    option("hoodie.insert.shuffle.parallelism", par.toString).
    option("hoodie.upsert.shuffle.parallelism", par.toString).
    option("hoodie.delete.shuffle.parallelism", par.toString).
    option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
    option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "p,p1,p2").
    option("hoodie.datasource.hive_sync.support_timestamp", "true").
    option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
    option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
    option(HIVE_USE_JDBC_OPT_KEY, "false").
    option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, db).
    option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName).
    option(TABLE_NAME, tableName).mode(mode).save(s"/tmp/${db}/${tableName}")
}

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1719: - Labels: pull-request-available (was: )

> hive on spark/mr, Incremental query of the mor table, the partition field is
> incorrect
> -
>
> Key: HUDI-1719
> URL: https://issues.apache.org/jira/browse/HUDI-1719
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0, 0.8.0
> Environment: spark2.4.5, hadoop 3.1.1, hive 3.1.1
> Reporter: tao meng
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Hudi now uses HoodieCombineHiveInputFormat to implement incremental queries of
> the MOR table.
> When we have some small files in different partitions,
> HoodieCombineHiveInputFormat combines those small-file readers.
> HoodieCombineHiveInputFormat builds the partition field based on the first file
> reader it holds, yet it also holds other file readers that come from different
> partitions.
> When switching readers, we should update the ioctx.
>
> test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
> test steps:
> step1:
> val df = spark.range(0, 1).toDF("keyid")
>   .withColumn("col3", expr("keyid + 1000"))
>   .withColumn("p", lit(0))
>   .withColumn("p1", lit(0))
>   .withColumn("p2", lit(6))
>   .withColumn("a1", lit(Array[String]("sb1", "rz")))
>   .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // create hudi table which has three-level partitions p,p1,p2
> merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
>
> step2:
> val df = spark.range(0, 1).toDF("keyid")
>   .withColumn("col3", expr("keyid + 1000"))
>   .withColumn("p", lit(0))
>   .withColumn("p1", lit(0))
>   .withColumn("p2", lit(7))
>   .withColumn("a1", lit(Array[String]("sb1", "rz")))
>   .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // upsert current table
> merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
>
> hive beeline:
> set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
> set hoodie.hive_8b.consume.mode=INCREMENTAL;
> set hoodie.hive_8b.consume.max.commits=3;
> set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits
> select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300' and `keyid` < 5;
>
> query result:
> +----+-----+-----+--------+
> | p  | p1  | p2  | keyid  |
> +----+-----+-----+--------+
> | 0  | 0   | 6   | 0      |
> | 0  | 0   | 6   | 1      |
> | 0  | 0   | 6   | 2      |
> | 0  | 0   | 6   | 3      |
> | 0  | 0   | 6   | 4      |
> | 0  | 0   | 6   | 4      |
> | 0  | 0   | 6   | 0      |
> | 0  | 0   | 6   | 3      |
> | 0  | 0   | 6   | 2      |
> | 0  | 0   | 6   | 1      |
> +----+-----+-----+--------+
> This result is wrong: in the second step we inserted new data with p2=7, yet the query result contains no rows with p2=7; every row shows p2=6.
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
xiarixiaoyao opened a new pull request #2720: URL: https://github.com/apache/hudi/pull/2720 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Fix the bug that HoodieCombineHiveInputFormat cannot build the correct partition field. Hudi now uses HoodieCombineHiveInputFormat to implement incremental queries of the MOR table. When we have some small files in different partitions (file1 from partition p=6, file2 from partition p=7), HoodieCombineHiveInputFormat combines those small-file readers. HoodieCombineHiveInputFormat builds the partition field based on the first file reader (assume file1's reader, from partition p=6), yet it also holds other file readers (file2's reader, from partition p=7) that come from different partitions. When switching readers, we should update the ioctx: https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieCombineRealtimeRecordReader.java#L73 ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request This pull request is already covered by existing tests, such as *(please describe tests)*. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
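The reader-switch bug described in the PR above can be sketched outside of Hudi. The following is an illustrative, self-contained Java simulation — the class and field names are hypothetical, not Hudi source. A combined reader that stamps every row with the first underlying reader's partition values reproduces the wrong result from the bug report, while refreshing the values on each reader switch (as the fix does for the ioctx) yields the correct one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CombineReaderDemo {
    // Each "file reader" carries the partition values of the file it reads.
    static class FileReader {
        final Map<String, String> partitionValues;
        final List<Integer> keys;
        FileReader(Map<String, String> partitionValues, List<Integer> keys) {
            this.partitionValues = partitionValues;
            this.keys = keys;
        }
    }

    // Buggy combine: stamps every row with the FIRST reader's partition values.
    static List<String> combineBuggy(List<FileReader> readers) {
        Map<String, String> fixed = readers.get(0).partitionValues;
        List<String> rows = new ArrayList<>();
        for (FileReader r : readers)
            for (int k : r.keys) rows.add(fixed.get("p2") + ":" + k);
        return rows;
    }

    // Fixed combine: refreshes partition values on every reader switch.
    static List<String> combineFixed(List<FileReader> readers) {
        List<String> rows = new ArrayList<>();
        for (FileReader r : readers) {
            Map<String, String> current = r.partitionValues; // refreshed per reader
            for (int k : r.keys) rows.add(current.get("p2") + ":" + k);
        }
        return rows;
    }

    public static void main(String[] args) {
        FileReader f6 = new FileReader(Map.of("p2", "6"), List.of(0, 1));
        FileReader f7 = new FileReader(Map.of("p2", "7"), List.of(0, 1));
        System.out.println(combineBuggy(List.of(f6, f7))); // every row claims p2=6
        System.out.println(combineFixed(List.of(f6, f7))); // rows from f7 show p2=7
    }
}
```

This mirrors the symptom in the bug report: the buggy combine makes every row appear to come from the first reader's partition, which is why the incremental query showed only p2=6.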
[jira] [Updated] (HUDI-1721) run_sync_tool support hive3.1.2 on hadoop3.1.4
[ https://issues.apache.org/jira/browse/HUDI-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1721: - Labels: pull-request-available (was: ) > run_sync_tool support hive3.1.2 on hadoop3.1.4 > --- > > Key: HUDI-1721 > URL: https://issues.apache.org/jira/browse/HUDI-1721 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.8.0 >Reporter: 谢波 >Priority: Major > Labels: pull-request-available > > [https://github.com/apache/hudi/issues/2717] > run_sync_tool support hive3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] MyLanPangzi opened a new pull request #2719: [HUDI-1721] run_sync_tool support hive3
MyLanPangzi opened a new pull request #2719: URL: https://github.com/apache/hudi/pull/2719 run_sync_tool support hive3 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request run_sync_tool support hive3 ## Brief change log - *Modify run_sync_tool.sh add HIVE_CALCITE and HIVE_LIBFB303 variable* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-1721) run_sync_tool support hive3.1.2 on hadoop3.1.4
谢波 created HUDI-1721: Summary: run_sync_tool support hive3.1.2 on hadoop3.1.4 Key: HUDI-1721 URL: https://issues.apache.org/jira/browse/HUDI-1721 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.8.0 Reporter: 谢波 [https://github.com/apache/hudi/issues/2717] run_sync_tool support hive3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1719: --- Description:

Hudi now uses HoodieCombineHiveInputFormat to implement incremental queries of the MOR table. When we have some small files in different partitions, HoodieCombineHiveInputFormat combines those small-file readers. HoodieCombineHiveInputFormat builds the partition field based on the first file reader it holds, yet it also holds other file readers that come from different partitions. When switching readers, we should update the ioctx.

test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
test steps:
step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// create hudi table which has three-level partitions p,p1,p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert current table
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300' and `keyid` < 5;

query result:
+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 1      |
+----+-----+-----+--------+
This result is wrong: in the second step we inserted new data with p2=7, yet the query result contains no rows with p2=7; every row shows p2=6.

was:

Hudi now uses HoodieCombineHiveInputFormat to implement incremental queries of the MOR table. When we have some small files in different partitions, HoodieCombineHiveInputFormat combines those small-file readers. HoodieCombineHiveInputFormat builds the partition field based on the first file reader it holds, yet it also holds other file readers that come from different partitions. When switching readers, we should update the ioctx.

test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
test steps:
step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// create hudi table which has three-level partitions p,p1,p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert current table
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300;
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300' and `keyid` < 5;

query result:
+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 1      |
+----+-----+-----+--------+
This result is wrong: in the second step we inserted new data with p2=7, yet the query result contains no rows with p2=7; every row shows p2=6.

> hive on spark/mr, Incremental query of the mor table, the partition field is
> incorrect
> -
>
> Key: HUDI-1719
> URL: https://issues.apache.org/jira/browse/HUDI-1719
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive
[jira] [Updated] (HUDI-1495) Bump Flink version to 1.12.0
[ https://issues.apache.org/jira/browse/HUDI-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-1495: - Summary: Bump Flink version to 1.12.0 (was: Upgrade Flink version to 1.12.0) > Bump Flink version to 1.12.0 > > > Key: HUDI-1495 > URL: https://issues.apache.org/jira/browse/HUDI-1495 > Project: Apache Hudi > Issue Type: Task > Components: newbie >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: easyfix, pull-request-available > Fix For: 0.9.0 > > > Apache Flink 1.12.0 has been released; upgrade the version to 1.12.0 in > order to adapt to the new Flink interfaces. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] danny0405 opened a new pull request #2718: [HUDI-1495] Bump Flink version to 1.12.2
danny0405 opened a new pull request #2718: URL: https://github.com/apache/hudi/pull/2718 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] MyLanPangzi opened a new issue #2717: [SUPPORT] run_sync_tool support hive3.1.2 on hadoop3.1.4
MyLanPangzi opened a new issue #2717:
URL: https://github.com/apache/hudi/issues/2717

**_Tips before filing an issue_**
- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

**Can we modify run_sync_tool.sh?**

**I have a hoodie table t1 on hadoop, and I want to import it into Hive. When I execute the import shell, I get an error:**

./run_sync_tool.sh \
  --user hive \
  --pass '' \
  --jdbc-url jdbc:hive2://yh001:1 \
  --partitioned-by partition \
  --base-path hdfs://yh001:9820/hudi/t1 \
  --database default \
  --table t1

2021-03-25 18:28:27,011 INFO [main] hive.HoodieHiveClient (HoodieHiveClient.java:(87)) - Creating hive connection jdbc:hive2://yh001:1
2021-03-25 18:28:27,283 INFO [main] hive.HoodieHiveClient (HoodieHiveClient.java:createHiveConnection(437)) - Successfully established Hive connection to jdbc:hive2://yh001:1
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/calcite/plan/RelOptRule
at org.apache.hudi.hive.HoodieHiveClient.(HoodieHiveClient.java:91)
at org.apache.hudi.hive.HiveSyncTool.(HiveSyncTool.java:69)
at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:249)
Caused by: java.lang.ClassNotFoundException: org.apache.calcite.plan.RelOptRule
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 3 more

**After updating the shell:**

HIVE_CALCITE=`ls ${HIVE_HOME}/lib/calcite-*.jar | tr '\n' ':'`
HIVE_JARS=$HIVE_METASTORE:$HIVE_SERVICE:$HIVE_EXEC:$HIVE_JDBC:$HIVE_JACKSON:$HIVE_CALCITE

**I get another error:**

2021-03-25 18:30:51,256 INFO [main] hive.HoodieHiveClient (HoodieHiveClient.java:createHiveConnection(437)) - Successfully established Hive connection to jdbc:hive2://yh001:1
Exception in thread "main" java.lang.NoClassDefFoundError: com/facebook/fb303/FacebookService$Iface
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.hive.metastore.utils.JavaUtils.getClass(JavaUtils.java:52)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:146)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:119)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:4299)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4367)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:4347)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:4603)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:291)
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:274)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:435)
at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:375)
at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:355)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:331)
at org.apache.hudi.hive.HoodieHiveClient.(HoodieHiveClient.java:91)
at org.apache.hudi.hive.HiveSyncTool.(HiveSyncTool.java:69)
at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:249)
Caused by: java.lang.ClassNotFoundException: com.facebook.fb303.FacebookService$Iface
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at
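The two NoClassDefFoundError failures in the issue above both come from jars missing from the classpath that run_sync_tool.sh builds. Below is a hypothetical diagnostic sketch (not part of Hudi, and the directory layout and jar-name prefixes are assumptions based on the errors reported here) that checks whether a Hive lib directory contains the jar families needed on Hive 3 — calcite-* for RelOptRule and libfb303-* for FacebookService — before the sync tool fails at runtime.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class HiveJarCheck {
    // Return the jar-name prefixes for which no matching jar exists in libDir.
    static List<String> missing(File libDir, String... prefixes) {
        String[] names = libDir.list();
        List<String> absent = new ArrayList<>();
        for (String prefix : prefixes) {
            boolean found = false;
            if (names != null)
                for (String n : names)
                    if (n.startsWith(prefix) && n.endsWith(".jar")) { found = true; break; }
            if (!found) absent.add(prefix);
        }
        return absent;
    }

    public static void main(String[] args) {
        // HIVE_HOME and the default path are assumptions for illustration.
        File lib = new File(System.getenv().getOrDefault("HIVE_HOME", "/opt/hive"), "lib");
        System.out.println("missing jar families: " + missing(lib, "calcite-", "libfb303-"));
    }
}
```

Running this before the sync tool would surface both gaps at once, instead of fixing one ClassNotFoundException only to hit the next.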
[GitHub] [hudi] xiarixiaoyao edited a comment on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
xiarixiaoyao edited a comment on pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#issuecomment-806535452

test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1

before patch:

step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// bulk_insert df, partition by p,p1,p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert table hive8b
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step3: start hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300'

2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: **p,p1,p2**
at org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151)
at org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81)
at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403)
at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396)
at org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

after patch:
+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 5068   |
| 0  | 0   | 6   | 6058   |
| 0  | 0   | 6   | 823    |
| 0  | 0   | 6   | 5031   |
| 0  | 0   | 6   | 4445   |
| 0  | 0   | 6   | 5082   |

merge function:
def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, tableName: String,
          tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
          hivePartitionExtract: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
          op: String = "upsert"): Unit = {
  val mode = if (op.equals("bulk_insert")) {
    Overwrite
  } else {
    Append
  }
  df.write.format("hudi").
    option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
    option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
[GitHub] [hudi] xiarixiaoyao commented on pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
xiarixiaoyao commented on pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#issuecomment-806535452

test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1

before patch:

step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// bulk_insert df, partition by p,p1,p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert table hive8b
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step3: start hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300'

2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: **p,p1,p2**
at org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151)
at org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81)
at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403)
at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396)
at org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47)
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

after patch:
+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 5068   |
| 0  | 0   | 6   | 6058   |
| 0  | 0   | 6   | 823    |
| 0  | 0   | 6   | 5031   |
| 0  | 0   | 6   | 4445   |
| 0  | 0   | 6   | 5082   |

merge function:
def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, tableName: String,
          tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
          hivePartitionExtract: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
          op: String = "upsert"): Unit = {
  val mode = if (op.equals("bulk_insert")) {
    Overwrite
  } else {
    Append
  }
  df.write.format("hudi").
    option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
    option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
[jira] [Updated] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed
[ https://issues.apache.org/jira/browse/HUDI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1718: - Labels: pull-request-available (was: ) > when query incr view of mor table which has Multi level partitions, the > query failed > - > > Key: HUDI-1718 > URL: https://issues.apache.org/jira/browse/HUDI-1718 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.7.0, 0.8.0 >Reporter: tao meng >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > HoodieCombineHiveInputFormat uses "," to join multiple partitions, while Hive > uses "/" to join them. There is a gap between the two, so > HoodieCombineHiveInputFormat's logic is modified. > test env: > spark2.4.5, hadoop 3.1.1, hive 3.1.1 > > step1: > val df = spark.range(0, 1).toDF("keyid") > .withColumn("col3", expr("keyid + 1000")) > .withColumn("p", lit(0)) > .withColumn("p1", lit(0)) > .withColumn("p2", lit(6)) > .withColumn("a1", lit(Array[String]("sb1", "rz"))) > .withColumn("a2", lit(Array[String]("sb1", "rz"))) > // bulk_insert df, partition by p,p1,p2 > merge(df, 4, "default", "hive_8b", > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") > step2: > val df = spark.range(0, 1).toDF("keyid") > .withColumn("col3", expr("keyid + 1000")) > .withColumn("p", lit(0)) > .withColumn("p1", lit(0)) > .withColumn("p2", lit(7)) > .withColumn("a1", lit(Array[String]("sb1", "rz"))) > .withColumn("a2", lit(Array[String]("sb1", "rz"))) > // upsert table hive8b > merge(df, 4, "default", "hive_8b", > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") > step3: > start hive beeline: > set > hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; > set hoodie.hive_8b.consume.mode=INCREMENTAL; > set hoodie.hive_8b.consume.max.commits=3; > set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp > is smaller than the earliest commit, so we can query all commits > select
`p`, `p1`, `p2`,`keyid` from hive_8b_rt where > `_hoodie_commit_time`>'20210325141300' > > 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics > report from attempt_1615883368881_0028_m_00_3: Error: > org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: > p,p1,p2 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | > Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: > org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: > p,p1,p2 at > org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at > org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at > org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at > org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at > org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268) > at > org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286) > at > org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98) > at > org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67) > at > org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53) > at > org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70) > at > org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47) > at > org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556) > at > 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at > org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at > org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at > java.security.AccessController.doPrivileged(Native Method) at > javax.security.auth.Subject.doAs(Subject.java:422) at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) > at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) > -- This message was sent by Atlassian Jira (v8.3.4#803005)
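The failure above comes down to how partition columns are tokenized: Hive encodes a multi-level partition as a "/"-separated spec, while the buggy path joined the column names with ",", producing the single token "p,p1,p2" that Avro's schema validation rejects. A minimal sketch of the correct tokenization; the class and method names here are illustrative, not Hudi's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not the real HoodieCombineHiveInputFormat code:
// a Hive partition spec like "p=0/p1=0/p2=6" must be split on "/" to
// recover the individual column names. Joining them with "," instead
// yields the single token "p,p1,p2", which Avro's Schema.validateName
// rejects -- the SchemaParseException in the stack trace above.
public class PartitionSpecSketch {
  static List<String> partitionColumns(String partitionSpec) {
    List<String> cols = new ArrayList<>();
    for (String part : partitionSpec.split("/")) {
      cols.add(part.split("=")[0]); // "p=0" -> "p"
    }
    return cols;
  }
}
```

Splitting on "/" yields names like "p", "p1", "p2", each of which is a valid Avro field name, which is why the fix in PR #2716 targets the join/split logic.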
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
xiarixiaoyao opened a new pull request #2716: URL: https://github.com/apache/hudi/pull/2716 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* HoodieCombineHiveInputFormat uses "," to join multi-level partitions, whereas Hive uses "/"; this gap is why this PR modifies HoodieCombineHiveInputFormat's logic. *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. This pull request is already covered by existing tests ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed
[ https://issues.apache.org/jira/browse/HUDI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1718: --- Description: HoodieCombineHiveInputFormat uses "," to join multi-level partitions, whereas Hive uses "/"; this gap is why HoodieCombineHiveInputFormat's logic needs modifying. test env: spark 2.4.5, hadoop 3.1.1, hive 3.1.1 step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // bulk_insert df, partitioned by p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert table hive_8b merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") step3: start hive beeline: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; set hoodie.hive_8b.consume.mode=INCREMENTAL; set hoodie.hive_8b.consume.max.commits=3; set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where `_hoodie_commit_time`>'20210325141300' 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error:
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2 at org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268) at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286) at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98) at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47) at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) at 
org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) was: HoodieCombineHiveInputFormat uses "," to join multi-level partitions, whereas Hive uses "/"; this gap is why HoodieCombineHiveInputFormat's logic needs modifying. test env: spark 2.4.5, hadoop 3.1.1, hive 3.1.1 step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // bulk_insert df, partitioned by p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7))
[jira] [Created] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
tao meng created HUDI-1720: -- Summary: when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError Key: HUDI-1720 URL: https://issues.apache.org/jira/browse/HUDI-1720 Project: Apache Hudi Issue Type: Bug Components: Hive Integration, Spark Integration Affects Versions: 0.7.0, 0.8.0 Reporter: tao meng Fix For: 0.9.0 RealtimeCompactedRecordReader.next currently deals with delete records by recursion, see: [https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106] However, when the log file contains many delete records, this recursive logic in RealtimeCompactedRecordReader.next leads to a StackOverflowError. test step: step1: val df = spark.range(0, 100).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // bulk_insert 100w row (keyid from 0 to 100) merge(df, 4, "default", "hive_9b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 90).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // delete 90w row (keyid from 0 to 90) delete(df, 4, "default", "hive_9b") step3: query on beeline/spark-sql : select count(col3) from hive_9b_rt 2021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:10382021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:10382021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError at
org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83) at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39) at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344) at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503) at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30) at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409) at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30) at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159) at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106) at
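The recursive call at RealtimeCompactedRecordReader.java:106 fires once per deleted record, so hundreds of thousands of consecutive deletes exhaust the stack, as the repeated frames above show. The usual fix is to turn the recursion into a loop so stack depth stays constant regardless of how many delete records appear in a row. A simplified, self-contained sketch under assumed names — this is not Hudi's actual reader interface:

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch (not the real Hudi signatures): the reader's next()
// recursed once per delete marker, so a long run of deletes overflows the
// stack. Rewriting the skip as a loop keeps stack depth constant.
public class IterativeReader {
  private final Iterator<String> rows; // null stands in for a delete marker

  IterativeReader(List<String> rows) {
    this.rows = rows.iterator();
  }

  // Returns the next live row, skipping delete markers iteratively.
  String next() {
    while (rows.hasNext()) {
      String row = rows.next();
      if (row == null) {
        continue; // was: `return next();` -- the recursion that overflows
      }
      return row;
    }
    return null; // input exhausted
  }
}
```

With the loop, ten or ten million consecutive delete markers cost the same single stack frame.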
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601290387 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -193,6 +195,14 @@ public String getPreCombineField() { return props.getProperty(HOODIE_TABLE_PRECOMBINE_FIELD); } + public Option getPartitionColumns() { Review comment: For previous versions of hudi, no partition columns are stored in hoodie.properties, so it returns `Option#empty` in that case. This lets us distinguish an empty set of partition columns from partition columns that were never stored. Other properties like `getBootstrapBasePath` also use Option. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
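The point of returning an Option here is the three-way distinction: property absent (tables written before partition columns were persisted), property present but empty (a non-partitioned table), and property populated. A sketch of that distinction using java.util.Optional as a stand-in for Hudi's Option; the property key matches the one discussed in the PR, but the helper itself is hypothetical:

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper, not the real HoodieTableConfig method:
//   Optional.empty()            -> property absent (pre-upgrade table)
//   Optional.of(new String[0])  -> property present but empty (unpartitioned)
//   Optional.of({...})          -> partition columns known
public class TableConfigSketch {
  static final String KEY = "hoodie.table.partition.columns";

  static Optional<String[]> getPartitionColumns(Properties props) {
    String v = props.getProperty(KEY);
    if (v == null) {
      return Optional.empty();           // written by a pre-upgrade writer
    }
    if (v.isEmpty()) {
      return Optional.of(new String[0]); // known to be non-partitioned
    }
    return Optional.of(v.split(","));
  }
}
```

Callers can then treat "unknown" differently from "known to have no partitions", which a plain null or empty array cannot express.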
[jira] [Created] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
tao meng created HUDI-1719: -- Summary: hive on spark/mr,Incremental query of the mor table, the partition field is incorrect Key: HUDI-1719 URL: https://issues.apache.org/jira/browse/HUDI-1719 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.7.0, 0.8.0 Environment: spark2.4.5, hadoop 3.1.1, hive 3.1.1 Reporter: tao meng Fix For: 0.9.0 Hudi currently uses HoodieCombineHiveInputFormat to implement incremental queries of the mor table. When there are small files in different partitions, HoodieCombineHiveInputFormat combines those small-file readers. It builds the partition field based on the first file reader it holds, yet it also holds file readers from other partitions; when switching readers, we should update the ioctx. test env: spark 2.4.5, hadoop 3.1.1, hive 3.1.1 test step: step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // create hudi table which has three level partitions p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert current table merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") hive beeline: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; set hoodie.hive_8b.consume.mode=INCREMENTAL; set hoodie.hive_8b.consume.max.commits=3; set hoodie.hive_8b.consume.start.timestamp=20210325141300; select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5; query result:
+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 1      |
+----+-----+-----+--------+
This result is wrong: in the second step we inserted new data with p2=7, yet no p2=7 rows appear in the result; every row shows p2=6. -- This message was sent by Atlassian Jira (v8.3.4#803005)
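The wrong p2 values come from deriving the partition values once, from the first file in the combined split, and never refreshing them. A combined reader must re-derive the partition values from the current child file's path each time it switches readers. An illustrative sketch, not the real HoodieCombineHiveInputFormat types:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the fix described above: when a combined split
// spans files from different partitions, the partition values must be
// re-derived from the *current* child file's path on every reader switch,
// not computed once from the first file (which stamped p2=6 on every row
// in the query result above).
public class CombinedReader {
  // "p=0/p1=0/p2=7/file.parquet" -> "p=0/p1=0/p2=7"
  static String partitionOf(String filePath) {
    return filePath.substring(0, filePath.lastIndexOf('/'));
  }

  // Emits the partition spec each row would be tagged with; re-derived
  // per child file, standing in for updating the ioctx on reader switch.
  static List<String> readAll(List<String> filePaths) {
    List<String> rows = new ArrayList<>();
    for (String file : filePaths) {
      String partition = partitionOf(file); // refreshed for every child reader
      rows.add(partition);                  // stand-in for emitting a row
    }
    return rows;
  }
}
```

With the per-file refresh, rows from the p2=7 file carry p2=7 instead of inheriting the first file's p2=6.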
[jira] [Created] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed
tao meng created HUDI-1718: -- Summary: when query incr view of mor table which has Multi level partitions, the query failed Key: HUDI-1718 URL: https://issues.apache.org/jira/browse/HUDI-1718 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.7.0, 0.8.0 Reporter: tao meng Fix For: 0.9.0 HoodieCombineHiveInputFormat uses "," to join multi-level partitions, whereas Hive uses "/"; this gap is why HoodieCombineHiveInputFormat's logic needs modifying. test env: spark 2.4.5, hadoop 3.1.1, hive 3.1.1 step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // bulk_insert df, partitioned by p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert table hive_8b merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") step3: start hive beeline: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; set hoodie.hive_8b.consume.mode=INCREMENTAL; set hoodie.hive_8b.consume.max.commits=3; set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from
attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2 at org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268) at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286) at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98) at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47) at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where `_hoodie_commit_time`>'20210325141300' -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-806486974 > @pengzhiwei2018 Thank you for your reply. > spark3 has its own analysis logic and optimization logic, and those logics are continually enhanced. > for example: > [apache/spark@d99135b](https://github.com/apache/spark/commit/d99135b66ab4ebab69973800990e12845c4ffeaa) > [apache/spark@1ad3432](https://github.com/apache/spark/commit/1ad343238cb82a79e51ae9d46ae704bc482ff437) > [apache/spark@4d56d43](https://github.com/apache/spark/commit/4d56d438386049b5f481ec83b69e3c89807be201) > and so on. > I think it will be better to respect spark3's logic. Thanks for the input, @xiarixiaoyao. I will take a look at this~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
xiarixiaoyao commented on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-806485150 @pengzhiwei2018 Thank you for your reply. spark3 has its own analysis logic and optimization logic, and those logics are continually enhanced. for example: https://github.com/apache/spark/commit/d99135b66ab4ebab69973800990e12845c4ffeaa https://github.com/apache/spark/commit/1ad343238cb82a79e51ae9d46ae704bc482ff437 https://github.com/apache/spark/commit/4d56d438386049b5f481ec83b69e3c89807be201 and so on. I think it will be better to respect spark3's logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601225876 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala ## @@ -79,39 +81,52 @@ class DefaultSource extends RelationProvider val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths val fs = FSUtils.getFs(allPaths.head, sqlContext.sparkContext.hadoopConfiguration) -val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs) - -val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray) +// Use the HoodieFileIndex only if the 'path' has specified with no "*" contains. +// And READ_PATHS_OPT_KEY has not specified. +// Or else we use the original way to read hoodie table. +val useHoodieFileIndex = path.isDefined && !path.get.contains("*") && Review comment: Hi @vinothchandar , currently a user querying the hoodie table must specify some "*" globs in the path. If the path does not contain them, the old way of querying the hoodie table may not work (it throws an exception or returns no data). So we must use the `HoodieFileIndex` for this case by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
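The condition under review can be stated compactly: fall back to glob-based listing whenever the caller supplied a "*" glob or explicit read paths, and use the new HoodieFileIndex otherwise. A sketch of just that predicate — names are illustrative; the real code lives in DefaultSource.scala:

```java
import java.util.List;

// Illustrative predicate mirroring the review discussion: the new
// HoodieFileIndex is used only for a plain table path -- no "*" glob
// and no explicit read paths -- otherwise the old glob listing applies.
public class FileIndexChoice {
  static boolean useHoodieFileIndex(String path, List<String> readPaths) {
    return path != null && !path.contains("*") && readPaths.isEmpty();
  }
}
```

This keeps the legacy glob behavior intact for existing callers while letting plain table paths benefit from partition pruning via the file index.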
[GitHub] [hudi] aditiwari01 edited a comment on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution
aditiwari01 edited a comment on issue #2675: URL: https://github.com/apache/hudi/issues/2675#issuecomment-806450032 > * do you use the RowBasedSchemaProvider and hence can't explicitly provide schema? If you were to use your own schema registry, you might as well provide an updated schema to hudi while writing. > * got it. would be nice to have some contribution. I can help review the patch. > In the mean time, I will give schema evolution a try on my end with some local set up. 1. Yes, RowBasedSchemaProvider is used in creating the GenericRecord. Not sure how to pass a row schema with defaults to the spark writer (if at all it's possible). Also, I can't find any way to pass a custom SchemaProvider class in the writer options. 2. The idea for the patch is to update the avro schema with a default of null for all nullable fields. If you wish, you can refer to the function **getAvroSchemaWithDefs** in my local branch: https://github.com/aditiwari01/hudi/commit/68a7ad20cc570f88940f2481d0ae19986254f1d8#diff-21ccee08cebace9de801e104f356bf4333c017d5d5c0cac4e3de63c7861f3c13R58 This is a bit unstructured though. Will try to have a formal PR with tests and all by next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601225876 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala ## @@ -79,39 +81,52 @@ class DefaultSource extends RelationProvider val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths val fs = FSUtils.getFs(allPaths.head, sqlContext.sparkContext.hadoopConfiguration) -val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs) - -val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray) +// Use the HoodieFileIndex only if the 'path' has specified with no "*" contains. +// And READ_PATHS_OPT_KEY has not specified. +// Or else we use the original way to read hoodie table. +val useHoodieFileIndex = path.isDefined && !path.get.contains("*") && Review comment: Hi @vinothchandar , currently a user querying the hoodie table must specify some "*" globs in the path. If the path does not contain them, the old way of querying the hoodie table may not work (it throws an exception or returns no data). So we must use the `HoodieFileIndex` for this case by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601220471 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter { val archiveLogFolder = parameters.getOrElse( HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived") +val partitionColumns = parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null) Review comment: Hi @vinothchandar , we need to get the partition schema and pass it to the `HadoopFsRelation`, just like this: HadoopFsRelation( fileIndex, fileIndex.partitionSchema, fileIndex.dataSchema, bucketSpec = None, fileFormat = new ParquetFileFormat, optParams)(sqlContext.sparkSession) It is hard to construct the partition schema from the generator, so we need the partition columns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601214645 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -57,6 +58,7 @@ public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type"; public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version"; public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field"; + public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns"; Review comment: Hi @vinothchandar, we need the partition schema for partition pruning in Spark SQL, so the partition columns are needed for that; otherwise we cannot derive the partition schema. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
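[Editorial note] As a rough illustration of what the new table property in the diff above buys us, here is a minimal sketch of reading the proposed `hoodie.table.partition.columns` key back out of a `hoodie.properties` file. The key name comes from the diff; the demo class and helper method are hypothetical, not Hudi code:

```java
import java.io.StringReader;
import java.util.Optional;
import java.util.Properties;

public class PartitionColumnsDemo {
    // Hypothetical reader for the proposed hoodie.table.partition.columns key.
    // Returns empty for older tables that never persisted partition columns.
    static Optional<String[]> getPartitionColumns(Properties props) {
        return Optional.ofNullable(props.getProperty("hoodie.table.partition.columns"))
            .map(v -> v.split(","));
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(
            "hoodie.table.name=test_tbl\n"
                + "hoodie.table.type=COPY_ON_WRITE\n"
                + "hoodie.table.partition.columns=dt,hh\n"));
        String[] cols = getPartitionColumns(props).orElse(new String[0]);
        System.out.println(String.join(",", cols)); // prints "dt,hh"
    }
}
```

With the columns available, the reader can look each name up in the table schema to rebuild the partition `StructType` discussed above.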
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601213171 ## File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java ## @@ -276,6 +276,13 @@ public static void processFiles(FileSystem fs, String basePathStr, Function
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601212827 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala ## @@ -0,0 +1,349 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi + +import java.util.Properties + +import scala.collection.JavaConverters._ +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration} +import org.apache.hudi.common.engine.HoodieLocalEngineContext +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.HoodieBaseFile +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.view.HoodieTableFileSystemView +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.{InternalRow, expressions} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.avro.SchemaConverters +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate} +import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils} +import org.apache.spark.sql.execution.datasources.{FileIndex, FileStatusCache, NoopCache, PartitionDirectory, PartitionUtils} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +import scala.collection.mutable + +/** + * A File Index which support partition prune for hoodie snapshot and read-optimized + * query. + * Main steps to get the file list for query: + * 1、Load all files and partition values from the table path. + * 2、Do the partition prune by the partition filter condition. 
+ * + * There are 3 cases for this: + * 1、If the number of partition columns equals the actual partition path depth, we + * read it as a partitioned table (e.g. the partition column is "dt" and the partition path is "2021-03-10"). + * + * 2、If the number of partition columns does not equal the partition path depth, but there is + * only one partition column (e.g. the partition column is "dt" but the partition path is "2021/03/10", + * whose directory depth is 3), we can still read it as a partitioned table: we map the + * partition path (e.g. 2021/03/10) to the single partition column (e.g. "dt"). + * + * 3、Otherwise, when the number of partition columns does not equal the partition directory depth and + * is greater than 1 (e.g. the partition columns are "dt,hh" and the partition path is "2021/03/10/12"), + * we read it as a non-partitioned table, because we cannot know how to map the partition + * path to the partition columns in this case. + */ +case class HoodieFileIndex( + spark: SparkSession, + metaClient: HoodieTableMetaClient, + schemaSpec: Option[StructType], + options: Map[String, String], + @transient fileStatusCache: FileStatusCache = NoopCache) + extends FileIndex with Logging { + + private val basePath = metaClient.getBasePath + + @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required")) + /** +* Get the schema of the table. +*/ + lazy val schema: StructType = schemaSpec.getOrElse({ +val schemaUtil = new TableSchemaResolver(metaClient) +SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema) + .dataType.asInstanceOf[StructType] + }) + + /** +* Get the partition schema from the hoodie.properties.
+*/ + private lazy val _partitionSchemaFromProperties: StructType = { +val tableConfig = metaClient.getTableConfig +val partitionColumns = tableConfig.getPartitionColumns +val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap + +if (partitionColumns.isPresent) { + val partitionFields = partitionColumns.get().map(column => +nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column: '" + +
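[Editorial note] The three partition-layout cases from the `HoodieFileIndex` scaladoc above can be sketched as pure logic. This is illustrative only — the real class applies this reasoning while building Spark partition directories — and all names below are invented for the sketch:

```java
public class PartitionMappingDemo {
    // The three cases from the HoodieFileIndex scaladoc, as a pure function.
    enum Mode { PARTITIONED, SINGLE_COLUMN_MAPPED, NON_PARTITIONED }

    static Mode mappingMode(String[] partitionColumns, String partitionPath) {
        int pathDepth = partitionPath.split("/").length;
        if (partitionColumns.length == pathDepth) {
            return Mode.PARTITIONED;           // case 1: "dt" -> "2021-03-10"
        } else if (partitionColumns.length == 1) {
            return Mode.SINGLE_COLUMN_MAPPED;  // case 2: "dt" -> "2021/03/10"
        } else {
            return Mode.NON_PARTITIONED;       // case 3: "dt,hh" vs "2021/03/10/12"
        }
    }

    public static void main(String[] args) {
        System.out.println(mappingMode(new String[]{"dt"}, "2021-03-10"));    // PARTITIONED
        System.out.println(mappingMode(new String[]{"dt"}, "2021/03/10"));    // SINGLE_COLUMN_MAPPED
    }
}
```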
[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 edited a comment on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-806465222 > Let me see how/if we can simplify the inputSchema vs writeSchema thing. > > I went over the PR now. LGTM at a high level. > Few questions though > > * I see we are introducing some antlr parsing and inject a custom parse for spark 2.x. Is this done for backwards compat with Spark 2 and will be eventually removed? > * Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 2 syntax different. Can you comment on how we are approaching all this. > * Have you done any production testing of this PR? > > cc @kwondw could you also please chime in. We would like to land something basic and iterate and get this out for 0.9.0 next month. Thanks for your review @vinothchandar! > I see we are introducing some antlr parsing and inject a custom parse for spark 2.x. Is this done for backwards compat with Spark 2 and will be eventually removed? Yes, it is for backward compatibility with Spark 2 and will eventually be removed, provided Spark 3 needs no further syntax extension. > Do we reuse the MERGE/DELETE keywords from Spark 3? Is Spark 3 and Spark 2 syntax different. Can you comment on how we are approaching all this. Yes, I reused the extended syntax (MERGE) from Spark 3, so the syntax is the same between Spark 2 and Spark 3. For Spark 3, Spark itself can recognize the MERGE/DELETE syntax and parse it to a LogicalPlan. For Spark 2, our extended SQL parser will also parse it to the same LogicalPlan. After parsing, the LogicalPlan goes through the same rules (in `HoodieAnalysis`) to be resolved and rewritten into a Hoodie command, which translates the logical plan into Hudi API calls. The Hoodie commands are shared between Spark 2 and Spark 3, so everything except the Spark 2 SQL parser is shared between the two. > Have you done any production testing of this PR? Yes, I have tested it in Aliyun's EMR cluster. And more test cases will be done this week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
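[Editorial note] The parser arrangement described above — an extended grammar for MERGE/DELETE that falls back to Spark's own parser for everything else — can be modeled with a toy delegation sketch. Class and method names here are invented; the real implementation uses ANTLR and Spark's parser interfaces:

```java
import java.util.Locale;
import java.util.function.Function;

public class ExtendedParserDemo {
    // Toy model of the delegation: try the extended MERGE/DELETE grammar
    // first, and hand everything else to the session's original parser
    // (represented here as a plain Function<String, String>).
    static String parse(String sql, Function<String, String> sessionParser) {
        String s = sql.trim().toUpperCase(Locale.ROOT);
        if (s.startsWith("MERGE INTO")) {
            return "MergeIntoTable";   // Spark 2 path builds the same plan node as Spark 3
        }
        if (s.startsWith("DELETE FROM")) {
            return "DeleteFromTable";
        }
        return sessionParser.apply(sql); // delegate to Spark's own parser
    }

    public static void main(String[] args) {
        System.out.println(parse("MERGE INTO t USING s ON t.id = s.id", q -> "SparkPlan")); // MergeIntoTable
        System.out.println(parse("SELECT * FROM t", q -> "SparkPlan"));                     // SparkPlan
    }
}
```

Because both Spark versions end up with the same plan node, the downstream resolution rules and commands can be shared, exactly as the comment argues.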
[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-806471861 > @pengzhiwei2018 it's a great work, thanks. MERGE/DELETE expresssions(MergeIntoTable, MergeAction... ) in your pr is copy from v2Commands.scala in spark 3.0 which will lead to class conflict if we implement sql suppor for spark3 later. could you shade those keywords > > @vinothchandar > MERGE/DELETE keywords from Spark 3 is incompatible with this pr, however it's not a problem,we can introduce a extra module hudi-spark3-extensions to resolve those incompatible, i will put forward a new pr for spark3 next few days。 Hi @xiarixiaoyao, thanks for the review! There already exists a `hudi-spark3` module in the hudi project, so I don't think we need to add a hudi-spark3-extensions module. The `MergeIntoTable` has the same package and class name as the one in Spark 3, so both can use the same rules in `HoodieAnalysis` for resolution and rewriting. I think it will not conflict with Spark 3 because we do not build the spark2 module together with the spark3 module; only one `MergeIntoTable` will be packaged in the bundle jar. > MERGE/DELETE keywords from Spark 3 is incompatible with this pr I do not understand the incompatibility: they have the same syntax and logical plan between Spark 2 and Spark 3. Can you help me out? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601159174 ## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/HoodieBaseSqlTest.scala ## @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hudi + +import java.io.File + +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.util.Utils +import org.scalactic.source +import org.scalatest.{BeforeAndAfterAll, FunSuite, Tag} + +class HoodieBaseSqlTest extends FunSuite with BeforeAndAfterAll { Review comment: Yes, good suggestions! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2710: [RFC-20][HUDI-648] Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes
codecov-io edited a comment on pull request #2710: URL: https://github.com/apache/hudi/pull/2710#issuecomment-806393474 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=h1) Report > Merging [#2710](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=desc) (f9e9a12) into [master](https://codecov.io/gh/apache/hudi/commit/d7b18783bdd6edd6355ee68714982401d3321f86?el=desc) (d7b1878) will **decrease** coverage by `0.12%`. > The diff coverage is `6.25%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2710/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=tree) ```diff @@ Coverage Diff @@ ## master #2710 +/- ## - Coverage 51.76% 51.63% -0.13% - Complexity 3601 3603 +2 Files 476 478 +2 Lines 22583 22641 +58 Branches 2408 2417 +9 + Hits 11689 11691 +2 - Misses 9877 9928 +51 - Partials 1017 1022 +5 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `50.73% <2.22%> (-0.21%)` | `0.00 <1.00> (ø)` | | | hudiflink | `54.08% <ø> (-0.20%)` | `0.00 <ø> (ø)` | | | hudihadoopmr | `33.42% <66.66%> (-0.02%)` | `0.00 <0.00> (ø)` | | | hudisparkdatasource | `70.87% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisync | `45.58% <ø> (ø)` | `0.00 <ø> (ø)` | | | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiutilities | `69.73% <ø> (ø)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2710?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...che/hudi/common/config/HoodieErrorTableConfig.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVFcnJvclRhYmxlQ29uZmlnLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...on/model/OverwriteWithLatestAvroSchemaPayload.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvU2NoZW1hUGF5bG9hZC5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | | | [...va/org/apache/hudi/common/util/TablePathUtils.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvVGFibGVQYXRoVXRpbHMuamF2YQ==) | `59.52% <0.00%> (-8.05%)` | `17.00 <1.00> (+1.00)` | :arrow_down: | | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | `57.79% <ø> (ø)` | `0.00 <0.00> (ø)` | | | [...rg/apache/hudi/hadoop/HoodieROTablePathFilter.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZVJPVGFibGVQYXRoRmlsdGVyLmphdmE=) | `64.19% <66.66%> (-0.81%)` | `14.00 <0.00> (ø)` | | | [...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh) | `68.44% <100.00%> (+0.12%)` | `43.00 <0.00> (ø)` | | | 
[...java/org/apache/hudi/table/HoodieTableFactory.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZUZhY3RvcnkuamF2YQ==) | `72.72% <0.00%> (-5.33%)` | `11.00% <0.00%> (ø%)` | | | [.../java/org/apache/hudi/table/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9Ib29kaWVUYWJsZVNvdXJjZS5qYXZh) | `61.44% <0.00%> (-3.53%)` | `28.00% <0.00%> (ø%)` | | | [...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==) | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | | | ... and [2 more](https://codecov.io/gh/apache/hudi/pull/2710/diff?src=pr=tree-more) | |
[GitHub] [hudi] vinothchandar commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
vinothchandar commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601183166 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hudi.client.common.HoodieSparkEngineContext +import org.apache.hudi.common.config.{HoodieMetadataConfig, SerializableConfiguration} +import org.apache.hudi.common.engine.HoodieLocalEngineContext +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.HoodieBaseFile +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.common.table.view.HoodieTableFileSystemView +import org.apache.hudi.config.HoodieWriteConfig +import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.internal.Logging +import org.apache.spark.sql.catalyst.{InternalRow, expressions} +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.avro.SchemaConverters +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BoundReference, Expression, InterpretedPredicate} +import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, DateTimeUtils} +import org.apache.spark.sql.execution.datasources.{FileIndex, PartitionDirectory, PartitionUtils} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +/** + * A File Index which support partition prune for hoodie snapshot and read-optimized + * query. + * Main steps to get the file list for query: + * 1、Load all files and partition values from the table path. + * 2、Do the partition prune by the partition filter condition. + * + * There are 3 cases for this: + * 1、If the partition columns size is equal to the actually partition path level, we + * read it as partitioned table.(e.g partition column is "dt", the partition path is "2021-03-10") + * + * 2、If the partition columns size is not equal to the partition path level, but the partition + * column size is "1" (e.g. 
partition column is "dt", but the partition path is "2021/03/10" + * who'es directory level is 3).We can still read it as a partitioned table. We will mapping the + * partition path (e.g. 2021/03/10) to the only partition column (e.g. "dt"). + * + * 3、Else the the partition columns size is not equal to the partition directory level and the + * size is great than "1" (e.g. partition column is "dt,hh", the partition path is "2021/03/10/12") + * , we read it as a None Partitioned table because we cannot know how to mapping the partition + * path with the partition columns in this case. + */ +case class HoodieFileIndex( + spark: SparkSession, + basePath: String, + schemaSpec: Option[StructType], + options: Map[String, String]) + extends FileIndex with Logging { + + @transient private val hadoopConf = spark.sessionState.newHadoopConf() + private lazy val metaClient = HoodieTableMetaClient +.builder().setConf(hadoopConf).setBasePath(basePath).build() + + @transient private val queryPath = new Path(options.getOrElse("path", "'path' option required")) + /** +* Get the schema of the table. +*/ + lazy val schema: StructType = schemaSpec.getOrElse({ +val schemaUtil = new TableSchemaResolver(metaClient) +SchemaConverters.toSqlType(schemaUtil.getTableAvroSchema) + .dataType.asInstanceOf[StructType] + }) + + /** +* Get the partition schema from the hoodie.properties. +*/ + private lazy val _partitionSchemaFromProperties: StructType = { +val tableConfig = metaClient.getTableConfig +val partitionColumns = tableConfig.getPartitionColumns +val nameFieldMap = schema.fields.map(filed => filed.name -> filed).toMap + +if (partitionColumns.isPresent) { + val partitionFields = partitionColumns.get().map(column => +nameFieldMap.getOrElse(column, throw new IllegalArgumentException(s"Cannot find column " +
[GitHub] [hudi] vinothchandar commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
vinothchandar commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r601170320 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -57,6 +58,7 @@ public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type"; public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version"; public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field"; + public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns"; Review comment: I think we should persist the key generator class. and not the partition columns themselves? let me think over this more. ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -193,6 +195,14 @@ public String getPreCombineField() { return props.getProperty(HOODIE_TABLE_PRECOMBINE_FIELD); } + public Option getPartitionColumns() { Review comment: why not use an empty array to signify no partition columns? ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala ## @@ -79,39 +81,52 @@ class DefaultSource extends RelationProvider val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths val fs = FSUtils.getFs(allPaths.head, sqlContext.sparkContext.hadoopConfiguration) -val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs) - -val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray) +// Use the HoodieFileIndex only if the 'path' has specified with no "*" contains. +// And READ_PATHS_OPT_KEY has not specified. +// Or else we use the original way to read hoodie table. +val useHoodieFileIndex = path.isDefined && !path.get.contains("*") && Review comment: can we guard this feature itself with a data source option as well. 
i.e `val useHoodieFileIndex = hoodieFileIndexEnabled && path.isDefined && !path.get.contains("*") && ..` ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala ## @@ -112,12 +112,15 @@ private[hudi] object HoodieSparkSqlWriter { val archiveLogFolder = parameters.getOrElse( HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, "archived") +val partitionColumns = parameters.getOrElse(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, null) Review comment: the partition path can also be generated from a key generator, right? we need to think this case through more. ## File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java ## @@ -276,6 +276,13 @@ public static void processFiles(FileSystem fs, String basePathStr, Function
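[Editorial note] The guard the reviewer suggests could look roughly like this. The `hoodie.file.index.enable` key below is a placeholder for the sketch, not an actual Hudi option:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

public class FileIndexGuard {
    // Sketch of the suggested feature flag: only use the new HoodieFileIndex
    // path when the (hypothetical) flag is on, a concrete 'path' is given,
    // and the path contains no glob pattern.
    static boolean useHoodieFileIndex(Map<String, String> options, Optional<String> path) {
        boolean enabled = Boolean.parseBoolean(
            options.getOrDefault("hoodie.file.index.enable", "true"));
        return enabled && path.isPresent() && !path.get().contains("*");
    }

    public static void main(String[] args) {
        System.out.println(useHoodieFileIndex(Collections.emptyMap(), Optional.of("/tmp/tbl"))); // true
        System.out.println(useHoodieFileIndex(Collections.emptyMap(), Optional.of("/tmp/tbl/*"))); // false
    }
}
```

Defaulting the flag to on while allowing an explicit opt-out gives users a fallback to the original glob-based read path if the new index misbehaves.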
[GitHub] [hudi] aditiwari01 edited a comment on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution
aditiwari01 edited a comment on issue #2675: URL: https://github.com/apache/hudi/issues/2675#issuecomment-806450032 > * do you use the RowBasedSchemaProvider and hence can't explicitly provide schema? If you were to use your own schema registry, you might as well provide an updated schema to hudi while writing. > * got it. would be nice to have some contribution. I can help review the patch. > In the meantime, I will give schema evolution a try on my end with some local setup. 1. Not sure how to pass a row schema with defaults to the spark writer (if at all it's possible). 2. The idea for the patch is to update the avro schema with a default of null for all nullable fields. If you wish, you can refer to the function **getAvroSchemaWithDefs** in my local branch: https://github.com/aditiwari01/hudi/commit/68a7ad20cc570f88940f2481d0ae19986254f1d8#diff-21ccee08cebace9de801e104f356bf4333c017d5d5c0cac4e3de63c7861f3c13R58 This is a bit unstructured though. Will try to have a formal PR with tests and all by next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] aditiwari01 commented on issue #2675: [SUPPORT] Unable to query MOR table after schema evolution
aditiwari01 commented on issue #2675: URL: https://github.com/apache/hudi/issues/2675#issuecomment-806450032 > * do you use the RowBasedSchemaProvider and hence can't explicitly provide schema? If you were to use your own schema registry, you might as well provide an updated schema to hudi while writing. > * got it. would be nice to have some contribution. I can help review the patch. > In the meantime, I will give schema evolution a try on my end with some local setup. 1. Not sure how to pass a row schema with defaults to the spark writer (if at all it's possible). 2. The idea for the patch is to update the avro schema with a default of null for all nullable fields. If you wish, you can refer to the function **getAvroSchemaWithDefs** in my local branch: https://github.com/aditiwari01/hudi/commit/68a7ad20cc570f88940f2481d0ae19986254f1d8#diff-21ccee08cebace9de801e104f356bf4333c017d5d5c0cac4e3de63c7861f3c13R58 This is a bit unstructured though. Will try to have a formal PR with tests and all by next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
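For context, an Avro schema is plain JSON, so the null-defaulting idea discussed above can be sketched language-neutrally. This is an illustrative sketch operating on a parsed schema, not the actual `getAvroSchemaWithDefs` from the linked branch:

```python
def with_null_defaults(schema: dict) -> dict:
    """Return a copy of an Avro record schema (as parsed JSON) where every
    nullable field (a union containing "null") gets an explicit null default.
    Avro requires "null" first in the union for a null default, so the
    union branches are reordered accordingly."""
    out = dict(schema)
    fields = []
    for f in schema.get("fields", []):
        f = dict(f)
        t = f["type"]
        if isinstance(t, list) and "null" in t:
            # move "null" to the front of the union, then default to null
            f["type"] = ["null"] + [b for b in t if b != "null"]
            f.setdefault("default", None)
        fields.append(f)
    out["fields"] = fields
    return out
```

Fields that are not nullable unions are left untouched, so records written with the old schema remain readable.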
[GitHub] [hudi] pengzhiwei2018 commented on pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on pull request #2645: URL: https://github.com/apache/hudi/pull/2645#issuecomment-806444132 > Can we also handle DELETE in this PR itself? That way, we have some basic support for all of the major DMLs Yes, I will add the implementation of DELETE & UPDATE to the PR too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601154948 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java ## @@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) { return this; } +public PropertyBuilder setTableSchema(String tableSchema) { + this.tableSchema = tableSchema; + return this; +} + +public PropertyBuilder setRowKeyFields(String rowKeyFields) { Review comment: Hi @vinothchandar , I am afraid we cannot do this for sql. The common way to specify the primary key in sql is by the row key fields, just like most databases do. I think we should not provide a generator class for users to specify the primary key in sql. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601160434 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSparkSessionExtension.scala ## @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hudi + +import org.apache.spark.SPARK_VERSION +import org.apache.spark.sql.SparkSessionExtensions +import org.apache.spark.sql.hudi.analysis.HoodieAnalysis +import org.apache.spark.sql.hudi.parser.HoodieSqlParser + +/** + * The Hoodie SparkSessionExtension for extending the syntax and adding the rules. + */ +class HoodieSparkSessionExtension extends (SparkSessionExtensions => Unit) { + override def apply(extensions: SparkSessionExtensions): Unit = { +if (SPARK_VERSION.startsWith("2.")) { Review comment: Hi @vinothchandar yes, currently we only add the parser for spark2, because spark3 already supports the merge/delete syntax. Hi @xiarixiaoyao , thanks for your suggestion. I will review this for spark3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601158842 ## File path: hudi-spark-datasource/hudi-spark2/src/main/antlr4/imports/SqlBase.g4 ## @@ -0,0 +1,1099 @@ +/* + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + * This file is an adaptation of Presto's presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4 grammar. Review comment: Hi @vinothchandar , currently the sql parser is only for spark2, so I add the sql parser in the `hudi-spark2` project only. For spark3, as you have mentioned, it already supports the delete/merge syntax, so we do not need to extend the sql parser. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601154948 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java ## @@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) { return this; } +public PropertyBuilder setTableSchema(String tableSchema) { + this.tableSchema = tableSchema; + return this; +} + +public PropertyBuilder setRowKeyFields(String rowKeyFields) { Review comment: Hi @vinothchandar , I am afraid we cannot do this for sql. The common way to specify the primary key in sql is by the row key fields. I think we should not provide a generator class for users to specify the primary key in sql. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601149509 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/UuidKeyGenerator.java ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.keygen; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.UUID; +import java.util.stream.Collectors; +import org.apache.avro.generic.GenericRecord; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.keygen.constant.KeyGeneratorOptions; + +/** + * A KeyGenerator which uses a uuid as the record key. + */ +public class UuidKeyGenerator extends BuiltinKeyGenerator { Review comment: This is used for `InsertInto` on a hudi table without a primary key. Currently, we must return a `RecordKey` for each record when writing to hudi. If the user has not specified a primary key in the `DDL`, I provide a default `UuidKeyGenerator` which generates a uuid for the record. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
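The fallback described above, assigning a synthetic key when the DDL declares no primary key, can be sketched as follows. This is an illustrative Python sketch of the idea, not the actual Java `UuidKeyGenerator` class:

```python
import uuid

class UuidKeyGenerator:
    """Assigns a random UUID as the record key when the table declares no
    primary key, so every incoming record still gets a unique key."""

    def __init__(self, partition_field: str):
        self.partition_field = partition_field

    def get_key(self, record: dict) -> tuple:
        # No natural key available: generate a fresh UUID per record.
        record_key = str(uuid.uuid4())
        partition = str(record.get(self.partition_field, ""))
        return (record_key, partition)
```

A consequence of this design is that the same input row inserted twice produces two distinct records, which is the expected insert-only semantics for a keyless table.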
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601152600 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java ## @@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) { return this; } +public PropertyBuilder setTableSchema(String tableSchema) { Review comment: The `tableSchema` saved to `hoodie.properties` is the initial schema when we create the table with the DDL. After that, the schema can still change with each commit; that schema is currently stored in the commit files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
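The schema-resolution order described above (newest commit file wins, create-time schema from `hoodie.properties` as the fallback) amounts to something like this; the function name and data shapes are illustrative assumptions, not the Hudi API:

```python
def resolve_latest_schema(create_schema: str, commit_schemas: list) -> str:
    """Prefer the schema recorded in the most recent commit file; fall back to
    the create-time schema persisted in hoodie.properties.
    `commit_schemas` is assumed ordered oldest -> newest and may be empty
    (e.g. for a freshly created table with no commits yet)."""
    return commit_schemas[-1] if commit_schemas else create_schema
```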
[GitHub] [hudi] hddong opened a new pull request #1946: [HUDI-1176] Upgrade to log4j2
hddong opened a new pull request #1946: URL: https://github.com/apache/hudi/pull/1946 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *Now, some modules (like cli, utilities) use log4j2, and it cannot correctly load the config file, failing with the error log: `ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console`.* ## Brief change log *(for example:)* - *Support log4j2 config* ## Verify this pull request This pull request is a trivial rework / code cleanup without any test coverage. ## Committer checklist - [X] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
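The quoted `StatusLogger` error typically means no `log4j2.xml` (or `log4j2.properties`) was found on the classpath. A minimal console-only configuration looks roughly like the following; this is a generic log4j2 sketch, not the file added by the PR:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```

Placing such a file on the module's classpath (e.g. `src/main/resources`) makes log4j2 stop falling back to its errors-only default configuration.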
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601149509 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/UuidKeyGenerator.java ## @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.keygen; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.UUID; +import java.util.stream.Collectors; +import org.apache.avro.generic.GenericRecord; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.keygen.constant.KeyGeneratorOptions; + +/** + * A KeyGenerator which uses a uuid as the record key. + */ +public class UuidKeyGenerator extends BuiltinKeyGenerator { Review comment: This is used for `InsertInto` on a hudi table without a primary key. Currently, we must return a `RecordKey` for each record when writing to hudi. If the user has not specified a primary key in the `DDL`, I provide a default `UuidKeyGenerator` which generates a uuid for the record. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601147731 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java ## @@ -54,9 +53,18 @@ public abstract class HoodieWriteHandle extends HoodieIOHandle { private static final Logger LOG = LogManager.getLogger(HoodieWriteHandle.class); + /** + * The input schema of the incoming dataframe. + */ + protected final Schema inputSchema; Review comment: For the case of MergeInto with a custom `HoodiePayload`, the input schema may differ from the write schema, as there is transformation logic in the `payload`. So I distinguish between the input and write schemas. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support
pengzhiwei2018 commented on a change in pull request #2645: URL: https://github.com/apache/hudi/pull/2645#discussion_r601145258 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java ## @@ -394,4 +405,36 @@ public IOType getIOType() { public HoodieBaseFile baseFileForMerge() { return baseFileToMerge; } + + /** + * A special record returned by {@link HoodieRecordPayload}, which means + * {@link HoodieMergeHandle} should just skip this record. + */ + private static class IgnoreRecord implements GenericRecord { Review comment: For the `MergeInto` statement, if a record does not match the conditions, the `ExpressionPayload` should filter it out. e.g. Merge Into d0 using ( select 1 as id, 'a1' as name ) s0 on d0.id = s0.id when matched and s0.id % 2 = 0 then update set * The input `(1, 'a1')` will be filtered out by the match condition `id % 2 = 0`. In our implementation, we push all the conditions and update expressions into the `ExpressionPayload`, so the `Payload` must have the ability to filter records. Currently, returning `Option.empty` means "DELETE" for a record, so I add a special record, `IgnoreRecord`, to represent a filtered record. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
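The three-way merge decision being described (update vs. delete vs. skip, with a sentinel standing in for "skip" because the empty result already means delete) can be sketched as follows. This is an illustrative Python sketch of the idea, not the actual `ExpressionPayload`/`IgnoreRecord` implementation:

```python
# Sentinel meaning "leave the existing record untouched"; distinct from
# None, which is reserved to mean DELETE in this three-way scheme.
IGNORE_RECORD = object()

def combine_and_get_update_value(old: dict, new: dict, matched_cond, update_expr):
    """Merge decision for a matched record in MERGE INTO:
    - matched condition true  -> return the updated record
    - matched condition false -> return IGNORE_RECORD (skip, keep old as-is)"""
    if matched_cond(new):
        return update_expr(old, new)
    return IGNORE_RECORD
```

With the example from the comment, the incoming record `(1, 'a1')` fails `id % 2 = 0`, so the payload returns the sentinel rather than an empty result, and the merge handle keeps the existing record instead of deleting it.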
[GitHub] [hudi] hddong closed pull request #1946: [HUDI-1176] Upgrade to log4j2
hddong closed pull request #1946: URL: https://github.com/apache/hudi/pull/1946 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu commented on a change in pull request #2325: [HUDI-699]Fix CompactionCommand and add unit test for CompactionCommand
wangxianghu commented on a change in pull request #2325: URL: https://github.com/apache/hudi/pull/2325#discussion_r601092940 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java ## @@ -254,4 +288,19 @@ private int getArchivedFileSuffix(FileStatus f) { return 0; } } + + @Override + public HoodieDefaultTimeline getCommitsAndCompactionTimeline() { +// filter in-memory instants +Set validActions = CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION, COMPACTION_ACTION, REPLACE_COMMIT_ACTION); +return new HoodieDefaultTimeline(getInstants().filter(i -> +readCommits.keySet().contains(i.getTimestamp())) +.filter(s -> validActions.contains(s.getAction())), details); + } + + public HoodieArchivedTimeline filterArchivedCompactionInstant() { +// filter INFLIGHT compaction instants +return new HoodieArchivedTimeline(this.metaClient, getInstants().filter(i -> +i.isInflight() && i.getAction().equals(HoodieTimeline.COMPACTION_ACTION))); + } Review comment: This method seems not used anywhere? 
## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java ## @@ -108,6 +120,14 @@ public void loadInstantDetailsInMemory(String startTs, String endTs) { loadInstants(startTs, endTs); } + public void loadCompactionDetailsInMemory(String startTs, String endTs) { +// load compactionPlan +loadInstants(new TimeRangeFilter(startTs, endTs), true, record -> + record.get(ACTION_TYPE_KEY).toString().equals(HoodieTimeline.COMPACTION_ACTION) +&& HoodieInstant.State.INFLIGHT.toString().equals(record.get("actionState").toString()) Review comment: "actionState" -> ACTION_STATE ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java ## @@ -175,25 +174,26 @@ public String compactionShowArchived( HoodieArchivedTimeline archivedTimeline = client.getArchivedTimeline(); HoodieInstant instant = new HoodieInstant(HoodieInstant.State.COMPLETED, HoodieTimeline.COMPACTION_ACTION, compactionInstantTime); -String startTs = CommitUtil.addHours(compactionInstantTime, -1); -String endTs = CommitUtil.addHours(compactionInstantTime, 1); Review comment: if we want to load a `ts` equals `compactionInstantTime`, can we add a new method that takes only one `instantTime` as input param? WDYT ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java ## @@ -175,25 +174,26 @@ public String compactionShowArchived( HoodieArchivedTimeline archivedTimeline = client.getArchivedTimeline(); HoodieInstant instant = new HoodieInstant(HoodieInstant.State.COMPLETED, HoodieTimeline.COMPACTION_ACTION, compactionInstantTime); -String startTs = CommitUtil.addHours(compactionInstantTime, -1); -String endTs = CommitUtil.addHours(compactionInstantTime, 1); Review comment: why remove this? 
## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineMetadataUtils.java ## @@ -176,4 +179,13 @@ public static HoodieReplaceCommitMetadata deserializeHoodieReplaceMetadata(byte[ ValidationUtils.checkArgument(fileReader.hasNext(), "Could not deserialize metadata of type " + clazz); return fileReader.next(); } + + public static T deserializeAvroRecordMetadata(byte[] bytes, Schema schema, Class clazz) + throws IOException { +return deserializeAvroRecordMetadata(HoodieAvroUtils.bytesToAvro(bytes, schema), schema, clazz); + } + + public static T deserializeAvroRecordMetadata(Object object, Schema schema, Class clazz) { Review comment: param `Class clazz` seems redundant ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java ## @@ -175,25 +174,26 @@ public String compactionShowArchived( HoodieArchivedTimeline archivedTimeline = client.getArchivedTimeline(); HoodieInstant instant = new HoodieInstant(HoodieInstant.State.COMPLETED, HoodieTimeline.COMPACTION_ACTION, compactionInstantTime); -String startTs = CommitUtil.addHours(compactionInstantTime, -1); -String endTs = CommitUtil.addHours(compactionInstantTime, 1); try { - archivedTimeline.loadInstantDetailsInMemory(startTs, endTs); - HoodieCompactionPlan compactionPlan = TimelineMetadataUtils.deserializeCompactionPlan( - archivedTimeline.getInstantDetails(instant).get()); + archivedTimeline.loadCompactionDetailsInMemory(compactionInstantTime, compactionInstantTime); + HoodieCompactionPlan compactionPlan = TimelineMetadataUtils.deserializeAvroRecordMetadata( + archivedTimeline.getInstantDetails(instant).get(), HoodieCompactionPlan.getClassSchema(), + HoodieCompactionPlan.class);
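The `loadCompactionDetailsInMemory` change under review filters archived instants down to INFLIGHT compactions within a time range; the filter amounts to something like the following Python sketch over plain records (the record keys are assumptions, not the actual `HoodieArchivedTimeline` API):

```python
def load_compaction_details(records, start_ts, end_ts):
    """Keep archived records that are INFLIGHT compactions whose timestamp
    falls in [start_ts, end_ts]. Each record is assumed to be a dict with
    'ts', 'actionType' and 'actionState' keys."""
    return [r for r in records
            if start_ts <= r["ts"] <= end_ts
            and r["actionType"] == "compaction"
            and r["actionState"] == "INFLIGHT"]
```

The review's suggestion of a single-instant overload then reduces to calling this with `start_ts == end_ts == compactionInstantTime`.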
[GitHub] [hudi] prashantwason commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…
prashantwason commented on pull request #2334: URL: https://github.com/apache/hudi/pull/2334#issuecomment-806413911 So, to rephrase the description, this solves the case where the input data has a field (int) compatible with the field being written in the table (long). Can this issue not be solved at the input-record level by converting the "int" data into a "long" before writing into HUDI? hoodieTable.getTableSchema() always returns the "latest" schema, which is the schema used by the last HoodieWriteClient (saved into commit instants). So when the "int"-based RDD is written, the table schema will no longer have a "long" field. When this table schema is used to read an older file in the table (merge during update handle), the read should fail, as a long (from parquet) cannot be converted to an int (from the schema). This is actually a backward-incompatible schema change and hence is not allowed by HUDI. @pengzhiwei2018 Can you add a test to verify my hypothesis? In your existing test in TestCOWDataSource, can you write a long to the table in the next write? Also, can you read all the data using the "int" schema, even the older records which contain a long? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
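The int-vs-long argument above follows Avro's schema-resolution rules: a reader may promote writer data upward (int to long, float, or double) but may never narrow it (long back to int). A minimal sketch of that one-way promotion check:

```python
# Numeric promotions Avro schema resolution allows: data written with the
# key type may be read with any type in the corresponding set.
PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
}

def readable_as(writer_type: str, reader_type: str) -> bool:
    """True when data written with writer_type can be read with reader_type.
    Non-numeric types are only readable as themselves in this sketch."""
    return reader_type in PROMOTIONS.get(writer_type, {writer_type})
```

This is why reading an older "long" record through an "int" table schema fails: the promotion only goes one way, which is the backward-incompatible case the comment calls out.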