[jira] [Updated] (HUDI-1720) when querying the incr view of a MOR table with many delete records using Spark SQL/Hive Beeline, StackOverflowError
     [ https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1720:
--------------------------------------
    Status: In Progress  (was: Open)

> when querying the incr view of a MOR table with many delete records using Spark SQL/Hive Beeline, StackOverflowError
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-1720
>                 URL: https://issues.apache.org/jira/browse/HUDI-1720
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Hive Integration, Spark Integration
>    Affects Versions: 0.7.0, 0.8.0
>            Reporter: tao meng
>            Assignee: tao meng
>            Priority: Major
>              Labels: pull-request-available, sev:critical, user-support-issues
>             Fix For: 0.9.0
>
>
> Currently, RealtimeCompactedRecordReader.next handles delete records by recursion, see:
> [https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]
> However, when the log file contains many delete records, this recursive logic in RealtimeCompactedRecordReader.next leads to a StackOverflowError.
>
> Test steps:
>
> Step 1:
> val df = spark.range(0, 100).toDF("keyid")
>   .withColumn("col3", expr("keyid + 1000"))
>   .withColumn("p", lit(0))
>   .withColumn("p1", lit(0))
>   .withColumn("p2", lit(7))
>   .withColumn("a1", lit(Array[String]("sb1", "rz")))
>   .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // bulk_insert 100w rows (keyid from 0 to 100)
> merge(df, 4, "default", "hive_9b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
>
> Step 2:
> val df = spark.range(0, 90).toDF("keyid")
>   .withColumn("col3", expr("keyid + 1000"))
>   .withColumn("p", lit(0))
>   .withColumn("p1", lit(0))
>   .withColumn("p2", lit(7))
>   .withColumn("a1", lit(Array[String]("sb1", "rz")))
>   .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // delete 90w rows (keyid from 0 to 90)
> delete(df, 4, "default", "hive_9b")
>
> Step 3: query on Beeline/Spark SQL:
> select count(col3) from hive_9b_rt
>
> 2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
> 2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
> 2021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError
>     at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
>     at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
>     at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
>     at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
>     at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
>     at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>     at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>     at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
>     at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>     at ...
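
The stack trace above grows because every skipped delete record adds one more RealtimeCompactedRecordReader.next frame. The following is a minimal, self-contained Java sketch (not the actual Hudi class: the base-file reader, record type, and delete lookup are stand-ins) contrasting that recursive skip with an iterative rewrite whose stack depth stays constant no matter how many consecutive deletes sit in the log file.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Optional;
    import java.util.function.Function;

    public class SkipDeletesSketch {

      private final Iterator<String> baseRecords;                 // stand-in for the parquet base-file reader
      private final Function<String, Optional<String>> logLookup; // stand-in for merged log records; empty = deleted

      SkipDeletesSketch(Iterator<String> baseRecords, Function<String, Optional<String>> logLookup) {
        this.baseRecords = baseRecords;
        this.logLookup = logLookup;
      }

      // Shape of the buggy reader: one extra stack frame per consecutive deleted record.
      Optional<String> nextRecursive() {
        if (!baseRecords.hasNext()) {
          return Optional.empty();
        }
        Optional<String> merged = logLookup.apply(baseRecords.next());
        if (!merged.isPresent()) {
          return nextRecursive(); // recursion: overflows once enough consecutive deletes are skipped
        }
        return merged;
      }

      // Iterative rewrite: constant stack depth regardless of how many deletes are skipped.
      Optional<String> nextIterative() {
        while (baseRecords.hasNext()) {
          Optional<String> merged = logLookup.apply(baseRecords.next());
          if (merged.isPresent()) {
            return merged;
          }
          // deleted record: continue the loop instead of recursing
        }
        return Optional.empty();
      }

      public static void main(String[] args) {
        // 900,000 consecutive deletes followed by one live record, the scale the "90w" comment suggests.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 900_000; i++) {
          keys.add("deleted-" + i);
        }
        keys.add("live");
        Function<String, Optional<String>> lookup =
            k -> k.equals("live") ? Optional.of(k) : Optional.empty();

        // Prints Optional[live]; calling nextRecursive() on the same input would throw
        // StackOverflowError with default JVM stack sizes.
        System.out.println(new SkipDeletesSketch(keys.iterator(), lookup).nextIterative());
      }
    }

Both variants return the same record; the difference is that skipping N deleted records costs N stack frames in the recursive form and a single frame in the loop.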
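
For completeness, merge(...) and delete(...) in the test steps are helpers from the reporter's test harness, not Hudi APIs. Below is a hedged sketch of what such a delete helper might do via the Hudi Spark datasource; the table name, base path, and the record-key/partition/precombine field choices are assumptions inferred from the DataFrame columns, not details stated in the ticket.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class DeleteHelperSketch {

      // Issues a Hudi "delete" write for the record keys carried in df (a guess at what the
      // repro's delete(df, ...) helper does; the field names below are assumptions).
      static void deleteRows(Dataset<Row> df, String basePath) {
        df.write().format("hudi")
            .option("hoodie.datasource.write.operation", "delete")
            .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
            .option("hoodie.table.name", "hive_9b")
            .option("hoodie.datasource.write.recordkey.field", "keyid")
            .option("hoodie.datasource.write.partitionpath.field", "p,p1,p2")
            .option("hoodie.datasource.write.precombine.field", "col3")
            .mode(SaveMode.Append) // on a MOR table the deletes land in the log files as delete records
            .save(basePath);
      }
    }

On a MOR table such a write appends delete records to the log files, which is exactly the state that drives the recursive skip above into a StackOverflowError before compaction runs.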
[jira] [Updated] (HUDI-1720) when querying the incr view of a MOR table with many delete records using Spark SQL/Hive Beeline, StackOverflowError
     [ https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1720:
--------------------------------------
    Labels: pull-request-available sev:critical user-support-issues  (was: pull-request-available)
[jira] [Updated] (HUDI-1720) when querying the incr view of a MOR table with many delete records using Spark SQL/Hive Beeline, StackOverflowError
     [ https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1720:
---------------------------------
    Labels: pull-request-available  (was: )