[jira] [Updated] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-04-19 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1720:
--
Status: In Progress  (was: Open)

> when query incr view of  mor table which has many delete records use 
> sparksql/hive-beeline,  StackOverflowError
> ---
>
> Key: HUDI-1720
> URL: https://issues.apache.org/jira/browse/HUDI-1720
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration, Spark Integration
> Affects Versions: 0.7.0, 0.8.0
> Reporter: tao meng
> Assignee: tao meng
> Priority: Major
> Labels: pull-request-available, sev:critical, user-support-issues
> Fix For: 0.9.0
>
>
> Currently, RealtimeCompactedRecordReader.next handles delete records by recursion, see:
> [https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]
> However, when the log file contains many delete records, this recursion in RealtimeCompactedRecordReader.next leads to a StackOverflowError.
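> For illustration only (this is not the actual Hudi patch): the JVM does not eliminate tail calls, so re-invoking next() once per skipped delete record adds one stack frame per consecutive delete. A self-contained Scala sketch of the iterative alternative, modeling records as Option values where None plays the role of a delete record:
>
> object SkipDeletesIteratively {
>   // Iterative skip: constant stack depth no matter how many consecutive
>   // deletes (None) precede the next live record (Some). The problematic code
>   // instead re-invokes next() once per delete record it has to skip.
>   def nextLive(records: Iterator[Option[Int]]): Option[Int] = {
>     while (records.hasNext) {
>       val rec = records.next()
>       if (rec.isDefined) return rec   // live record found
>     }
>     None                              // input exhausted: everything was deleted
>   }
>
>   def main(args: Array[String]): Unit = {
>     // 900,000 deletes followed by one live record, mimicking the repro below.
>     val stream = Iterator.fill(900000)(Option.empty[Int]) ++ Iterator(Some(42))
>     println(nextLive(stream))         // Some(42), with no deep call stack
>   }
> }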
> Test steps:
> Step 1:
> val df = spark.range(0, 100).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // bulk_insert 100w (1,000,000) rows (keyid from 0 to 100)
> merge(df, 4, "default", "hive_9b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
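> The merge(...) helper above is not defined in the ticket; it is presumably a test utility that writes the DataFrame to a Hudi MOR table. A hypothetical Scala stand-in using standard Hudi write options is sketched below; the record key "keyid", precombine field "col3", partition fields p,p1,p2, the key generator and the base path are assumptions, not taken from the ticket:
>
> import org.apache.spark.sql.{DataFrame, SaveMode}
>
> def merge(df: DataFrame, parallelism: Int, db: String, table: String,
>           tableType: String, op: String): Unit = {
>   df.write.format("hudi")
>     .option("hoodie.table.name", table)
>     .option("hoodie.datasource.write.table.type", tableType)        // MERGE_ON_READ
>     .option("hoodie.datasource.write.operation", op)                // "bulk_insert" here
>     .option("hoodie.datasource.write.recordkey.field", "keyid")
>     .option("hoodie.datasource.write.precombine.field", "col3")
>     .option("hoodie.datasource.write.partitionpath.field", "p,p1,p2")
>     .option("hoodie.datasource.write.keygenerator.class",
>             "org.apache.hudi.keygen.ComplexKeyGenerator")           // multi-field partition path
>     .option("hoodie.bulkinsert.shuffle.parallelism", parallelism.toString)
>     .option("hoodie.upsert.shuffle.parallelism", parallelism.toString)
>     .mode(SaveMode.Append)
>     .save(s"/tmp/$db/$table")                                       // assumed base path
> }
>
> The real helper presumably also enables Hive sync so that the hive_9b_ro/hive_9b_rt views queried in step 3 exist; those options are omitted here.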
> Step 2:
> val df = spark.range(0, 90).toDF("keyid")
>  .withColumn("col3", expr("keyid + 1000"))
>  .withColumn("p", lit(0))
>  .withColumn("p1", lit(0))
>  .withColumn("p2", lit(7))
>  .withColumn("a1", lit(Array[String]("sb1", "rz")))
>  .withColumn("a2", lit(Array[String]("sb1", "rz")))
> // delete 90w (900,000) rows (keyid from 0 to 90)
> delete(df, 4, "default", "hive_9b")
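> Likewise, delete(...) is not shown in the ticket. A hypothetical stand-in (same assumed table layout and imports as the sketch above) that issues a Hudi delete for every keyid in df, which is what fills the MOR log file with the delete records that later trigger the deep recursion on read:
>
> def delete(df: DataFrame, parallelism: Int, db: String, table: String): Unit = {
>   df.write.format("hudi")
>     .option("hoodie.table.name", table)
>     .option("hoodie.datasource.write.operation", "delete")          // turn each row into a delete record
>     .option("hoodie.datasource.write.recordkey.field", "keyid")
>     .option("hoodie.datasource.write.precombine.field", "col3")
>     .option("hoodie.datasource.write.partitionpath.field", "p,p1,p2")
>     .option("hoodie.datasource.write.keygenerator.class",
>             "org.apache.hudi.keygen.ComplexKeyGenerator")
>     .option("hoodie.delete.shuffle.parallelism", parallelism.toString)
>     .mode(SaveMode.Append)
>     .save(s"/tmp/$db/$table")
> }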
> Step 3:
> Query via beeline/spark-sql: select count(col3) from hive_9b_rt
> 2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:1038
> 2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:1038
> 2021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError
>  at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
>  at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
>  at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
>  at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
>  at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
>  at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
>  at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
>  at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
>  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
>  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
>  at
>

[jira] [Updated] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-28 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1720:
--
Labels: pull-request-available sev:critical user-support-issues  (was: pull-request-available)


[jira] [Updated] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1720:
-
Labels: pull-request-available  (was: )
