[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769338#comment-17769338 ]

Thomas Graves commented on SPARK-44940:
---------------------------------------

I noticed this went into 3.5.0 (https://github.com/apache/spark/commits/v3.5.0) so updating the fixed versions.

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness, pull-request-available
>             Fix For: 3.4.2, 3.5.1
>
>
> Follow-up on https://issues.apache.org/jira/browse/SPARK-40646.
> I found that JSON parsing is significantly slower due to exception creation in control flow. Also, some fields are not parsed correctly and the exception is thrown in certain cases:
> {code:java}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51)
>     at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51)
>     at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590)
>     ... 39 more
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761892#comment-17761892 ]

Dongjoon Hyun commented on SPARK-44940:
---------------------------------------

This is backported to branch-3.4 via https://github.com/apache/spark/pull/42792

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.4.2, 3.5.1
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761689#comment-17761689 ]

Snoot.io commented on SPARK-44940:
----------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42792

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761681#comment-17761681 ]

Snoot.io commented on SPARK-44940:
----------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42790

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761679#comment-17761679 ]

Snoot.io commented on SPARK-44940:
----------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42790

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760708#comment-17760708 ]

Snoot.io commented on SPARK-44940:
----------------------------------

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42667

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
[ https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758792#comment-17758792 ]

Ivan Sadikov commented on SPARK-44940:
--------------------------------------

Opened https://github.com/apache/spark/pull/42667.

> Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
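
The issue description attributes the slowdown to "exception creation in control flow". As a rough, self-contained illustration of that general JVM pattern (this is not Spark's parser code and not the change made in the linked pull requests), the sketch below skips malformed tokens in two ways: by throwing and catching an exception per bad token, and by returning an empty Optional. Every class and method name here is hypothetical and chosen only for the example.

```java
import java.util.List;
import java.util.Optional;

class PartialResultsSketch {
    // Hypothetical failure signal. Constructing it normally captures a stack trace.
    static final class BadFieldException extends RuntimeException {
        BadFieldException(String message) { super(message); }
    }

    // Accepts only short, non-negative decimal tokens so Integer.parseInt cannot fail.
    public static boolean isSmallNumber(String token) {
        if (token.isEmpty() || token.length() > 9) return false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') return false;
        }
        return true;
    }

    // Variant 1: a malformed token is reported by throwing.
    public static int parseOrThrow(String token) {
        if (!isSmallNumber(token)) throw new BadFieldException("bad field: " + token);
        return Integer.parseInt(token);
    }

    // Variant 2: a malformed token is reported through the return value.
    public static Optional<Integer> parseOrEmpty(String token) {
        return isSmallNumber(token) ? Optional.of(Integer.parseInt(token)) : Optional.empty();
    }

    // Sums the valid tokens, using an exception per bad token as control flow.
    public static long sumWithExceptions(List<String> tokens) {
        long sum = 0;
        for (String t : tokens) {
            try {
                sum += parseOrThrow(t);
            } catch (BadFieldException e) {
                // Skip the bad token and keep the partial result.
            }
        }
        return sum;
    }

    // Sums the valid tokens without creating any exception objects.
    public static long sumWithOptionals(List<String> tokens) {
        long sum = 0;
        for (String t : tokens) {
            sum += parseOrEmpty(t).orElse(0);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("1", "x", "22", "", "333");
        System.out.println(sumWithExceptions(tokens)); // prints 356
        System.out.println(sumWithOptionals(tokens));  // prints 356
    }
}
```

Both variants give the same partial result; they differ only in how a bad token is signalled. On the JVM a newly constructed exception normally fills in its stack trace, so the first variant tends to cost more when malformed fields are frequent. How much more depends on stack depth and JIT behaviour and would need to be measured.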