[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-26 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769338#comment-17769338
 ] 

Thomas Graves commented on SPARK-44940:
---

I noticed this went into 3.5.0
(https://github.com/apache/spark/commits/v3.5.0), so I am updating the fixed
versions.

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness, pull-request-available
>             Fix For: 3.4.2, 3.5.1
>
>
> Follow-up on https://issues.apache.org/jira/browse/SPARK-40646.
> I found that JSON parsing is significantly slower due to exception creation
> in control flow. Also, some fields are not parsed correctly, and in certain
> cases the following exception is thrown:
> {code:java}
> Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590)
>   ... 39 more
> {code}
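
The description attributes the slowdown to exception creation in control flow. The following is a minimal sketch of that general pattern, not Spark's parser code: every class and method name here (ControlFlowSketch, parseOrThrow, parseOrEmpty, and so on) is hypothetical and chosen only for illustration. It parses a record field by field and keeps partial results (null for a bad field) in two ways: one signals a bad field by throwing, the other returns an empty value instead.

```java
import java.util.Arrays;
import java.util.OptionalInt;

// Hypothetical illustration only; none of these names come from Spark.
final class ControlFlowSketch {

    // Accepts 1 to 9 ASCII digits, so Integer.parseInt cannot overflow.
    static boolean isSmallNonNegativeInt(String s) {
        if (s.isEmpty() || s.length() > 9) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Variant 1: a bad field is reported by throwing. Each failure
    // allocates an exception object and captures a stack trace.
    static int parseOrThrow(String raw) {
        String s = raw.trim();
        if (!isSmallNonNegativeInt(s)) {
            throw new IllegalArgumentException("bad field: " + raw);
        }
        return Integer.parseInt(s);
    }

    // Variant 2: a bad field is reported through the return value,
    // so no exception is created on the failure path.
    static OptionalInt parseOrEmpty(String raw) {
        String s = raw.trim();
        return isSmallNonNegativeInt(s)
            ? OptionalInt.of(Integer.parseInt(s))
            : OptionalInt.empty();
    }

    // Partial results via exceptions: catch per field, store null on failure.
    static Integer[] parseRecordWithExceptions(String[] fields) {
        Integer[] out = new Integer[fields.length];
        for (int i = 0; i < fields.length; i++) {
            try {
                out[i] = parseOrThrow(fields[i]);
            } catch (IllegalArgumentException e) {
                out[i] = null;
            }
        }
        return out;
    }

    // Partial results without exceptions: same output, no throw/catch.
    static Integer[] parseRecordWithoutExceptions(String[] fields) {
        Integer[] out = new Integer[fields.length];
        for (int i = 0; i < fields.length; i++) {
            OptionalInt v = parseOrEmpty(fields[i]);
            out[i] = v.isPresent() ? Integer.valueOf(v.getAsInt()) : null;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] fields = {"1", "x", " 42 ", ""};
        // Both lines print [1, null, 42, null]
        System.out.println(Arrays.toString(parseRecordWithExceptions(fields)));
        System.out.println(Arrays.toString(parseRecordWithoutExceptions(fields)));
    }
}
```

Both variants produce the same partial result. The difference is only in the failure path: in Java, constructing a Throwable normally fills in a stack trace, which tends to be expensive when many fields fail. How much this matters in practice depends on the data, and the sketch makes no claim about the actual change in the linked pull requests.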



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761892#comment-17761892
 ] 

Dongjoon Hyun commented on SPARK-44940:
---

This is backported to branch-3.4 via https://github.com/apache/spark/pull/42792

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.4.2, 3.5.1
>
>






[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-03 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761689#comment-17761689
 ] 

Snoot.io commented on SPARK-44940:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42792

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>






[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-03 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761681#comment-17761681
 ] 

Snoot.io commented on SPARK-44940:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42790

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>






[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-09-03 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761679#comment-17761679
 ] 

Snoot.io commented on SPARK-44940:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42790

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>






[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-08-30 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760708#comment-17760708
 ] 

Snoot.io commented on SPARK-44940:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/42667

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>






[jira] [Commented] (SPARK-44940) Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

2023-08-24 Thread Ivan Sadikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758792#comment-17758792
 ] 

Ivan Sadikov commented on SPARK-44940:
--

Opened https://github.com/apache/spark/pull/42667.

> Improve performance of JSON parsing when 
> "spark.sql.json.enablePartialResults" is enabled
> -
>
>                 Key: SPARK-44940
>                 URL: https://issues.apache.org/jira/browse/SPARK-44940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Ivan Sadikov
>            Priority: Major
>


