[ 
https://issues.apache.org/jira/browse/SPARK-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-28702.
--------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25503
[https://github.com/apache/spark/pull/25503]

> Display useful error message (instead of NPE) for invalid Dataset operations 
> (e.g. calling actions inside of transformations)
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28702
>                 URL: https://issues.apache.org/jira/browse/SPARK-28702
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Assignee: Shivu Sondur
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In Spark, SparkContext and SparkSession can only be used on the driver, not 
> on executors. For example, this means that you cannot call 
> {{someDataset.collect()}} inside of a Dataset or RDD transformation.
> When Spark serializes RDDs and Datasets, references to SparkContext and 
> SparkSession are null'ed out (by being marked as {{@transient}} or via the 
> Closure Cleaner). As a result, RDD and Dataset methods which reference 
> these driver-side-only objects (e.g. actions or transformations) will see 
> {{null}} references and may fail with a {{NullPointerException}}. For 
> example, in code which (via a chain of calls) tried to {{collect()}} a 
> dataset inside of a Dataset.map operation:
> {code:java}
> Caused by: java.lang.NullPointerException
> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
> at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
> [...]
> {code}
> The resulting NPE can be _very_ confusing to users.
> In SPARK-5063 I added some logic to throw clearer error messages when 
> performing similar invalid actions on RDDs. This ticket's scope is to 
> implement similar logic for Datasets.
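
The mechanism described in the quoted text (a transient driver-side reference that becomes null after serialization, and a guard that replaces the resulting NPE with a clearer error) can be sketched without Spark. The following is a minimal, standalone illustration using plain Java serialization. FakeSession, FakeDataset, roundTrip and the error message wording are all hypothetical stand-ins chosen for this sketch; they are not Spark classes and not the code from the pull request.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustration only: these classes are hypothetical stand-ins, not Spark APIs.
class TransientGuardDemo {

    // Stand-in for a driver-side-only object such as SparkSession.
    static class FakeSession {
        String describe() {
            return "driver session";
        }
    }

    // Stand-in for a Dataset whose session reference is dropped on serialization.
    static class FakeDataset implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient fields are skipped by Java serialization and come back as null.
        private final transient FakeSession session;

        FakeDataset(FakeSession session) {
            this.session = session;
        }

        // Unguarded: dereferences the field directly, so a deserialized copy throws a bare NPE.
        String unguardedAction() {
            return session.describe();
        }

        // Guarded: checks the field first and fails with an explanatory message.
        // The message wording is an assumption made for this sketch.
        String guardedAction() {
            if (session == null) {
                throw new IllegalStateException(
                    "This object can only be used on the driver; it looks like an action "
                    + "was invoked inside a transformation.");
            }
            return session.describe();
        }
    }

    // Serializes and deserializes a value in memory, mimicking shipment to an executor.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException("in-memory round trip failed", e);
        }
    }

    public static void main(String[] args) {
        FakeDataset onDriver = new FakeDataset(new FakeSession());
        System.out.println(onDriver.guardedAction());

        // After the round trip the transient session field is null.
        FakeDataset onExecutor = roundTrip(onDriver);
        try {
            onExecutor.unguardedAction();
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }
        try {
            onExecutor.guardedAction();
        } catch (IllegalStateException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

On the original object both methods succeed. On the deserialized copy the unguarded method fails with a bare NullPointerException, while the guarded one reports what the caller most likely did wrong, which is the kind of improvement this ticket asks for on Datasets.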



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

