[ https://issues.apache.org/jira/browse/SPARK-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-28702.
--------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25503
[https://github.com/apache/spark/pull/25503]

> Display useful error message (instead of NPE) for invalid Dataset operations (e.g. calling actions inside of transformations)
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28702
>                 URL: https://issues.apache.org/jira/browse/SPARK-28702
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Assignee: Shivu Sondur
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In Spark, SparkContext and SparkSession can only be used on the driver, not
> on executors. For example, this means that you cannot call
> {{someDataset.collect()}} inside of a Dataset or RDD transformation.
> When Spark serializes RDDs and Datasets, references to SparkContext and
> SparkSession are nulled out (by being marked as {{@transient}} or via the
> Closure Cleaner). As a result, RDD and Dataset methods which reference
> these driver-side-only objects (e.g. actions or transformations) will see
> {{null}} references and may fail with a {{NullPointerException}}. For
> example, in code which (via a chain of calls) tried to {{collect()}} a
> dataset inside of a Dataset.map operation:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
>   at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
>   at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
>   [...]
> {code}
> The resulting NPE can be _very_ confusing to users.
> In SPARK-5063 I added some logic to throw clearer error messages when
> performing similar invalid actions on RDDs. This ticket's scope is to
> implement similar logic for Datasets.
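
For readers unfamiliar with the mechanism described in the ticket, here is a
minimal plain-Java sketch of it. It is not Spark code and not the change made
in pull request 25503. The classes DriverSession, FakeDataset and
TransientGuardDemo are hypothetical stand-ins, and the wording of the error
message is invented for illustration. The sketch assumes only standard Java
serialization: a transient field comes back as null after a serialize and
deserialize round trip, so an unguarded method fails with a bare
NullPointerException, while a guarded method can explain the likely cause.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a driver-only object such as SparkSession.
// It is deliberately not Serializable.
class DriverSession {
    String run(String query) {
        return "result of " + query;
    }
}

// Hypothetical stand-in for a Dataset: the object itself is serializable,
// but its session reference is transient and is therefore not shipped.
class FakeDataset implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String query;
    private transient DriverSession session; // null after deserialization

    FakeDataset(String query, DriverSession session) {
        this.query = query;
        this.session = session;
    }

    // Unguarded: dereferences the session directly, so a deserialized copy
    // fails with a NullPointerException that says nothing about the cause.
    String collectUnguarded() {
        return session.run(query);
    }

    // Guarded: checks the reference first and reports a descriptive error.
    // The message text is an assumption, not Spark's actual message.
    String collectGuarded() {
        if (session == null) {
            throw new IllegalStateException(
                "This dataset has no session. It was probably used outside "
                + "the driver, e.g. an action was called inside a transformation.");
        }
        return session.run(query);
    }
}

class TransientGuardDemo {
    // Serializes and deserializes the dataset, imitating shipment to an executor.
    static FakeDataset roundTrip(FakeDataset ds) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(ds);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (FakeDataset) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        FakeDataset onDriver = new FakeDataset("SELECT 1", new DriverSession());
        System.out.println(onDriver.collectGuarded());

        FakeDataset onExecutor = roundTrip(onDriver);
        try {
            onExecutor.collectUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded call failed with " + e.getClass().getSimpleName());
        }
        try {
            onExecutor.collectGuarded();
        } catch (IllegalStateException e) {
            System.out.println("guarded call failed with: " + e.getMessage());
        }
    }
}
```

The guard turns a confusing failure deep in a call chain into a message that
points at the misuse. In user code, the usual way to avoid the problem
altogether is to run the inner action on the driver first and pass its plain
result into the transformation.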