[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345949#comment-16345949 ]

Apache Spark commented on SPARK-23157:

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/20443

> withColumn fails for a column that is a result of mapped DataSet
>
> Key: SPARK-23157
> URL: https://issues.apache.org/jira/browse/SPARK-23157
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: Tomasz Bartczak
> Priority: Minor
>
> Having
> {code:java}
> case class R(id: String)
> val ds = spark.createDataset(Seq(R("1")))
> {code}
> This works:
> {code}
> scala> ds.withColumn("n", ds.col("id"))
> res16: org.apache.spark.sql.DataFrame = [id: string, n: string]
> {code}
> but when we map over ds it fails:
> {code}
> scala> ds.withColumn("n", ds.map(a => a).col("id"))
> org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing from id#4 in operator !Project [id#4, id#55 AS n#57];;
> !Project [id#4, id#55 AS n#57]
> +- LocalRelation [id#4]
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1150)
>   at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905)
>   ... 48 elided
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344078#comment-16344078 ]

Apache Spark commented on SPARK-23157:

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/20429
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343962#comment-16343962 ]

Henry Robinson commented on SPARK-23157:

[~kretes] - I can see an argument for the behaviour you're describing, but that doesn't appear to be how the API is intended to work. As Sean says, there are far too many ways to shoot yourself in the foot if you can stitch together arbitrary Datasets like this when they are column-wise incompatible, and allowing the relatively small subset of cases where it would work would lead to a more confusing API, IMO.

The documentation for {{withColumn()}} could be updated to make this clearer; if I get a moment today I'll submit a PR.
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343279#comment-16343279 ]

Sean Owen commented on SPARK-23157:

Agree this should not work. You are selecting a column from a different Dataset. Having it happen to work because the number of cols matches or the function is the identity sounds like as much a way to write bugs as a convenience.
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343168#comment-16343168 ]

Tomasz Bartczak commented on SPARK-23157:

[~henryr] I see what you mean in your example. However, I would expect my example to work both with the identity and with any arbitrary function.

My expectations for this API come from e.g. pandas, where I can operate on columns, and if I have the same number of elements in a column, I can use it in multiple Datasets.

And since in my example the root Dataset is the same and there is no shuffling, filtering, or any other operation that may change the order or number of rows, adding a column that is the result of some operation on the same root Dataset should be possible.
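For readers who want the effect described in the comment above, one possible approach is to express the per-row operation as a column expression over the original Dataset rather than as a separate mapped Dataset. The sketch below is only an illustration under stated assumptions: the object name {{WithColumnSketch}}, the method {{addDerivedColumn}}, and the {{decorate}} function are hypothetical, and the code assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

// Hypothetical wrapper object, used only for this illustration.
object WithColumnSketch {
  // Same shape as the case class in the issue description.
  case class R(id: String)

  def addDerivedColumn(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val ds = spark.createDataset(Seq(R("1")))

    // Hypothetical per-row function, standing in for whatever the
    // map() call was meant to compute.
    val decorate = udf((id: String) => s"row-$id")

    // The new column is an expression over ds's own "id" attribute,
    // so it can be resolved against ds's logical plan.
    ds.withColumn("n", decorate(ds.col("id")))
  }
}
```

The design point is that the argument to {{withColumn()}} is built from {{ds.col("id")}}, not from a column of a different Dataset, which matches the requirement discussed in this thread.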
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340412#comment-16340412 ]

Henry Robinson commented on SPARK-23157:

I'm not sure if this should actually be expected to work. {{Dataset.map()}} will always return a dataset with a logical plan that's different to the original, so {{ds.map(a => a).col("id")}} has an expression that refers to an attribute ID that isn't produced by the original dataset. It seems like the requirement for {{ds.withColumn()}} is that the column argument is an expression over {{ds}}'s logical plan.

You get the same error doing the following, which is more explicit about these being two separate datasets.

{code:java}
scala> val ds = spark.createDataset(Seq(R("1")))
ds: org.apache.spark.sql.Dataset[R] = [id: string]

scala> val ds2 = spark.createDataset(Seq(R("1")))
ds2: org.apache.spark.sql.Dataset[R] = [id: string]

scala> ds.withColumn("id2", ds2.col("id"))
org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#113 missing from id#1 in operator !Project [id#1, id#113 AS id2#115]. Attribute(s) with the same name appear in the operation: id. Please check if the right attribute(s) are used.;;
!Project [id#1, id#113 AS id2#115]
+- LocalRelation [id#1]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:297)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3286)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1303)
  at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2185)
  at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2152)
  ... 49 elided
{code}

If the {{map}} function weren't the identity, would you expect this still to work?
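The attribute-ID argument above can be made concrete with a small toy model. This is a simplified sketch written for illustration and is not Spark's actual implementation: the names {{AttributeIdSketch}}, {{Attr}}, {{Plan}}, {{mapped}} and {{newPlan}} are all hypothetical. It only assumes, as the comment describes, that a mapped dataset produces fresh attribute IDs and that {{withColumn}} accepts only attributes produced by the plan it is called on.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy model only; these are NOT Spark classes.
object AttributeIdSketch {
  private val nextId = new AtomicLong(0)

  // An attribute is identified by a unique ID, not only by its name.
  final case class Attr(name: String, id: Long)

  // A plan is reduced here to the list of attributes it produces.
  final case class Plan(output: Seq[Attr]) {
    // Look up a column by name; the result carries this plan's attribute ID.
    def col(name: String): Attr =
      output.find(_.name == name).getOrElse(sys.error(s"no column $name"))

    // Stand-in for Dataset.map: same names, but always fresh attribute IDs.
    def mapped: Plan =
      Plan(output.map(a => Attr(a.name, nextId.incrementAndGet())))

    // Stand-in for withColumn: the source attribute must come from this plan.
    def withColumn(newName: String, source: Attr): Either[String, Plan] =
      if (output.contains(source))
        Right(Plan(output :+ Attr(newName, nextId.incrementAndGet())))
      else
        Left(s"resolved attribute(s) ${source.name}#${source.id} missing from " +
          output.map(a => s"${a.name}#${a.id}").mkString(","))
  }

  // Build a plan with fresh attribute IDs for the given column names.
  def newPlan(names: String*): Plan =
    Plan(names.map(n => Attr(n, nextId.incrementAndGet())))
}
```

In this model, {{ds.withColumn("n", ds.col("id"))}} yields a {{Right}}, while {{ds.withColumn("n", ds.mapped.col("id"))}} yields a {{Left}} even though the mapping keeps the same column name, because the mapped plan's {{id}} attribute carries a different ID.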