[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-16 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975733#comment-16975733
 ] 

sandeshyapuram commented on SPARK-29890:


[~imback82] This happens even for a normal join:
{noformat}
val p1 = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val p2 = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
p1.join(p2, Seq("nums"), "left")
  .na.fill(0).show
{noformat}
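
One possible workaround (a minimal sketch, assuming it is acceptable to rename the duplicated column on one side before the join so that na.fill(0) can resolve every column):
{code:java}
// Hedged workaround sketch: rename the non-key column on one side so the
// joined frame has no duplicate column names before calling na.fill.
val p1 = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val p2 = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")

p1.join(p2.withColumnRenamed("abc", "abc_right"), Seq("nums"), "left")
  .na.fill(0)
  .show
{code}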

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3, 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-15 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29890:
---
Affects Version/s: 2.4.3

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3, 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974087#comment-16974087
 ] 

sandeshyapuram edited comment on SPARK-29890 at 11/14/19 9:36 AM:
--

I've raised it as a bug because I feel na.fill(0) needs to fill 0 regardless of 
duplicate column names.

[~cloud_fan] Thoughts?


was (Author: sandeshyapuram):
I've raised it as a bug because I feel fill.na(0) needs to fill 0 regardless of 
duplicate column names.

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974087#comment-16974087
 ] 

sandeshyapuram commented on SPARK-29890:


I've raised it as a bug because I feel na.fill(0) needs to fill 0 regardless of 
duplicate column names.

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29890:
---
Description: 
Trying to fill na values with 0.
{noformat}
scala> :paste
// Entering paste mode (ctrl-D to finish)
val parent = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
  .na.fill(0).show
{noformat}
{noformat}
19/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
error looking up the name of group 820818257: No such file or directory
org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: 
abc, abc.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
  at 
org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
  ... 54 elided{noformat}
 

  was:
Trying to fill out na values with 0.
{noformat}
scala> :paste
// Entering paste mode (ctrl-D to finish)val parent = 
spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
.na.fill(0).show{noformat}
{noformat}
9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
error looking up the name of group 820818257: No such file or directory
org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: 
abc, abc.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
  at 
org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
  ... 54 elided{noformat}
 


> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val 

[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29890:
---
Description: 
Trying to fill out na values with 0.
{noformat}
scala> :paste
// Entering paste mode (ctrl-D to finish)val parent = 
spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
.na.fill(0).show{noformat}
{noformat}
9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
error looking up the name of group 820818257: No such file or directory
org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: 
abc, abc.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
  at 
org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
  ... 54 elided{noformat}
 

  was:
Trying to fill out na values with 0.
{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)val parent = 
spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
.na.fill(0).show{code}


> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> 

[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29890:
---
Description: 
Trying to fill out na values with 0.
{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)val parent = 
spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
.na.fill(0).show{code}

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)
sandeshyapuram created SPARK-29890:
--

 Summary: Unable to fill na with 0 with duplicate columns
 Key: SPARK-29890
 URL: https://issues.apache.org/jira/browse/SPARK-29890
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.3
 Environment: Trying to fill out na values with 0.
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)
val parent = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
  .na.fill(0).show

// Exiting paste mode, now interpreting.

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
19/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
error looking up the name of group 820818257: No such file or directory
org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: 
abc, abc.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
  at 
org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
  ... 54 elided
{code}
Reporter: sandeshyapuram






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-14 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29890:
---
Environment: (was: Trying to fill out na values with 0.
{code:java}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
  /_/Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.scala> :paste
// Entering paste mode (ctrl-D to finish)val parent = 
spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
val c1 = parent.filter(lit(true))
val c2 = parent.filter(lit(true))
c1.join(c2, Seq("nums"), "left")
.na.fill(0).show
// Exiting paste mode, now interpreting.ivysettings.xml file not found in 
HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
19/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
error looking up the name of group 820818257: No such file or directory
org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: 
abc, abc.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
  at 
org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
  at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
  at 
org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
  ... 54 elided
{code})

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3
>Reporter: sandeshyapuram
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-11-13 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973985#comment-16973985
 ] 

sandeshyapuram commented on SPARK-29682:


Thanks!

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.3
>Reporter: sandeshyapuram
>Assignee: Terry Kim
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> When I try to self join a parentDf with multiple childDf say childDf1 ... ... 
> where childDfs are derived after a cube or rollup and are filtered based on 
> group bys,
> I get and error 
> {{Failure when resolving conflicting references in Join: }}
> This shows a long error message which is quite unreadable. On the other hand, 
> if I replace cube or rollup with old groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
>   at 
> 

[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-11-01 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965216#comment-16965216
 ] 

sandeshyapuram commented on SPARK-29682:


[~imback82] & [~cloud_fan] Currently I've worked around this by renaming every 
column in the dataframes before performing the joins, and that works.

Let me know if you have a better workaround to deal with it.
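
For reference, a minimal sketch of that rename-based workaround against the sample code in the description (the new column names are illustrative):
{code:java}
// Hedged sketch: give one side of the join fresh column names so the two
// plans no longer share output attributes, then join on an explicit condition.
val group0Renamed = group0
  .withColumnRenamed("nums", "nums_g0")
  .withColumnRenamed("agcol", "agcol_g0")
  .withColumnRenamed("gid", "gid_g0")

cubeDF.select("nums").distinct
  .join(group0Renamed, col("nums") === col("nums_g0"), "inner")
  .drop("nums_g0")
  .show
{code}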

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> When I try to self join a parentDf with multiple childDf say childDf1 ... ... 
> where childDfs are derived after a cube or rollup and are filtered based on 
> group bys,
> I get and error 
> {{Failure when resolving conflicting references in Join: }}
> This shows a long error message which is quite unreadable. On the other hand, 
> if I replace cube or rollup with old groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> 

[jira] [Updated] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29682:
---
Component/s: Spark Shell
 Spark Core

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> When I try to self join a parentDf with multiple childDf say childDf1 ... ... 
> where childDfs are derived after a cube or rollup and are filtered based on 
> group bys,
> I get and error 
> {{Failure when resolving conflicting references in Join: }}
> This shows a long error message which is quite unreadable. On the other hand, 
> if I replace cube or rollup with old groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
> 

[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963789#comment-16963789
 ] 

sandeshyapuram commented on SPARK-29682:


I have reproduced this in spark-submit as well as in Qubole notebooks.

I'm not sure how I can provide you with a self-contained reproducer.

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> When I try to self join a parentDf with multiple childDf say childDf1 ... ... 
> where childDfs are derived after a cube or rollup and are filtered based on 
> group bys,
> I get and error 
> {{Failure when resolving conflicting references in Join: }}
> This shows a long error message which is quite unreadable. On the other hand, 
> if I replace cube or rollup with old groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
>   at 
> 

[jira] [Updated] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29682:
---
Description: 
When I try to self-join a parentDf with multiple childDfs, say childDf1 ... ..., 
where the childDfs are derived from a cube or rollup and are filtered based on 
the group-bys, I get an error:

{{Failure when resolving conflicting references in Join: }}

This produces a long error message which is quite unreadable. On the other hand, 
if I replace the cube or rollup with a plain groupBy, it works without issues.
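
For comparison, a minimal sketch of the groupBy-based variant that, per the above, resolves without the failure (same toy data as the sample code below; grouping_id() is omitted because it is only defined for cube/rollup):
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")

// groupBy in place of cube; only the max aggregate is kept.
val groupDF = numsDF
  .groupBy("nums")
  .agg(max(lit(0)).as("agcol"))

// The same self-join pattern as in the sample code below resolves fine here.
groupDF.select("nums").distinct
  .join(groupDF, Seq("nums"), "inner")
  .show
{code}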

 

*Sample code:* 
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")


val cubeDF = numsDF
.cube("nums")
.agg(
max(lit(0)).as("agcol"),
grouping_id().as("gid")
)

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))

cubeDF.printSchema
group0.printSchema
group1.printSchema


//Recreating cubeDf
cubeDF.select("nums").distinct
.join(group0, Seq("nums"), "inner")
.join(group1, Seq("nums"), "inner")
.show
{code}
*Sample output:*
{code:java}
numsDF: org.apache.spark.sql.DataFrame = [nums: int]
cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
field]
group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
agcol: int ... 1 more field]
group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
agcol: int ... 1 more field]
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
: +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
agcol#216, spark_grouping_id#218 AS gid#217]
:+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
[nums#212, nums#220, spark_grouping_id#218]
:   +- Project [nums#212, nums#212 AS nums#219]
:  +- Project [value#210 AS nums#212]
: +- SerializeFromObject [input[0, int, false] AS value#210]
:+- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
agcol#216, spark_grouping_id#218 AS gid#217]
  +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
[nums#212, nums#220, spark_grouping_id#218]
 +- Project [nums#212, nums#212 AS nums#219]
+- Project [value#210 AS nums#212]
   +- SerializeFromObject [input[0, int, false] AS value#210]
  +- ExternalRDD [obj#209]
Conflicting attributes: nums#220
;;
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
: +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
agcol#216, spark_grouping_id#218 AS gid#217]
:+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
[nums#212, nums#220, spark_grouping_id#218]
:   +- Project [nums#212, nums#212 AS nums#219]
:  +- Project [value#210 AS nums#212]
: +- SerializeFromObject [input[0, int, false] AS value#210]
:+- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
agcol#216, spark_grouping_id#218 AS gid#217]
  +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
[nums#212, nums#220, spark_grouping_id#218]
 +- Project [nums#212, nums#212 AS nums#219]
+- Project [value#210 AS nums#212]
   +- SerializeFromObject [input[0, int, false] AS value#210]
  +- ExternalRDD [obj#209]
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:96)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:109)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
  at 

[jira] [Updated] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29682:
---
Description: 
When I try to self-join a parentDf with multiple childDfs, say childDf1 ... ..., 
where the childDfs are derived from a cube or rollup and are filtered based on 
the group-bys, I get an error:

{{Failure when resolving conflicting references in Join: }}

This produces a long error message which is quite unreadable. On the other hand, 
if I replace the cube or rollup with a plain groupBy, it works without issues.

 

*Sample code:* 
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")val cubeDF = numsDF
.cube("nums")
.agg(
max(lit(0)).as("agcol"),
grouping_id().as("gid")
)

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))cubeDF.printSchema
group0.printSchema
group1.printSchema//Recreating cubeDf
cubeDF.select("nums").distinct
.join(group0, Seq("nums"), "inner")
.join(group1, Seq("nums"), "inner")
.show
{code}
*Sample output:*
{code:java}
numsDF: org.apache.spark.sql.DataFrame = [nums: int]
cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more field]
group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
:     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
:        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
:           +- Project [nums#212, nums#212 AS nums#219]
:              +- Project [value#210 AS nums#212]
:                 +- SerializeFromObject [input[0, int, false] AS value#210]
:                    +- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
         +- Project [nums#212, nums#212 AS nums#219]
            +- Project [value#210 AS nums#212]
               +- SerializeFromObject [input[0, int, false] AS value#210]
                  +- ExternalRDD [obj#209]
Conflicting attributes: nums#220
;;
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
:     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
:        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
:           +- Project [nums#212, nums#212 AS nums#219]
:              +- Project [value#210 AS nums#212]
:                 +- SerializeFromObject [input[0, int, false] AS value#210]
:                    +- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
         +- Project [nums#212, nums#212 AS nums#219]
            +- Project [value#210 AS nums#212]
               +- SerializeFromObject [input[0, int, false] AS value#210]
                  +- ExternalRDD [obj#209]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:96)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:109)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
  at 

[jira] [Updated] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29682:
---
Description: 
When I try to self-join a parentDf with multiple childDfs (say childDf1, ...),

where the childDfs are derived from a cube or rollup and then filtered on the grouping columns (the grouping_id in the sample below),

I get the error

{{Failure when resolving conflicting references in Join: }}

followed by a long error message that is hard to read. On the other hand, if I replace the cube or rollup with a plain groupBy, the same code works without issues.

 

*Sample code:*
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")

val cubeDF = numsDF
  .cube("nums")
  .agg(
    max(lit(0)).as("agcol"),
    grouping_id().as("gid")
  )

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))

cubeDF.printSchema
group0.printSchema
group1.printSchema

// Recreating cubeDf
cubeDF.select("nums").distinct
  .join(group0, Seq("nums"), "inner")
  .join(group1, Seq("nums"), "inner")
  .show
{code}
*Sample output:*
{code:java}
numsDF: org.apache.spark.sql.DataFrame = [nums: int]
cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more field]
group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
root
 |-- nums: integer (nullable = true)
 |-- agcol: integer (nullable = true)
 |-- gid: integer (nullable = false)
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
:     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
:        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
:           +- Project [nums#212, nums#212 AS nums#219]
:              +- Project [value#210 AS nums#212]
:                 +- SerializeFromObject [input[0, int, false] AS value#210]
:                    +- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
         +- Project [nums#212, nums#212 AS nums#219]
            +- Project [value#210 AS nums#212]
               +- SerializeFromObject [input[0, int, false] AS value#210]
                  +- ExternalRDD [obj#209]
Conflicting attributes: nums#220
;;
'Join Inner
:- Deduplicate [nums#220]
:  +- Project [nums#220]
:     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
:        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
:           +- Project [nums#212, nums#212 AS nums#219]
:              +- Project [value#210 AS nums#212]
:                 +- SerializeFromObject [input[0, int, false] AS value#210]
:                    +- ExternalRDD [obj#209]
+- Filter (gid#217 <=> 0)
   +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
      +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
         +- Project [nums#212, nums#212 AS nums#219]
            +- Project [value#210 AS nums#212]
               +- SerializeFromObject [input[0, int, false] AS value#210]
                  +- ExternalRDD [obj#209]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:96)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:109)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
  at 

[jira] [Updated] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeshyapuram updated SPARK-29682:
---
Description: 
When I try to self join a parentDf with multiple childDf say childDf1 ... ... 

where childDfs are derived after a cube or rollup and are filtered based on 
group bys,

I get and error 

{{Failure when resolving conflicting references in Join: }}

This shows a long error message which is quite unreadable. On the other hand, 
if I replace cube or rollup with old groupBy, it works without issues.

 

*Sample code:*
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")

val cubeDF = numsDF
  .cube("nums")
  .agg(
    max(lit(0)).as("agcol"),
    grouping_id().as("gid")
  )

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))

cubeDF.printSchema
group0.printSchema
group1.printSchema

// Recreating cubeDf
cubeDF.select("nums").distinct
  .join(group0, Seq("nums"), "inner")
  .join(group1, Seq("nums"), "inner")
  .show
{code}

  was:
When I try to self-join a parentDf with multiple childDfs (say childDf1, ...),

where the childDfs are derived from a cube or rollup and then filtered on the grouping columns (the grouping_id in the sample below),

I get the error

{{Failure when resolving conflicting references in Join: }}

followed by a long error message that is hard to read. On the other hand, if I replace the cube or rollup with a plain groupBy, the same code works without issues.

 

*Sample code:*
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")

val cubeDF = numsDF
  .cube("nums")
  .agg(
    max(lit(0)).as("agcol"),
    grouping_id().as("gid")
  )

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))

cubeDF.printSchema
group0.printSchema
group1.printSchema

// Recreating cubeDf
cubeDF.select("nums").distinct
  .join(group0, Seq("nums"), "inner")
  .join(group1, Seq("nums"), "inner")
  .show
{code}


> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> When I try to self-join a parentDf with multiple childDfs (say childDf1, ...),
> where the childDfs are derived from a cube or rollup and then filtered on the
> grouping columns (the grouping_id in the sample below),
> I get the error
> {{Failure when resolving conflicting references in Join: }}
> followed by a long error message that is hard to read. On the other hand, if I
> replace the cube or rollup with a plain groupBy, the same code works without issues.
>  
> *Sample code:*
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> 
> val cubeDF = numsDF
>   .cube("nums")
>   .agg(
>     max(lit(0)).as("agcol"),
>     grouping_id().as("gid")
>   )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> 
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> 
> // Recreating cubeDf
> cubeDF.select("nums").distinct
>   .join(group0, Seq("nums"), "inner")
>   .join(group1, Seq("nums"), "inner")
>   .show
> {code}






[jira] [Created] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-10-31 Thread sandeshyapuram (Jira)
sandeshyapuram created SPARK-29682:
--

 Summary: Failure when resolving conflicting references in Join:
 Key: SPARK-29682
 URL: https://issues.apache.org/jira/browse/SPARK-29682
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.3
Reporter: sandeshyapuram


When I try to self-join a parentDf with multiple childDfs (say childDf1, ...),

where the childDfs are derived from a cube or rollup and then filtered on the grouping columns (the grouping_id in the sample below),

I get the error

{{Failure when resolving conflicting references in Join: }}

followed by a long error message that is hard to read. On the other hand, if I replace the cube or rollup with a plain groupBy, the same code works without issues.

 

*Sample code:*
{code:java}
val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")

val cubeDF = numsDF
  .cube("nums")
  .agg(
    max(lit(0)).as("agcol"),
    grouping_id().as("gid")
  )

val group0 = cubeDF.filter(col("gid") <=> lit(0))
val group1 = cubeDF.filter(col("gid") <=> lit(1))

cubeDF.printSchema
group0.printSchema
group1.printSchema

// Recreating cubeDf
cubeDF.select("nums").distinct
  .join(group0, Seq("nums"), "inner")
  .join(group1, Seq("nums"), "inner")
  .show
{code}
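For reference, one commonly used mitigation for this class of self-join analysis failures is sketched below. It is not taken from this ticket and is only an assumption about a possible workaround: rebuilding each filtered child from its RDD and schema gives it fresh attribute ids, so the joins back onto cubeDF no longer share conflicting references. The names group0Fresh and group1Fresh are hypothetical.
{code:java}
// Hedged workaround sketch (assumption, not from this ticket):
// spark.createDataFrame(rdd, schema) produces a DataFrame whose columns
// carry fresh attribute ids, breaking the shared lineage that triggers
// "Failure when resolving conflicting references in Join:".
val group0Fresh = spark.createDataFrame(group0.rdd, group0.schema)
val group1Fresh = spark.createDataFrame(group1.rdd, group1.schema)

cubeDF.select("nums").distinct
  .join(group0Fresh, Seq("nums"), "inner")
  .join(group1Fresh, Seq("nums"), "inner")
  .show
{code}
The trade-off is that each rebuilt child is evaluated through its RDD, so this costs extra computation compared to a proper fix in the analyzer.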


