[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-09 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048262#comment-15048262
 ] 

Fengdong Yu commented on SPARK-12233:
-

I am using:
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))
and that is what hits the error.

But you are right: I cannot use the old DataFrame to select a column after the join.

This should be closed now. Thanks.
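
The working pattern, selecting by column name on the joined result rather than through the pre-join DataFrames, can be sketched as follows (a sketch only: it assumes a Spark 1.5 shell, with df3 and df4 being the two DataFrames joined in the select above; their definitions are not shown in this comment):

{code}
// Sketch: after a join, refer to columns by name on the joined result,
// not through the pre-join DataFrames.
val joined = df4.join(df3, Seq("int", "str1", "str2"))
val res = joined.select("int", "str1", "str2", "emailaddr")
{code}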

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>   at $iwC$$iwC$$iwC.<init>(<console>:59)
>   at $iwC$$iwC.<init>(<console>:61)
>   at $iwC.<init>(<console>:63)
>   at <init>(<console>:65)
>   at .<init>(<console>:69)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> 

[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048232#comment-15048232
 ] 

Xiao Li commented on SPARK-12233:
-

I am unable to reproduce your error. Can you run my example below and see whether you hit the error?

{code}
sqlContext.udf.register("lowercase", (s: String) => {
  if (null == s) "" else s.toLowerCase
})
val df1 = Seq(1, 2, 3).map(i => (i, i.toString, i.toString)).toDF("int", "str1", "str2")
df1.registerTempTable("testTable")
val df3 = sqlContext.sql("""SELECT lowercase(str2) AS emailaddr, int, str1, str2
                            FROM testTable""").distinct()
val df4 = df3.groupBy("int", "str1", "str2").count()
val res = df4.where("count > 1").drop("count")
  .join(df3, Seq("int", "str1", "str2"))
  .select("int", "str1", "str2", "emailaddr")
  .collect()
{code}


[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048079#comment-15048079
 ] 

holdenk commented on SPARK-12233:
-

Could you maybe show what happens with the "wrong" example? Also, it seems like 
some parts may have gotten lost (e.g. there are no quotes around the SQL 
statements and the brackets don't balance) - maybe double-check the repro example?

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> Background:
> two tables: 
> tableA(id string, name string, gender string)
> tableB(id string, name string)
> {code}
> val df1 = sqlContext.sql(select * from tableA)
> val df2 = sqlContext.sql(select * from tableB)
> //Wrong
> df1.join(df2, Seq("id", "name").select(df2("id"), df2("name"), df1("gender"))
> //Correct
> df1.join(df2, Seq("id", "name").select("id", "name", "gender")
> {code}
> 
> Cannot specify column of data frame for 'gender'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048087#comment-15048087
 ] 

Xiao Li commented on SPARK-12233:
-

Please post the error message you got. Thanks! 

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> Background:
> two tables: 
> tableA(id string, name string, gender string)
> tableB(id string, name string)
> {code}
> val df1 = sqlContext.sql("select * from tableA")
> val df2 = sqlContext.sql("select * from tableB")
> //Wrong
> df1.join(df2, Seq("id", "name")).select(df2("id"), df2("name"), df1("gender"))
> //Correct
> df1.join(df2, Seq("id", "name")).select("id", "name", "gender")
> {code}
> 
> Cannot specify column of data frame for 'gender'






[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048096#comment-15048096
 ] 

Xiao Li commented on SPARK-12233:
-

This is another self-join issue. I will check whether it is a known issue.


[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048186#comment-15048186
 ] 

Xiao Li commented on SPARK-12233:
-

{code}
val df1 = Seq(1, 2, 3).map(i => (i, i.toString, i.toString)).toDF("int", "str1", "str2")
val df2 = Seq(1, 2, 3).map(i => (i, i.toString)).toDF("int", "str1")

val res = df1.join(df2, Seq("int", "str1"))
  .select(df2("int"), df2("str1"), df1("str2"))
  .collect()
{code}

I reproduced it using your old example above. Let me explain why it fails: the join generates a new DataFrame, and you cannot use the old DataFrame's column references to select columns from the new one.
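
The distinction can be sketched as follows (a sketch only, assuming a Spark 1.5 shell with df1 and df2 defined as in the snippet above):

{code}
// May fail to resolve: after join(..., Seq("int", "str1")), the join columns
// from df2 are coalesced away, so df2("int") can reference an attribute that
// is no longer present in the joined output, and the analyzer reports it missing.
// df1.join(df2, Seq("int", "str1")).select(df2("int"), df2("str1"), df1("str2"))

// Works: select by name on the joined DataFrame itself.
df1.join(df2, Seq("int", "str1")).select("int", "str1", "str2")
{code}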

> Cannot specify a data frame column during join
> --
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Fengdong Yu
>Priority: Minor
>
> {code}
> sqlContext.udf.register("lowercase", (s: String) =>{
>   if (null == s) "" else s.toLowerCase
> })
> 
> sqlContext.udf.register("substr", (s: String) =>{
>   if (null == s) ""
>   else {
> val index = s.indexOf("@")
> if (index < 0) s else s.toLowerCase.substring(index + 1)}
> })
> 
> sqlContext.read.orc("/data/test/test.data")
> .registerTempTable("testTable")
> 
> val extracted = 
> sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, 
> lowercase(family_name) AS 
> family_name, 
> substr(email_address) AS domain, 
> lowercase(email_address) AS emailaddr,
> experience
>   
>  FROM testTable 
>  WHERE email_address != '' 
>  """)
>  .distinct
> 
> val count =
>  extracted.groupBy("given_name", "family_name", "domain")
>.count
> 
> count.where(count("count") > 1)
>  .drop(count("count"))
>  .join(extracted, Seq("given_name", "family_name", "domain"))
> {code}
> {color:red} .select(count("given_name"), count("family_name"), 
> extracted("emailaddr"))  {color}
> Red Font should be:
> {color:red} select("given_name", "family_name", "emailaddr") {color}
> 
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 
> missing from 
> given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490
>  in operator !Project [given_name#522,family_name#523,emailaddr#525];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC.(:61)
>   at $iwC.(:63)
>   at (:65)
>   at .(:69)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join

2015-12-08 Thread Fengdong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048091#comment-15048091
 ] 

Fengdong Yu commented on SPARK-12233:
-

Updated. [~smilegator] [~holdenk]
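For reference, the working pattern the thread converges on can be sketched as below (a sketch only, assuming Spark 1.5.x and the `count` and `extracted` DataFrames built in the issue description; not re-run here). A join produces a new DataFrame with fresh attribute IDs, so columns of the result must be selected by name rather than through the pre-join DataFrames:

{code}
count.where(count("count") > 1)
  .drop(count("count"))
  .join(extracted, Seq("given_name", "family_name", "domain"))
  // Select by name: these resolve against the joined plan.
  // extracted("emailaddr") would reference an attribute ID that no
  // longer exists in the join output, triggering the AnalysisException.
  .select("given_name", "family_name", "emailaddr")
  .show
{code}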
